diff --git a/readme.md b/readme.md
index 31d5442..fed68e3 100644
--- a/readme.md
+++ b/readme.md
@@ -1,63 +1,73 @@
## Evaluation of the Medical Segmentation Decathlon Challenge 2018

This repository contains the files for evaluation and analyses of the Medical Segmentation Decathlon (MSD) challenge, conducted at MICCAI 2018. It further serves for reproducing the results of the corresponding paper (https://arxiv.org/pdf/2106.05735.pdf).

-### Data preparation
+### Prerequisites

-Due to privacy reasons, the raw metric values for each participant can **not** be published. Nevertheless, the script `msd-prepare-data.R` contains the code to read and format the original MSD csv data matrix.
+The evaluation of the MSD challenge was performed with the programming language `R`, which can be downloaded from [https://www.r-project.org](https://www.r-project.org). Furthermore, the package `challengeR` was used to conduct the ranking analysis and requires a separate installation. The installation instructions can be found at [https://github.com/wiesenfa/challengeR](https://github.com/wiesenfa/challengeR).
+
+For the current results, `R` version 4.1.0, `challengeR` version 1.0.2, and the IDE `RStudio` were used.
+
+### 1. Data preparation
+
+**For privacy reasons, the raw metric values for each participant cannot be published.** Nevertheless, the script `msd-prepare-data.R` contains the code to read and format the original MSD csv data matrix.

The MSD data contains the Dice Similarity Coefficient (DSC) and Normalized Surface Dice (NSD) values for 19 algorithms participating in the 2018 MSD challenge. The metric scores were computed for two phases:

-#### Development phase (phase 1)
+#### 1.1 Development phase (phase 1)

The development phase included data challenge participants could use to train and test their algorithms aiming for generalization across all tasks. It contained seven tasks, each focusing on one to three target regions:

- Brain (Edema, non-enhancing tumor, enhancing tumor)
- Heart (Left atrium)
- Hippocampus (Anterior, Posterior)
- Liver (Liver, Liver tumor)
- Lung (Lung Tumor)
- Pancreas (Pancreas, Tumor mass)
- Prostate (PZ, TZ)

-#### Mystery phase (phase 2)
+#### 1.2 Mystery phase (phase 2)

The three mystery tasks were unknown to the challenge participants and used for final evaluation:

- Colon (Colon cancer primaries)
- Hepatic Vessel (Vessel, Tumor)
- Spleen (Spleen)

-### Descriptive statistics
+### 2. Descriptive statistics

-For every algorithm, phase and metric, descriptive statistics have been calculated in addition to the ranking calculations. The script `msd-descriptive-statistics.R` calculates the Median, 25%- and 75%-Percentiles and the Interquartile Range. The corresponding results were saved as csv files and can be found in the folder `descriptive-statistics`. In the corresponding papers, these values are presented in **Table 3**.
+For every algorithm, phase and metric, descriptive statistics have been calculated in addition to the ranking calculations, based on the data described in section 1 of this readme. The script `msd-descriptive-statistics.R` calculates the median, the 25% and 75% percentiles, and the interquartile range. The corresponding results were saved as csv files and can be found in the folder `descriptive-statistics`. In the corresponding paper, these values are presented in **Table 3**.
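+
+As an illustration, a minimal sketch (not the repository's `msd-descriptive-statistics.R` script) of how such statistics could be computed with base `R`, assuming a hypothetical long-format table with columns `algorithm`, `phase`, `metric` and `value`:
+
+```r
+# Hypothetical long-format input: one row per test case with the assumed
+# columns algorithm, phase, metric (DSC or NSD) and value.
+msd <- read.csv("msd-metric-values.csv")
+
+describe <- function(x) c(
+  median = median(x, na.rm = TRUE),
+  q25    = unname(quantile(x, 0.25, na.rm = TRUE)),
+  q75    = unname(quantile(x, 0.75, na.rm = TRUE)),
+  iqr    = IQR(x, na.rm = TRUE)
+)
+
+stats <- aggregate(value ~ algorithm + phase + metric, data = msd, FUN = describe)
+stats <- do.call(data.frame, stats)  # flatten the matrix column returned by aggregate()
+
+write.csv(stats, "descriptive-statistics/example-statistics.csv", row.names = FALSE)
+```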

-### Dots- and boxplots for the metric values
+### 3. Dots- and boxplots for the metric values

-Given the hierarchical structure of the data (data > task > region), we show the distribution of raw metric values (DSC and NSD) separately for every task and region. The script `msd-dot-boxplots.R` shows the raw metric scores as jittered dots and aggregated into boxplots for every algorithm for every task (rows) and region (separate boxplots). It uses the library `ggplot2`. The corresponding results were saved as png images and can be found in the folder `dots-and-boxplots`. In the corresponding papers, the dots- and boxplots for the DSC values are presented in **Figures 3 and 4**.
+Given the hierarchical structure of the data (data > task > region), we show the distribution of raw metric values (DSC and NSD) separately for every task and region, based on the data described in section 1 of this readme. The script `msd-dot-boxplots.R` shows the raw metric scores as jittered dots and aggregates them into boxplots for every algorithm for every task (rows) and region (separate boxplots). It uses the library `ggplot2`. The corresponding results were saved as png images and can be found in the folder `dots-and-boxplots`. In the corresponding paper, the dots- and boxplots for the DSC values (`Ph1BoxplotsDSC_ue.png` and `Ph2BoxplotsDSC_ue.png`) are presented in **Figures 3 and 4**.

-### Rankings for every region
+### 4. Rankings for every region

-For every region, a significance ranking (see [https://arxiv.org/abs/2106.05735](https://arxiv.org/abs/2106.05735) for details) will be calculated. The script `msd-rankings-subtasks-paper.R` uses the `challengeR` ([https://github.com/wiesenfa/challengeR](https://github.com/wiesenfa/challengeR)) package to compute the rankings. They will be calculated separately for both phases and both metrics. This will result in
+For every region, a significance ranking (see [https://arxiv.org/abs/2106.05735](https://arxiv.org/abs/2106.05735) for details) will be calculated based on the data described in section 1 of this readme. The script `msd-rankings-subtasks-paper.R` uses the `challengeR` ([https://github.com/wiesenfa/challengeR](https://github.com/wiesenfa/challengeR)) package to compute the rankings. They will be calculated separately for both phases and both metrics. This will result in

- 13 DSC and 13 NSD subtask-rankings for phase 1 and
- 4 DSC and 4 NSD subtask-rankings for phase 2.

-For every phase, these ranks for every algorithm will be averaged. They will be ranked by order to achieve the final rankings. The individual subtask-rankings and mean rankings can be found as csv files in the folder `rankings-per-subtask`. In the corresponding papers, the boxplots for the DSC ranks are presented in **Figure 5**.
+For every phase, these ranks will be averaged for every algorithm. The resulting mean ranks will then be ranked in order to obtain the final rankings. The individual subtask-rankings and mean rankings can be found as csv files in the folder `rankings-per-subtask`.
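+
+As an illustration, a minimal sketch of such a per-region ranking with `challengeR` (not the repository's `msd-rankings-subtasks-paper.R` script; the input file and column names are hypothetical, and the significance ranking is assumed to correspond to `challengeR`'s test-then-rank scheme):
+
+```r
+library(challengeR)
+
+# Hypothetical input for a single region and metric:
+# one DSC (or NSD) value per algorithm and test case.
+region_data <- read.csv("example-region-dsc.csv")  # assumed columns: algorithm, case, value
+
+challenge <- as.challenge(region_data,
+                          algorithm   = "algorithm",
+                          case        = "case",
+                          value       = "value",
+                          smallBetter = FALSE)  # higher DSC/NSD values are better
+
+# Test-then-rank: algorithms are ranked by their number of significant superiorities
+ranking <- testThenRank(challenge,
+                        alpha = 0.05,
+                        p.adjust.method = "none",
+                        na.treat = 0,
+                        ties.method = "min")
+ranking
+```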

-### Boxplots for the rank distribution
+### 5. Boxplots for the rank distribution

-Based on the results of the previous section (can be found in the folder `rankings-per-subtask`), boxplots were generated for the ranks achieved by every algorithm. The script `msd-rank-boxplots.R` uses the library `ggplot2` to generate boxplots for every phase. The mean ranks are indicated by red dots. The plots can be found in the folder `rank-boxplots`.
+Based on the results of the previous section (found in the folder `rankings-per-subtask`), boxplots were generated for the ranks achieved by every algorithm. The script `msd-rank-boxplots.R` uses the library `ggplot2` to generate boxplots for every phase. The mean ranks are indicated by red dots. The plots can be found in the folder `rank-boxplots`. In the corresponding paper, the boxplots for all ranks (`RankBoxplots.png`) are presented in **Figure 5**.

-### Ranking reports
+### 6. Ranking reports

-The `challengeR` ([https://github.com/wiesenfa/challengeR](https://github.com/wiesenfa/challengeR)) package offers a pdf report including advanced visualization techniques to analyze ranking uncertainty. The script `msd-generate-ranking-reports.R` generates significance rankings from a challenge object. Furthermore, bootstrapping is applied to investigate the ranking uncertainty. The reports contain various visualization techniques and are generated for both phases and both metrics. The full reports can be found in the folder `ranking-reports`. For the corresponding paper, the following information has been incorporated:
+The `challengeR` ([https://github.com/wiesenfa/challengeR](https://github.com/wiesenfa/challengeR)) package offers a pdf report including advanced visualization techniques to analyze ranking uncertainty. The script `msd-generate-ranking-reports.R` generates significance rankings from a challenge object based on the data described in section 1 of this readme. Furthermore, bootstrapping is applied to investigate the ranking uncertainty. The reports contain various visualization techniques and are generated for both phases and both metrics. The full reports can be found in the folder `ranking-reports`. A minimal usage sketch is shown after the list below. For the corresponding paper, the following information has been incorporated:

- **Kendall's tau** was computed to determine ranking stability. It was reported in **section 3.3** of the paper for the mystery phase DSC rankings (cf. table in section 3.2 in the report file `msd-phase2-dsc.pdf`).
-- The ranking stability for the DSC for **different ranking methods** have been investigated with line plots in **appendix C** (Figures C.6-C.15) for every region for both phases (cf. figures in section 3.4 in the report files `msd-phase1-dsc.pdf` and `msd-phase2-dsc.pdf`) .
-- The stacked frequency plots in **appendix D, Figure D.16** show the achieved ranks of the algorithms over 1,000 bootstrap datasets for all tasks of both phases for the DSC (cf. stacked frequency plots in section 4.1.2 in the report files `msd-phase1-dsc.pdf` and `msd-phase2-dsc.pdf`).
+- The ranking stability for the DSC for **different ranking methods** has been investigated with line plots in **appendix C** (Figures C.6-C.15) for every region for both phases (cf. figures in section 3.4 in the report files `msd-phase1-dsc.pdf` and `msd-phase2-dsc.pdf`). For the corresponding paper, the line plots have been taken from the reports and formatted by the script `msd-generate-ranking-reports.R`, function `generate_subtask_lineplots` (see files in the folder `ranking-reports/lineplots`).
+- The stacked frequency plots in **appendix D, Figure D.16** show the achieved ranks of the algorithms over 1,000 bootstrap datasets for all tasks of both phases for the DSC (cf. stacked frequency plots in section 4.1.2 in the report files `msd-phase1-dsc.pdf` and `msd-phase2-dsc.pdf`). For the corresponding paper, the stacked frequency plot has been taken from a report over all tasks of both phases and formatted by the script `msd-generate-ranking-reports.R`, function `generate_frequency_plots` (see file `msd-frequency-plots-dsc.png`).
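+
+As an illustration, a minimal sketch (not the repository's `msd-generate-ranking-reports.R` script; file, column and report names are hypothetical) of how bootstrapping and report generation could be done with `challengeR`:
+
+```r
+library(challengeR)
+
+# Hypothetical input for a single region and metric (assumed columns: algorithm, case, value)
+region_data <- read.csv("example-region-dsc.csv")
+challenge   <- as.challenge(region_data, algorithm = "algorithm", case = "case",
+                            value = "value", smallBetter = FALSE)
+ranking     <- testThenRank(challenge, alpha = 0.05, p.adjust.method = "none",
+                            na.treat = 0, ties.method = "min")
+
+# Bootstrap the ranking to investigate its uncertainty
+set.seed(1)
+ranking_bootstrapped <- bootstrap(ranking, nboot = 1000)
+
+# Generate a pdf report containing the ranking-uncertainty visualizations
+report(ranking_bootstrapped,
+       title  = "MSD example report",
+       file   = "ranking-reports/example-report",
+       format = "PDF",
+       latex_engine = "pdflatex",
+       clean  = TRUE)
+```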
+
+### 7. Mean metric values for all participating teams for every region
+
+For every subtask and every participant, the mean metric values (and the median of those means across all algorithms) were calculated based on the data described in section 1 of this readme. The script `msd-mean-values.R` provides the corresponding code and saves the resulting values per task and metric variant in the `mean-values-per-subtask` folder. In the corresponding paper, the mean DSC values (and the median of those means across all algorithms) are presented in **appendix C, Tables C.4-C.13**, based on the files `subtask-means-1-dsc-TASK.csv`.

-### Comparison of the 2018 MSD challenge and the live-decathlon challenge
+### 8. Comparison of the 2018 MSD challenge and the live-decathlon challenge

-TODO
+For privacy reasons, the raw metric values for each participant of the live challenge **cannot** be published. Nevertheless, the script `msd-rankings-live-vs-2018.R` contains the code to create a dots- and boxplot of the mean DSC values per participant for every region. In the corresponding paper, the boxplots per region for both the live and the 2018 challenge (`LiveVs2018.png`) are presented in **appendix E, Figure E.17**.
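+
+As an illustration, a minimal `ggplot2` sketch of such a dots- and boxplot (not the repository's `msd-rankings-live-vs-2018.R` script; file and column names are hypothetical):
+
+```r
+library(ggplot2)
+
+# Hypothetical input: one mean DSC value per participant, challenge and region.
+means <- read.csv("mean-dsc-per-region.csv")  # assumed columns: challenge, region, mean_dsc
+
+p <- ggplot(means, aes(x = region, y = mean_dsc, fill = challenge)) +
+  geom_boxplot(outlier.shape = NA) +                                  # one box per region and challenge
+  geom_jitter(position = position_jitterdodge(jitter.width = 0.15),   # mean values as jittered dots
+              size = 1) +
+  labs(x = "Region", y = "Mean DSC", fill = "Challenge") +
+  theme_bw()
+
+ggsave("LiveVs2018-example.png", p, width = 10, height = 5)
+```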