diff --git a/tutorial/Overview.Rmd b/tutorial/Overview.Rmd
index 7c2e0ca..b581126 100644
--- a/tutorial/Overview.Rmd
+++ b/tutorial/Overview.Rmd
@@ -1,374 +1,378 @@
---
title: Overview of the methods used
output:
  pdf_document:
    toc: yes
    toc_depth: '3'
  github_document:
    toc: yes
    toc_depth: 1
editor_options:
  chunk_output_type: console
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  # fig.path = "README-",
  fig.width = 9,
  fig.height = 5,
  width=160
)
```

# Introduction

This document is meant as an overview guide to the classes, methods and steps used in the tutorial scripts, and aims to provide a deeper understanding of the analysis and visualization toolkit. The overview is divided into sections, following the order in which they are used.

# Ranking configuration

Once the data has been loaded (either manually or from a .csv file), the first thing to do is to create a challenge object. Then, the ranking method is chosen and configured.

## Define challenge object

-Challenges can be single or multi-task. We define a challenge task as a subproblem to be solved in the scope of a challenge for which a dedicated ranking/leaderboard is provided (if any). The assessment method (e.g. metric(s) applied) may vary across different tasks of a challenge. For example, a segmentation challenge may comprise three tasks:
+Challenges can be single- or multi-task. We define a challenge task as a subproblem to be solved in the scope of a challenge for which a dedicated ranking/leaderboard is provided (if any). The assessment method (e.g. metric(s) applied) may vary across different tasks of a challenge. For example, a segmentation challenge may comprise three tasks:

1) segmentation of the liver
2) segmentation of the kidney
3) segmentation of the spleen

-In the context of the visualization toolkit, we differentiate between challenges that only comprise a single task ("single task challenge") and challenges with multiple tasks that contain different results and rankings ("multitask challenge"). In the latter case, the report can directly be configured across all specified tasks by defining a task column in the data matrix.
+In the context of the visualization toolkit, we differentiate between challenges that comprise only a single task ("single-task challenge") and challenges with multiple tasks, each with its own results and ranking ("multi-task challenge"). In the latter case, the report can directly be configured across all specified tasks by defining a task column in the data matrix.
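For illustration, a hypothetical data matrix in this long format could look as follows (the column names match the examples below; the algorithms, cases, tasks and metric values are made up):

```{r, eval=F, echo=T}
# Hypothetical example data, purely for illustration:
# two algorithms, two tasks, one metric value per test case.
data_matrix=data.frame(alg_name=c("A1", "A2", "A1", "A2"),
                       value   =c(0.85, 0.80, 0.90, 0.82),
                       case    =c("case1", "case1", "case2", "case2"),
                       task    =c("T1", "T1", "T2", "T2"))
```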
-The first step is to create a challenge object. The class "challengeR.R" will be used for that purpose, which will be now analysed.
+The first step is to create a challenge object. The file "challengeR.R" is used for that purpose and is analysed below.

The following code refers to the constructor:

```{r, eval=F, echo=T}
# challengeR.R
as.challenge=function(object,
                      value,
                      algorithm,
                      case=NULL,
+                     taskName=NULL,
                      by=NULL,
                      annotator=NULL,
                      smallBetter=FALSE,
-                     na.treat=NULL, # optional
+                     na.treat=NULL,
                      check=TRUE)
```

Each parameter corresponds to:

-- object: the object that will be returned, in the specific case, challenge
-- value: column corresponding to the values of the metrics
-- algorithm: column corresponding to the algorithm
+- object: the input object from which the challenge object is built, in this case the data set itself
+- value: column corresponding to the values of the metric (only one metric is supported)
+- algorithm: column corresponding to the algorithm identifiers
- case: column corresponding to the test case identifier
-- by: (="task" ), use it when it is a multi-task challenge. If the parameter is not specified, the challenge will be automatically be interpreted as a single task challenge.
+- taskName: optional task name (string) for single-task challenges; it will be displayed in the plot titles
+- by: (="task") column that defines the tasks; use it for multi-task challenges. If the parameter is not specified, the challenge will automatically be interpreted as a single-task challenge.
- annotator: (currently not implemented) specify here if there is more than one annotator
-- smallBetter: specify if small metric values will lead to a better performance
-- na.treat: specify how missing values (NA) are treated, e.g. set them to the worst possible metric values
-- check: computes sanity check if TRUE. The sanity check can be computed for both single and multi-task challenges. It checks missing algorithm performance, and also wether the test cases appear more than once.
+- smallBetter: specify whether small metric values indicate a better performance
+- na.treat: (optional) specify how missing values (NA) are treated, e.g. set them to the worst possible metric value. It does not need to be specified if the data set is known to contain no NAs, or if rank-then-aggregate is applied.
+- check: performs a sanity check if TRUE. The sanity check is available for both single- and multi-task challenges. It checks for missing algorithm performance and whether test cases appear more than once.

An example of how to use it (for a multi-task challenge):

```{r, eval=F, echo=T}
# challengeR.R
challenge=as.challenge(data_matrix,
                       value="value",
                       algorithm="alg_name",
                       case="case",
                       by="task",
                       smallBetter = FALSE)
```

-! Take into account that the code differs for single/multi-task challenges !
+! Take into account that for single-task challenges, the "by" parameter should not be configured !

-For single task challenges, if the data matrix consists of a task column, it is easier to create a subset of the data matrix that only includes the values for that specific task:
+For single-task challenges, if the data matrix contains a task column, it is easier to create a subset of the data matrix that only includes the values for that specific task:

```{r, eval=F, echo=T}
dataSubset=subset(data_matrix, task=="TASK_NAME")
```

In this way, "dataSubset" can then be used to create the challenge object, as shown below.
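A minimal sketch of the corresponding single-task call, reusing the subset and the column names from above (the optional taskName is just an illustrative label):

```{r, eval=F, echo=T}
# Single-task challenge object built from the task subset; "by" is omitted.
challenge=as.challenge(dataSubset,
                       taskName="TASK_NAME",
                       value="value",
                       algorithm="alg_name",
                       case="case",
                       smallBetter=FALSE)
```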
## Configure ranking method

The files "wrapper.R", "aaggregate.R" and "Rank.aggregated.R" are used. In order to configure the ranking methods, the following parameters are considered:

- FUN: aggregation function, e.g. mean, median, min, max, or e.g. function(x) quantile(x, probs=0.05)
-- na.treat: treatment of`missing data / null values (already specified when the challenge object was created, do we need to specify here again?) either "na.rm" to remove missing data, set missings to numeric value (e.g. 0) or specify a function e.g. function(x) min(x)
-- ties.method: a character string specifying how ties (two items that are the same in rank) are treated, see ?base::rank [Strategies for assigning rankings](https://en.wikipedia.org/wiki/Ranking#Strategies_for_assigning_rankings)
+- na.treat: treatment of missing data / null values (it needs to be specified here again because it was an optional parameter when the challenge object was created): either "na.rm" to remove missing data, a numeric value (e.g. 0) to which missing values are set, or a function, e.g. function(x) min(x)
+- ties.method: a character string specifying how ties (two items that are the same in rank) are treated, see ?base::rank or [*Strategies for assigning rankings*](https://en.wikipedia.org/wiki/Ranking#Strategies_for_assigning_rankings) for more details
- alpha: significance level (only for significance ranking)
- p.adjust.method: method for adjustment for multiple testing, see ?p.adjust

Different ranking methods are available:

#### Metric-based aggregation -> aggregateThenRank method

```{r, eval=F, echo=T}
# wrapper.R
aggregateThenRank=function(object, FUN, ties.method = "min", ...){
  object %>%
    aggregate(FUN=FUN, ...) %>%
    rank(ties.method = ties.method)
}
```

First, (object %>% aggregate), the metric values for each algorithm are aggregated across all cases using the specified aggregation function:

```{r, eval=F, echo=T}
# aaggregate.R
aggregate.challenge=function(x,
                             FUN=mean,
                             na.treat,
                             alpha=0.05,
                             p.adjust.method="none",
                             parallel=FALSE,
                             progress="none",
                             ...)
```

-Second, (aggregate %>% rank), the aggregated metric values are converted into a ranking list, following the smallBetter argument defined above:
+Second, (aggregate %>% rank), the aggregated metric values are converted into a ranking list, following the largeBetter argument (the counterpart of the smallBetter argument defined above):

```{r, eval=F, echo=T}
# Rank.aggregated.R
rank.aggregated <- function(object,
                            ties.method="min",
                            largeBetter,
                            ...)
```

An example of "aggregate-then-rank" use (taking the mean for aggregation):

```{r, eval=F, echo=T}
ranking=challenge%>%aggregateThenRank(FUN = mean,
                                      na.treat=0,
                                      ties.method = "min"
                                      )
```
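Any of the aggregation functions listed above can be plugged in instead of the mean; for instance, a sketch using the 5% quantile mentioned earlier:

```{r, eval=F, echo=T}
# Aggregate-then-rank with the 5% quantile as aggregation function.
ranking=challenge%>%aggregateThenRank(FUN=function(x) quantile(x, probs=0.05),
                                      na.treat=0,
                                      ties.method="min")
```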
#### Case-based aggregation -> rankThenAggregate method

```{r, eval=F, echo=T}
# wrapper.R
rankThenAggregate=function(object,
                           FUN,
                           ties.method = "min"
                           ){
  object %>%
    rank(ties.method = ties.method) %>%
    aggregate(FUN=FUN) %>%
    rank(ties.method = ties.method)
}
```

-First, (object %>% rank), a ranking will be created for each case across all algorithms. Missing values can be set to the last rank:
+First, (object %>% rank), a ranking is created for each case across all algorithms. Missing values are assigned to the worst rank:

```{r, eval=F, echo=T}
# rrank.R
rank.challenge=function(object,
                        x,
                        ties.method="min",
                        ...)
```

Second, (rank %>% aggregate), the ranks per case are aggregated for each algorithm:

```{r, eval=F, echo=T}
# aaggregate.R
aggregate.ranked <- function(x,
                             FUN=mean,
                             ...)
```

Third, (aggregate %>% rank), the previously ranked and aggregated values are converted to a ranking list again:

```{r, eval=F, echo=T}
# Rank.aggregated.R
rank.aggregated <- function(object,
                            ties.method="min",
                            largeBetter,
                            ...)
```

An example of "rank-then-aggregate" with arguments as above (taking the mean for aggregation):

```{r, eval=F, echo=T}
ranking=challenge%>%rankThenAggregate(FUN = mean,
                                      ties.method = "min"
                                      )
```

#### Significance ranking -> testThenRank method

This method is similar to "aggregateThenRank", but with the aggregation function fixed to "significance".

```{r, eval=F, echo=T}
# wrapper.R
testThenRank=function(object, FUN, ties.method = "min", ...){
  object %>%
    aggregate(FUN="significance", ...) %>%
    rank(ties.method = ties.method)
}
```

First, (object %>% aggregate), the metric values are aggregated across all cases. In this case, a pairwise comparison between all algorithms is performed using statistical tests. For each algorithm, it is counted how often it is significantly superior to the others. This count is saved as the aggregated value:

! No need to specify the function again, it is already set as "significance" !

```{r, eval=F, echo=T}
# aaggregate.R
aggregate.challenge=function(x,
                             FUN="significance",
                             na.treat,
                             alpha=0.05,
                             p.adjust.method="none",
                             parallel=FALSE,
                             progress="none",
                             ...)
```

Second, (aggregate %>% rank), the aggregated values are converted to a ranking list:

```{r, eval=F, echo=T}
# Rank.aggregated.R
rank.aggregated <- function(object,
                            ties.method="min",
                            largeBetter,
                            ...)
```

An example of test-then-rank based on the Wilcoxon signed rank test:

```{r, eval=F, echo=T}
ranking=challenge%>%testThenRank(alpha=0.05,
                                 p.adjust.method="none",
                                 na.treat=0,
                                 ties.method = "min"
                                 )
```

-# Uncertainity analysis (bootstrapping)
+# Uncertainty analysis (bootstrapping)

The assessment of the stability of rankings across different ranking methods with respect to both sampling variability and variability across tasks is of major importance. In order to investigate ranking stability, the bootstrap approach can be used for a given ranking method. The procedure consists of:

1. Using the available data sets to generate N bootstrap datasets
2. Performing the ranking on each bootstrap dataset

The ranking strategy is performed repeatedly on each bootstrap sample. One bootstrap sample of a task with n test cases consists of n test cases randomly drawn with replacement from this task. A total of b of these bootstrap samples are drawn (e.g., b = 1000).

Bootstrap approaches can be evaluated in two ways: either the rankings for each bootstrap sample are evaluated for each algorithm, or the distribution of correlations or pairwise distances between the ranking list based on the full assessment data and the ranking lists based on the individual bootstrap samples can be explored.

-! Note that this step is optional, can be ommited and directly generate the report. !
+! Note that this step is optional; it can be omitted and the report generated directly. !

The following method is used to perform the ranking on the generated bootstrap datasets:

```{r, eval=F, echo=T}
# Bootstrap.R
bootstrap.ranked=function(object,
                          nboot,
                          parallel=FALSE,
                          progress="text",
                          ...)
```

- nboot: number of bootstrap datasets to generate
- parallel: TRUE when using multiple CPUs
-- progress: defines if the progress will be reported and how (?)
+- progress: when set to "text", the progress of the bootstrapping is reported

An example of bootstrapping using multiple CPUs (8 CPUs):

```{r, eval=F, echo=T}
library(doParallel)
registerDoParallel(cores=8)
set.seed(1)
ranking_bootstrapped=ranking%>%bootstrap(nboot=1000, parallel=TRUE, progress = "none")
stopImplicitCluster()
```
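Without parallelization, the call reduces to a single line (a minimal sketch reusing the `ranking` object from above; the defaults parallel=FALSE and progress="text" then apply):

```{r, eval=F, echo=T}
set.seed(1)
ranking_bootstrapped=ranking%>%bootstrap(nboot=1000)
```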
# Report generation

-Finally, the report will be generated. For this last step take into account if the uncertainity analysis was performed or not.
+Finally, the report will be generated. For this last step, take into account whether the uncertainty analysis was performed or not.

-If the uncertainity analysis was not performed, use:
+If the uncertainty analysis was not performed, use:

```{r, eval=F, echo=T}
# Report.R
report.ranked=function(object,
                       file,
                       title="",
                       colors=default_colors,
                       format="PDF",
                       latex_engine="pdflatex",
                       open=TRUE,
                       ...)
```

-If the uncertainity analysis was performed, use:
+If the uncertainty analysis was performed, use:

```{r, eval=F, echo=T}
# Report.R
report.bootstrap=function(object,
                          file,
                          title="",
                          colors=default_colors,
                          format="PDF",
                          latex_engine="pdflatex",
+                         clean=TRUE,
                          open=TRUE,
                          ...)
```

The report can be generated in different formats and is configured via the following parameters:

-- file: name of the output file. If the output path is not specified, the working directory is used. If file is specified but does not have a file extension, an extension will be automatically added according to the output format given in *format*. If omitted, the report is created in a temporary folder with file name "report".
+- file: name of the output file. If the output path is not specified, the working directory is used. If the file is specified but does not have a file extension, an extension will be added automatically according to the output format given in *format*. If omitted, the report is created in a temporary folder with file name "report".
- title: title of the report
- colors: color coding of the algorithms across all figures. Can be changed, e.g., to colors=viridisLite::inferno, which is designed to be analytically perceptually uniform, both in regular form and when converted to black-and-white, and to be perceivable by readers with the most common form of color blindness. See the package viridis for further similar functions.
- format: output format ("PDF", "HTML" or "Word")
- latex_engine: LaTeX engine for producing PDF output ("pdflatex", "lualatex", "xelatex")
-- open: optional. Using TRUE will clean intermediate files that are created during rendering. Using FALSE allows to retain intermediate files, such as separate files for each figure.
+- clean: optional. Using TRUE will clean intermediate files that are created during rendering. Using FALSE allows intermediate files to be retained, such as separate files for each figure.
+- open: specifies whether the report is opened automatically after generation

-An example of how to generate the report for a *single task* challenge:
+An example of how to generate the report for a *single-task* challenge:

```{r, eval=F, echo=T}
ranking_bootstrapped %>%
  report(title="singleTaskChallengeExample",
         file = "filename",
         format = "PDF",
         latex_engine="pdflatex",
         clean=TRUE
         )
```
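If the bootstrapping step was skipped, a report can be generated analogously from the ranking object itself (a sketch, assuming `ranking` as defined above):

```{r, eval=F, echo=T}
ranking %>%
  report(title="singleTaskChallengeExample",
         file = "filename",
         format = "PDF",
         latex_engine="pdflatex"
         )
```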
-! Note that the code differs slightly for single and multi task challenges. !
+! Note that the code differs slightly for single- and multi-task challenges. !

-For multi task challenges consensus ranking (rank aggregation across tasks) has to be given additionally. Consensus relations “synthesize” the information in the elements of a relation ensemble into a single relation, often by minimizing a criterion function measuring how dissimilar consensus candidates are from the (elements of) the ensemble (the so-called “optimization approach”).
+For multi-task challenges, a consensus ranking (rank aggregation across tasks) has to be given additionally. Consensus relations “synthesize” the information in the elements of a relation ensemble into a single relation, often by minimizing a criterion function measuring how dissimilar consensus candidates are from the (elements of) the ensemble (the so-called “optimization approach”).

The following method is used:

```{r, eval=F, echo=T}
# consensus.R
consensus.ranked.list=function(object,
                               method,
                               ...)
```

- method: consensus ranking method, see ?relation_consensus for the different methods to derive a consensus ranking.

An example of computing the ranking consensus across tasks, here the consensus ranking according to mean ranks across tasks:

```{r, eval=F, echo=T}
meanRanks=ranking%>%consensus(method = "euclidean")
```

Generate the report as above, but with the additional specification of the consensus ranking:

```{r, eval=F, echo=T}
ranking_bootstrapped %>%
  report(consensus=meanRanks,
         title="multiTaskChallengeExample",
         file = "filename",
         format = "PDF",
         latex_engine="pdflatex"
         )
```

-# Changes
+# Features

-- Reports for subsets (top list) of algorithms: Use e.g. `subset(ranking_bootstrapped, top=3) %>% report(...)` (or `subset(ranking, top=3) %>% report(...)` for report without bootstrap results) to only show the top 3 algorithms according to the chosen ranking methods, where `ranking_bootstrapped` and `ranking` objects as defined in the example. Line plot for ranking robustness can be used to check whether algorithms performing well in other ranking methods are excluded. Bootstrapping still takes entire uncertainty into account. Podium plot neglect and ranking heatmap neglect excluded algorithms. Only available for single task challenges (for mutli task challenges not sensible because each task would contain a different sets of algorithms).
+- Reports for subsets (top list) of algorithms: Use e.g. `subset(ranking_bootstrapped, top=3) %>% report(...)` (or `subset(ranking, top=3) %>% report(...)` for a report without bootstrap results) to only show the top 3 algorithms according to the chosen ranking method, where `ranking_bootstrapped` and `ranking` are objects as defined in the examples above. The line plot for ranking robustness can be used to check whether algorithms that perform well with other ranking methods are excluded. Bootstrapping still takes the entire uncertainty into account. Podium plots and ranking heatmaps neglect excluded algorithms. Only available for single-task challenges (not sensible for multi-task challenges because each task would contain a different set of algorithms).

- Reports for subsets of tasks: Use e.g. `subset(ranking_bootstrapped, tasks=c("task1", "task2", "task3")) %>% report(...)` to restrict the report to the tasks "task1", "task2" and "task3". You may want to recompute the consensus ranking beforehand using `meanRanks=subset(ranking, tasks=c("task1", "task2", "task3"))%>%consensus(method = "euclidean")`.

# Terms of use

Licensed under GPL-3. If you use this software for a publication, cite

Wiesenfarth, M., Reinke, A., Landman, B.A., Cardoso, M.J., Maier-Hein, L. and Kopp-Schneider, A. (2019). Methods and open-source toolkit for analyzing and visualizing challenge results. *arXiv preprint arXiv:1910.05121*