diff --git a/Readme.Rmd b/Readme.Rmd
index a4e6169..41303d5 100644
--- a/Readme.Rmd
+++ b/Readme.Rmd
@@ -1,517 +1,545 @@
---
title: Methods and open-source toolkit for analyzing and visualizing challenge results
output:
  pdf_document:
    toc: yes
    toc_depth: '3'
  github_document:
    toc: yes
    toc_depth: 1
editor_options:
  chunk_output_type: console
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  # fig.path = "README-",
  fig.width = 9,
  fig.height = 5,
  width = 160
)
```

Note that this is ongoing work (version `r packageVersion("challengeR")`); updates with possibly major changes may follow. *Please make sure that you use the most current version!*

Change log at the end of this document.

# Introduction

The current framework is a tool for analyzing and visualizing challenge results in the field of biomedical image analysis and beyond.

Biomedical challenges have become the de facto standard for benchmarking biomedical image analysis algorithms. While the number of challenges is steadily increasing, surprisingly little effort has been invested in ensuring high-quality design, execution and reporting for these international competitions. Specifically, results analysis and visualization in the presence of uncertainty have been given almost no attention in the literature.

Given these shortcomings, the current framework aims to enable fast and wide adoption of comprehensive analysis and visualization of the results of single-task and multi-task challenges; the methods have been applied to a number of simulated and real-life challenges to demonstrate their specific strengths and weaknesses. This approach offers an intuitive way to gain important insights into the relative and absolute performance of algorithms that cannot be revealed by commonly applied visualization techniques.

# Installation

Requires R version >= 3.5.2 (https://www.r-project.org).

Further, a recent version of Pandoc (>= 1.12.3) is required. RStudio (https://rstudio.com) includes this automatically, so you do not need to download Pandoc if you plan to use rmarkdown from the RStudio IDE; otherwise you'll need to install Pandoc for your platform (https://pandoc.org/installing.html). Finally, if you want to generate a PDF report, you will need to have LaTeX installed (e.g. MiKTeX, MacTeX or TinyTeX).

To get the current development version of the R package from GitHub:

```{r, eval=F,R.options,}
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("Rgraphviz", dependencies = TRUE)
devtools::install_github("wiesenfa/challengeR", dependencies = TRUE)
```

If you are asked whether you want to update installed packages and you type "a" for all, you might need administrator rights to update R core packages. You can also type "n" to update no packages. If you are asked "Do you want to install from sources the packages which need compilation? (Yes/no/cancel)", you can safely type "no".

If you get *Warning messages* (in contrast to *Error* messages), these might not be problematic and you can try to proceed.

# Terms of use

Licensed under GPL-3. If you use this software for a publication, cite

Wiesenfarth, M., Reinke, A., Landman, B.A., Cardoso, M.J., Maier-Hein, L. and Kopp-Schneider, A. (2019). Methods and open-source toolkit for analyzing and visualizing challenge results.
*arXiv preprint arXiv:1910.05121*

# Usage

Each of the following steps has to be run to generate the report: (1) load the package, (2) load the data, (3) perform the ranking, (4) perform bootstrapping and (5) generate the report.

## 1. Load package

Load the package:

```{r, eval=F}
library(challengeR)
```

## 2. Load data

### Data requirements

The data requires the following *columns*:

* a *task identifier* in case of multi-task challenges
* a *test case identifier*
* the *algorithm name*
* the *metric value*

In case of missing metric values, a missing observation has to be provided (either as a blank field or "NA").

For example, in a challenge with 2 tasks, 2 test cases and 2 algorithms, where algorithm "A2" did not give a prediction for test case "case2" in task "T2" (and thus NA or a blank field is inserted for the missing value), the data set might look like this:

```{r, eval=T, echo=F,results='asis'}
set.seed(1)
a=cbind(expand.grid(Task=paste0("T",1:2),TestCase=paste0("case",1:2),Algorithm=paste0("A",1:2)),MetricValue=round(c(runif(7,0,1),NA),3))
print(knitr::kable(a[order(a$Task,a$TestCase,a$Algorithm),],row.names=F))
```

### Load data

If you have assessment data at hand stored in a CSV file, use (skip the following code line if you want to use simulated data instead)

```{r, eval=F, echo=T}
data_matrix=read.csv(file.choose()) # type ?read.csv for help
```

This allows you to choose a file interactively; otherwise, replace *file.choose()* with the file path (in the style "/path/to/dataset.csv") in quotation marks.

For illustration purposes, simulated data is generated *instead* in the following (skip the next code chunk if you have already loaded data). The data is also stored as "data_matrix.csv" in the repository.

```{r, eval=F, echo=T}
if (!requireNamespace("permute", quietly = TRUE)) install.packages("permute")

n=50

set.seed(4)
strip=runif(n,.9,1)
c_ideal=cbind(task="c_ideal",
              rbind(
                data.frame(alg_name="A1",value=runif(n,.9,1),case=1:n),
                data.frame(alg_name="A2",value=runif(n,.8,.89),case=1:n),
                data.frame(alg_name="A3",value=runif(n,.7,.79),case=1:n),
                data.frame(alg_name="A4",value=runif(n,.6,.69),case=1:n),
                data.frame(alg_name="A5",value=runif(n,.5,.59),case=1:n)
              ))

set.seed(1)
c_random=data.frame(task="c_random",
                    alg_name=factor(paste0("A",rep(1:5,each=n))),
                    value=plogis(rnorm(5*n,1.5,1)),
                    case=rep(1:n,times=5)
                    )

strip2=seq(.8,1,length.out=5)
a=permute::allPerms(1:5)
c_worstcase=data.frame(task="c_worstcase",
                       alg_name=c(t(a)),
                       value=rep(strip2,nrow(a)),
                       case=rep(1:nrow(a),each=5)
                       )
c_worstcase=rbind(c_worstcase,
                  data.frame(task="c_worstcase",alg_name=1:5,value=strip2,case=max(c_worstcase$case)+1)
                  )
c_worstcase$alg_name=factor(c_worstcase$alg_name,labels=paste0("A",1:5))

data_matrix=rbind(c_ideal, c_random, c_worstcase)
```

## 3. Perform ranking

### 3.1 Define challenge object

Code differs slightly for single and multi task challenges.
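Whichever way the data was loaded, a quick sanity check can catch problems (unexpected column names, missing metric values, unequal numbers of cases per algorithm) before the challenge object is defined. A minimal sketch using base R, assuming the column names of the simulated `data_matrix` above (adapt them to your own data):

```{r, eval=FALSE}
str(data_matrix)                              # column names and types
table(data_matrix$task, data_matrix$alg_name) # number of cases per algorithm and task
sum(is.na(data_matrix$value))                 # number of missing metric values
```
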
In case of a single task challenge use ```{r, eval=F, echo=T} # Use only task "c_random" in object data_matrix dataSubset=subset(data_matrix, task=="c_random") challenge=as.challenge(dataSubset, # Specify how to refer to the task in plots and reports taskName="Task 1", # Specify which column contains the algorithm, # which column contains a test case identifier # and which contains the metric value: algorithm="alg_name", case="case", value="value", # Specify if small metric values are better smallBetter = FALSE) ``` *Instead*, for a multi-task challenge use ```{r, eval=F, echo=T} # Same as above but with 'by="task"' where variable "task" contains the task identifier challenge=as.challenge(data_matrix, by="task", algorithm="alg_name", case="case", value="value", smallBetter = FALSE) ``` ### 3.2 Perform ranking Different ranking methods are available, choose one of them: - for "aggregate-then-rank" use (here: take mean for aggregation) ```{r, eval=F, echo=T} ranking=challenge%>%aggregateThenRank(FUN = mean, # aggregation function, # e.g. mean, median, min, max, # or e.g. function(x) quantile(x, probs=0.05) na.treat=0, # either "na.rm" to remove missing data, # set missings to numeric value (e.g. 0) # or specify a function, # e.g. function(x) min(x) ties.method = "min" # a character string specifying # how ties are treated, see ?base::rank ) ``` - *alternatively*, for "rank-then-aggregate" with arguments as above (here: take mean for aggregation): ```{r, eval=F, echo=T} ranking=challenge%>%rankThenAggregate(FUN = mean, ties.method = "min" ) ``` - *alternatively*, for test-then-rank based on Wilcoxon signed rank test: ```{r, eval=F, echo=T} ranking=challenge%>%testThenRank(alpha=0.05, # significance level p.adjust.method="none", # method for adjustment for # multiple testing, see ?p.adjust na.treat=0, # either "na.rm" to remove missing data, # set missings to numeric value (e.g. 0) # or specify a function, e.g. function(x) min(x) ties.method = "min" # a character string specifying # how ties are treated, see ?base::rank ) ``` ## 4. Perform bootstrapping Perform bootstrapping with 1000 bootstrap samples using one CPU ```{r, eval=F, echo=T} set.seed(1) ranking_bootstrapped=ranking%>%bootstrap(nboot=1000) ``` If you want to use multiple CPUs (here: 8 CPUs), use ```{r, eval=F, echo=T} library(doParallel) registerDoParallel(cores=8) set.seed(1) ranking_bootstrapped=ranking%>%bootstrap(nboot=1000, parallel=TRUE, progress = "none") stopImplicitCluster() ``` ## 5. Generate the report Generate report in PDF, HTML or DOCX format. Code differs slightly for single and multi task challenges. ### 5.1 For single task challenges ```{r, eval=F, echo=T} ranking_bootstrapped %>% report(title="singleTaskChallengeExample", # used for the title of the report file = "filename", format = "PDF", # format can be "PDF", "HTML" or "Word" latex_engine="pdflatex", #LaTeX engine for producing PDF output. Options are "pdflatex", "lualatex", and "xelatex" clean=TRUE #optional. Using TRUE will clean intermediate files that are created during rendering. ) ``` Argument *file* allows for specifying the output file path as well, otherwise the working directory is used. If file is specified but does not have a file extension, an extension will be automatically added according to the output format given in *format*. Using argument *clean=FALSE* allows to retain intermediate files, such as separate files for each figure. If argument "file" is omitted, the report is created in a temporary folder with file name "report". 
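If no LaTeX installation is available, an HTML report can be generated instead. The report can also be restricted to the best-performing algorithms via `subset()` (see the change log for version 0.2.3 below). A brief sketch; the file names are chosen here purely for illustration:

```{r, eval=FALSE}
# HTML output does not require a LaTeX installation
ranking_bootstrapped %>% report(title="singleTaskChallengeExample",
                                file = "filename",
                                format = "HTML")

# Restrict the report to the top 3 algorithms (single task challenges only)
subset(ranking_bootstrapped, top=3) %>% report(title="singleTaskChallengeExample",
                                               file = "filename_top3",
                                               format = "PDF")
```
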
### 5.2 For multi task challenges

Same as for single task challenges, but additionally a consensus ranking (rank aggregation across tasks) has to be given.

Compute the ranking consensus across tasks (here: consensus ranking according to mean ranks across tasks):

```{r, eval=F, echo=T}
# See ?relation_consensus for different methods to derive consensus ranking
meanRanks=ranking%>%consensus(method = "euclidean")
meanRanks # note that there may be ties (i.e. some algorithms have identical mean rank)
```

Generate the report as above, but additionally specify the consensus ranking

```{r, eval=F, echo=T}
ranking_bootstrapped %>%
  report(consensus=meanRanks,
         title="multiTaskChallengeExample",
         file = "filename",
         format = "PDF", # format can be "PDF", "HTML" or "Word"
         latex_engine="pdflatex" #LaTeX engine for producing PDF output. Options are "pdflatex", "lualatex", and "xelatex"
        )
```

# Troubleshooting

This section compiles issues reported by users.

### RStudio specific

#### - Warnings while installing the GitHub repository

##### Error:

While trying to install the current version of the repository:

```{r, eval=F, echo=T}
devtools::install_github("wiesenfa/challengeR", dependencies = TRUE)
```

The following warning showed up in the output:

```{r, eval=F, echo=T}
WARNING: Rtools is required to build R packages, but is not currently installed.
```

Therefore, Rtools was installed via a separate executable (https://cran.r-project.org/bin/windows/Rtools/) and the warning disappeared.

##### Solution:

There is actually no need to install Rtools; it is not used by the toolkit. Instead, choose not to install it when asked. See the comment in the installation section: “If you are asked whether you want to update installed packages and you type “a” for all, you might need administrator rights to update R core packages. You can also type “n” to update no packages. If you are asked “Do you want to install from sources the packages which need compilation? (Yes/no/cancel)”, you can safely type “no”.”

#### - Unable to install the current version of the tool from GitHub

##### Error:

While trying to install the current version of the tool from GitHub, the installation failed. The error message was:

```{r, eval=F, echo=T}
byte-compile and prepare package for lazy loading
Error: (converted from warning) package 'ggplot2' was built under R version 3.6.3
Execution halted
ERROR: lazy loading failed for package 'challengeR'
* removing 'C:/Users/.../Documents/R/win-library/3.6/challengeR'
* restoring previous 'C:/Users/.../Documents/R/win-library/3.6/challengeR'
Error: Failed to install 'challengeR' from GitHub:
  (converted from warning) installation of package 'C:/Users/.../AppData/Local/Temp/Rtmp615qmV/file4fd419555eb4/challengeR_0.3.1.tar.gz' had non-zero exit status
```

The problem was that some packages had been updated and were built under R 3.6.3, while the installed R version was still R 3.6.1.

##### Solution:

The solution was to update R 3.6.1 to R 3.6.3. Another way would have been to reset the individual packages to the versions built under R 3.6.1.

#### - Unable to install the toolkit from GitHub

##### Error:

While trying to install the current version of the tool from GitHub, the installation failed.
```{r, eval=F, echo=T}
devtools::install_github("wiesenfa/challengeR", dependencies = TRUE)
```

The error message was:

```{r, eval=F, echo=T}
Error: .onLoad failed in loadNamespace() for 'pkgload', details:
  call: loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]])
  error: there is no package called ‘backports’
```

The problem was that the package 'backports' had not been installed.

##### Solution:

The solution was to install 'backports' manually.

```{r, eval=F, echo=T}
install.packages("backports")
```

#### - Unable to install the package in R

##### Error:

While trying to install the package in R by running the following commands:

```{r, eval=F, echo=T}
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("Rgraphviz", dependencies = TRUE)
devtools::install_github("wiesenfa/challengeR", dependencies = TRUE)
```

The error message was:

```{r, eval=F, echo=T}
ERROR:
1: In file(con, "r") : URL 'https://bioconductor.org/config.yaml': status was 'SSL connect error'
2: packages ‘BiocVersion’, ‘Rgraphviz’ are not available (for R version 3.6.1)
```

##### Solution:

The solution was to restart RStudio.

#### - Incorrect column order

##### Error:

When the data columns were named "task" and "case", R was confused because the arguments of the challenge object have the same names, and it produced the following error:

```{r, eval=F, echo=T}
Error in table(object[[task]][[algorithm]], object[[task]][[case]]) :
  all arguments must have the same length
```

##### Solution:

The solution was to rename the columns.

+#### - Wrong versions of packages
+##### Error:
+
+While running this command:
+
+```{r, eval=F, echo=T}
+ devtools::install_github("wiesenfa/challengeR", dependencies = TRUE)
+```
+
+The following errors occurred:
+- Error: the package 'purrr' had been built under R version 3.6.3
+- Error: the package 'ggplot2' had been built under R version 3.6.3
+- Error in loadNamespace(j <- i[[L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]):
+  namespace 'glue' 1.3.1 is already loaded, but >= 1.3.2 is required
+
+##### Solution:
+
+The issue was solved by changing the package versions. The installed versions were:
+- purrr 0.3.4
+- ggplot2 3.3.2
+- glue 1.3.1
+
+They were changed to the following versions:
+- purrr 0.3.3
+- ggplot2 3.3.0
+- glue 1.4.2
+

### Related to MiKTeX

#### - Missing packages

##### Error:

While generating the PDF with MiKTeX (2.9), the following error showed up:

```{r, eval=F, echo=T}
fatal pdflatex - gui framework cannot be initialized
```

There is an issue with installing missing packages in LaTeX.

##### Solution:

Open your MiKTeX Console --> Settings and select "Always install missing packages on-the-fly". Then generate the report. Once the report is generated, you can reset the settings to your preferred ones.

#### - Unable to generate report

##### Error:

While generating the PDF with MiKTeX (2.9):

```{r, eval=F, echo=T}
ranking_bootstrapped %>%
  report(title="singleTaskChallengeExample", # used for the title of the report
         file = "filename",
         format = "PDF", # format can be "PDF", "HTML" or "Word"
         latex_engine="pdflatex", #LaTeX engine for producing PDF output. Options are "pdflatex", "lualatex", and "xelatex"
         clean=TRUE #optional. Using TRUE will clean intermediate files that are created during rendering.
        )
```

The following error showed up:

```{r, eval=F, echo=T}
output file: filename.knit.md

"C:/Program Files/RStudio/bin/pandoc/pandoc" +RTS -K512m -RTS filename.utf8.md --to latex --from markdown+autolink_bare_uris+tex_math_single_backslash --output filename.tex --self-contained --number-sections --highlight-style tango --pdf-engine pdflatex --variable graphics --lua-filter "C:/Users/adm/Documents/R/win-library/3.6/rmarkdown/rmd/lua/pagebreak.lua" --lua-filter "C:/Users/adm/Documents/R/win-library/3.6/rmarkdown/rmd/lua/latex-div.lua" --variable "geometry:margin=1in"
Error: LaTeX failed to compile filename.tex. See https://yihui.org/tinytex/r/#debugging for debugging tips.
Warning message:
In system2(..., stdout = if (use_file_stdout()) f1 else FALSE, stderr = f2) :
  '"pdflatex"' not found
```

##### Solution:

The solution was to restart RStudio.

# Changes

#### Version 0.3.3
- Force line breaks to prevent authors from exceeding the page in generated PDF reports

#### Version 0.3.2
- Correct names of authors

#### Version 0.3.1
- Refactoring

#### Version 0.3.0
- Major bug fix release

#### Version 0.2.5
- Bug fixes

#### Version 0.2.4
- Automatic insertion of missings

#### Version 0.2.3
- Bug fixes
- Reports for subsets (top list) of algorithms: Use e.g. `subset(ranking_bootstrapped, top=3) %>% report(...)` (or `subset(ranking, top=3) %>% report(...)` for a report without bootstrap results) to only show the top 3 algorithms according to the chosen ranking methods, where `ranking_bootstrapped` and `ranking` are objects as defined in the example. The line plot for ranking robustness can be used to check whether algorithms that perform well under other ranking methods are excluded. Bootstrapping still takes the entire uncertainty into account. Podium plots and ranking heatmaps neglect excluded algorithms. Only available for single task challenges (not sensible for multi task challenges because each task would contain a different set of algorithms).
- Reports for subsets of tasks: Use e.g. `subset(ranking_bootstrapped, tasks=c("task1", "task2", "task3")) %>% report(...)` to restrict the report to tasks "task1", "task2" and "task3". You may want to recompute the consensus ranking beforehand using `meanRanks=subset(ranking, tasks=c("task1", "task2", "task3"))%>%consensus(method = "euclidean")`

#### Version 0.2.1
- Introduction in reports now mentions e.g. the ranking method, number of test cases, ...
- Function `subset()` allows selection of tasks after bootstrapping, e.g. `subset(ranking_bootstrapped,1:3)`
- `report()` functions gain argument `colors` (default: `default_colors`). Change e.g. to `colors=viridisLite::inferno` which "is designed in such a way that it will analytically be perfectly perceptually-uniform, both in regular form and also when converted to black-and-white. It is also designed to be perceived by readers with the most common form of color blindness." See package `viridis` for further similar functions.

#### Version 0.2.0
- Improved layout in case of many algorithms and tasks (while probably still not perfect)
- Consistent coloring of algorithms across figures
- The `report()` function can be applied to the ranked object before bootstrapping (and thus excluding figures based on bootstrapping), i.e.
in the example `ranking %>% report(...)` - bug fixes # Team The developer team includes members from both Computer Assisted Medical Interventions (CAMI) and Biostatistics Division from the German Cancer Research Center (DKFZ): - Manuel Wiesenfarth - Annette Kopp-Schneider - Annika Reinke - Matthias Eisenmann - Laura Aguilera Saiz - Lena Maier-Hein # Reference Wiesenfarth, M., Reinke, A., Landman, B.A., Cardoso, M.J., Maier-Hein, L. and Kopp-Schneider, A. (2019). Methods and open-source toolkit for analyzing and visualizing challenge results. *arXiv preprint arXiv:1910.05121* diff --git a/inst/appdir/characterizationOfAlgorithmsBootstrapping.Rmd b/inst/appdir/characterizationOfAlgorithmsBootstrapping.Rmd index 9ba7e57..1b6ac2c 100644 --- a/inst/appdir/characterizationOfAlgorithmsBootstrapping.Rmd +++ b/inst/appdir/characterizationOfAlgorithmsBootstrapping.Rmd @@ -1,69 +1,69 @@ ### Ranking stability: Ranking variability via bootstrap approach Blob plot of bootstrap results over the different tasks separated by algorithm allows another perspective on the assessment data. This gives deeper insights into the characteristics of tasks and the ranking uncertainty of the algorithms in each task. \bigskip ```{r blobplot_bootstrap_byAlgorithm,fig.width=7,fig.height = 5} #stabilityByAlgorithm.bootstrap.list if (length(boot_object$matlist)<=6 &nrow((boot_object$matlist[[1]]))<=10 ){ stabilityByAlgorithm(boot_object, ordering=ordering_consensus, max_size = 9, size=4, shape=4, single = F) + scale_color_manual(values=cols) } else { pl=stabilityByAlgorithm(boot_object, ordering=ordering_consensus, max_size = 9, size=4, shape=4, single = T) for (i in 1:length(pl)) print(pl[[i]] + scale_color_manual(values=cols) + guides(size = guide_legend(title="%"),color="none") ) } ``` An alternative representation is provided by a stacked frequency plot of the observed ranks, separated by algorithm. Observed ranks across bootstrap samples are -displayed with colouring according to task. For algorithms that +displayed with coloring according to task. For algorithms that achieve the same rank in different tasks for the full assessment data set, vertical lines are on top of each other. Vertical lines allow to compare the achieved rank of each algorithm over different tasks. \bigskip ```{r stackedFrequencies_bootstrap_byAlgorithm,fig.width=7,fig.height = 5} #stabilityByAlgorithm.bootstrap.list if (length(boot_object$matlist)<=6 &nrow((boot_object$matlist[[1]]))<=10 ){ stabilityByAlgorithm(boot_object, ordering=ordering_consensus, stacked = TRUE, single = F) } else { pl=stabilityByAlgorithm(boot_object, ordering=ordering_consensus, stacked = TRUE, single = T) print(pl) } ``` diff --git a/inst/appdir/characterizationOfTasksBootstrapping.Rmd b/inst/appdir/characterizationOfTasksBootstrapping.Rmd index 51f6438..93176c9 100644 --- a/inst/appdir/characterizationOfTasksBootstrapping.Rmd +++ b/inst/appdir/characterizationOfTasksBootstrapping.Rmd @@ -1,49 +1,49 @@ ### Visualizing bootstrap results To investigate which tasks separate algorithms well (i.e., lead to a stable ranking), a blob plot is recommended. Bootstrap results can be shown in a blob plot showing one plot for each task. In this view, the spread of the blobs for each algorithm can be compared across tasks. Deviations from the diagonal indicate deviations from the consensus ranking (over tasks). 
Specifically, if rank distribution of an algorithm is consistently below the diagonal, the algorithm performed better in this task than on average across tasks, while if the rank distribution of an algorithm is consistently above the diagonal, the algorithm performed worse in this task than on average across tasks. At the bottom -of each panel, ranks for each algorithm in the tasks is provided. +of each panel, ranks for each algorithm in the tasks are provided. Same as in Section \ref{blobByTask} but now ordered according to consensus. \bigskip ```{r blobplot_bootstrap_byTask,fig.width=9, fig.height=9} #stabilityByTask.bootstrap.list if (length(boot_object$matlist)<=6 &nrow((boot_object$matlist[[1]]))<=10 ){ stabilityByTask(boot_object, ordering=ordering_consensus, max_size = 9, size=4, shape=4) + scale_color_manual(values=cols) } else { pl=list() for (subt in names(boot_object$bootsrappedRanks)){ a=list(bootsrappedRanks=list(boot_object$bootsrappedRanks[[subt]]), matlist=list(boot_object$matlist[[subt]])) names(a$bootsrappedRanks)=names(a$matlist)=subt class(a)="bootstrap.list" r=boot_object$matlist[[subt]] pl[[subt]]=stabilityByTask(a, max_size = 9, ordering=ordering_consensus, size.ranks=.25*theme_get()$text$size, size=4, shape=4) + scale_color_manual(values=cols) + ggtitle(subt) } print(pl) } ``` \ No newline at end of file diff --git a/inst/appdir/overviewMultiTaskBootstrapping.Rmd b/inst/appdir/overviewMultiTaskBootstrapping.Rmd index 16b9c33..dbbf273 100644 --- a/inst/appdir/overviewMultiTaskBootstrapping.Rmd +++ b/inst/appdir/overviewMultiTaskBootstrapping.Rmd @@ -1,3 +1,3 @@ * Visualization of assessment data: Dot- and boxplots, podium plots and ranking heatmaps * Visualization of ranking stability: Blob plots, violin plots and significance maps, line plots -* Visualization of cross-task insights: Blob plots, stacked frequency plots, dendrogram, network plot +* Visualization of cross-task insights: Blob plots, stacked frequency plots, dendrograms, network plots diff --git a/inst/appdir/reportMultiple.Rmd b/inst/appdir/reportMultiple.Rmd index 6f8ccad..5f32b4b 100644 --- a/inst/appdir/reportMultiple.Rmd +++ b/inst/appdir/reportMultiple.Rmd @@ -1,409 +1,409 @@ --- params: object: NA colors: NA name: NULL consensus: NA isMultiTask: NA bootstrappingEnabled: NA fig.format: NULL dpi: NULL title: "Benchmarking report for `r params$name` " author: "created by challengeR v`r packageVersion('challengeR')`" date: "`r Sys.setlocale('LC_TIME', 'English'); format(Sys.time(), '%d %B, %Y')`" editor_options: chunk_output_type: console --- ```{r setup, include=FALSE} options(width=80) #out.format <- knitr::opts_knit$get("out.format") out.format <- knitr::opts_knit$get("rmarkdown.pandoc.to") img_template <- switch( out.format, docx = list("img-params"=list(dpi=150, fig.width=6, fig.height=6, out.width="504px", out.height="504px")), { # default list("img-params"=list( fig.width=7, fig.height = 3, dpi=300)) } ) knitr::opts_template$set( img_template ) knitr::opts_chunk$set(echo = F) # ,#fig.width=7,fig.height = 3,dpi=300, if (out.format != "docx") knitr::opts_chunk$set(fig.align = "center") if (!is.null(params$fig.format)) knitr::opts_chunk$set(dev = params$fig.format) # can be vector, e.g. 
fig.format=c('jpeg','png', 'pdf') if (!is.null(params$dpi)) knitr::opts_chunk$set(dpi = params$dpi) theme_set(theme_light()) isMultiTask = params$isMultiTask bootstrappingEnabled = params$bootstrappingEnabled ``` ```{r } object = params$object if (isMultiTask) { ordering_consensus=names(params$consensus) } else { ordering_consensus=names(sort(t(object$matlist[[1]][,"rank",drop=F])["rank",])) } color.fun=params$colors ``` ```{r } challenge_multiple=object$data ranking.fun=object$FUN cols_numbered=cols=color.fun(length(ordering_consensus)) names(cols)=ordering_consensus names(cols_numbered)= paste(1:length(cols),names(cols)) if (bootstrappingEnabled) { boot_object = params$object challenge_multiple=boot_object$data ranking.fun=boot_object$FUN object=challenge_multiple%>%ranking.fun object$FUN.list = boot_object$FUN.list object$fulldata=boot_object$fulldata # only not NULL if subset of algorithms used cols_numbered=cols=color.fun(length(ordering_consensus)) names(cols)=ordering_consensus names(cols_numbered)= paste(1:length(cols),names(cols)) } ``` This document presents a systematic report on the benchmark study "`r params$name`". Input data comprises raw metric values for all algorithms and cases. Generated plots are: ```{r, child=if (!isMultiTask && !bootstrappingEnabled) system.file("appdir", "overviewSingleTaskNoBootstrapping.Rmd", package="challengeR")} ``` ```{r, child=if (!isMultiTask && bootstrappingEnabled) system.file("appdir", "overviewSingleTaskBootstrapping.Rmd", package="challengeR")} ``` ```{r, child=if (isMultiTask && !bootstrappingEnabled) system.file("appdir", "overviewMultiTaskNoBootstrapping.Rmd", package="challengeR")} ``` ```{r, child=if (isMultiTask && bootstrappingEnabled) system.file("appdir", "overviewMultiTaskBootstrapping.Rmd", package="challengeR")} ``` Details can be found in Wiesenfarth et al. (2019). ```{r,results='asis'} if (isMultiTask) { cat("# Rankings\n") } else { cat("# Ranking") } ``` Algorithms within a task are ranked according to the following ranking scheme: ```{r,results='asis'} a=( lapply(object$FUN.list[1:2],function(x) { if (!is.character(x)) return(paste0("aggregate using function ", paste(gsub("UseMethod","", deparse(functionBody(x))), collapse=" ") )) else if (x=="rank") return(x) else return(paste0("aggregate using function ",x)) })) cat("    *",paste0(a,collapse=" then "),"*",sep="") if (is.character(object$FUN.list[[1]]) && object$FUN.list[[1]]=="significance") cat("\n\n Column 'prop_significance' is equal to the number of pairwise significant test results for a given algorithm divided by the number of algorithms.") ``` ```{r,results='asis'} if (isMultiTask) { cat("Ranking for each task:\n") for (t in 1:length(object$matlist)){ cat("\n",names(object$matlist)[t],": ") n.cases=nrow(challenge_multiple[[t]])/length(unique(challenge_multiple[[t]][[attr(challenge_multiple,"algorithm")]])) numberOfAlgorithms <- length(levels(challenge_multiple[[t]][[attr(challenge_multiple, "algorithm")]])) cat("\nThe analysis is based on", numberOfAlgorithms, "algorithms and", n.cases, "cases.", attr(object$data,"n.missing")[[t]], "missing cases have been found in the data set. ") if (nrow(attr(object$data,"missingData")[[t]])>0) { if(attr(object$data,"n.missing")[[t]]==0 ) cat("However, ") else if(attr(object$data,"n.missing")[[t]]>0 ) cat("Additionally, ") cat("performance of not all algorithms has been observed for all cases in task '", names(object$matlist)[t], "'. 
Therefore, missings have been inserted in the following cases:") print(knitr::kable(as.data.frame(attr(object$data,"missingData")[[t]]))) } if (nrow(attr(object$data,"missingData")[[t]])>0 | attr(object$data,"n.missing")[[t]]>0) { if (is.numeric(attr(object$data,"na.treat"))) cat("All missings have been replaced by values of", attr(object$data,"na.treat"),".\n") else if (is.character(attr(object$data,"na.treat")) && attr(object$data,"na.treat")=="na.rm") cat("All missings have been removed.") else if (is.function(attr(object$data,"na.treat"))) { cat("Missings have been replaced using function ") print(attr(object$data,"na.treat")) } else if (is.character(object$FUN.list[[1]]) && object$FUN.list[[1]]=="rank") cat("Missings lead to the algorithm ranked last for the missing case.") } x=object$matlist[[t]] print(knitr::kable(x[order(x$rank),])) } } else { n.cases=nrow(challenge_multiple[[1]])/length(unique(challenge_multiple[[1]][[attr(challenge_multiple,"algorithm")]])) # Is subset of algorithms used? if (!is.null(object$fulldata[[1]])) { cat("The top ", length(levels(challenge_multiple[[1]][[attr(challenge_multiple, "algorithm")]])), " out of ", length(levels(object$fulldata[[1]][[attr(challenge_multiple, "algorithm")]])), " algorithms are considered.\n") cat("\nThe analysis is based on", n.cases, "cases. ") } else { cat("\nThe analysis is based on", length(levels(challenge_multiple[[1]][[attr(challenge_multiple, "algorithm")]])), "algorithms and", n.cases, "cases. ") } cat(attr(object$data,"n.missing")[[1]], "missing cases have been found in the data set. ") if (nrow(attr(object$data,"missingData")[[1]])>0) { if(attr(object$data,"n.missing")[[1]]==0 ) cat("However, ") else if(attr(object$data,"n.missing")[[1]]>0 ) cat("Additionally, ") cat("performance of not all algorithms has been observed for all cases. Therefore, missings have been inserted in the following cases:") print(knitr::kable(as.data.frame(attr(object$data,"missingData")[[1]]))) } if (nrow(attr(object$data,"missingData")[[1]])>0 | attr(object$data,"n.missing")[[1]]>0) { if (is.numeric(attr(object$data,"na.treat"))) cat("All missings have been replaced by values of", attr(object$data,"na.treat"),".\n") else if (is.character(attr(object$data,"na.treat")) && attr(object$data,"na.treat")=="na.rm") cat("All missings have been removed.") else if (is.function(attr(object$data,"na.treat"))) { cat("Missings have been replaced using function ") print(attr(object$data,"na.treat")) } else if (is.character(object$FUN.list[[1]]) && object$FUN.list[[1]]=="rank") cat("Missings lead to the algorithm ranked last for the missing case.") } cat("\n\nRanking:") x=object$matlist[[1]] print(knitr::kable(x[order(x$rank),])) } ``` \bigskip ```{r, child=if (isMultiTask) system.file("appdir", "consensusRanking.Rmd", package="challengeR")} ``` # Visualization of raw assessment data ```{r,results='asis'} if (isMultiTask) { cat("The algorithms are ordered according to the computed ranks for each task.") } ``` ## Dot- and boxplot *Dot- and boxplots* for visualizing raw assessment data separately for each algorithm. Boxplots representing descriptive statistics over all cases (median, quartiles and outliers) are combined with horizontally jittered dots representing individual cases. \bigskip ```{r boxplots} boxplot(object, size=.8) ``` ## Podium plot *Podium plots* (see also Eugster et al., 2008) for visualizing raw assessment data. 
Upper part (spaghetti plot): Participating algorithms are color-coded, and each colored dot in the plot represents a metric value achieved with the respective algorithm. The actual metric value is encoded by the y-axis. Each podium (here: $p$=`r length(ordering_consensus)`) represents one possible rank, ordered from best (1) to last (here: `r length(ordering_consensus)`). The assignment of metric values (i.e. colored dots) to one of the podiums is based on the rank that the respective algorithm achieved on the corresponding case. Note that the plot part above each podium place is further subdivided into $p$ "columns", where each column represents one participating algorithm (here: $p=$ `r length(ordering_consensus)`). Dots corresponding to identical cases are connected by a line, leading to the shown spaghetti structure. Lower part: Bar charts represent the relative frequency for each algorithm to achieve the rank encoded by the podium place. ```{r, include=FALSE, fig.keep="none",dev=NULL} plot.new() algs=ordering_consensus l=legend("topright", paste0(1:length(algs),": ",algs), lwd = 1, cex=1.4,seg.len=1.1, title="Rank: Alg.", plot=F) w <- grconvertX(l$rect$w, to='ndc') - grconvertX(0, to='ndc') h<- grconvertY(l$rect$h, to='ndc') - grconvertY(0, to='ndc') addy=max(grconvertY(l$rect$h,"user","inches"),6) ``` ```{r podium,eval=T,fig.width=12, fig.height=addy} #c(bottom, left, top, right op<-par(pin=c(par()$pin[1],6), omd=c(0, 1-w, 0, 1), mar=c(par('mar')[1:3], 0)+c(-.5,0.5,-.5,0), cex.axis=1.5, cex.lab=1.5, cex.main=1.7) oh=grconvertY(l$rect$h,"user","lines")-grconvertY(6,"inches","lines") if (oh>0) par(oma=c(oh,0,0,0)) set.seed(38) podium(object, col=cols, lines.show = T, lines.alpha = .4, dots.cex=.9, ylab="Metric value", layout.heights=c(1,.35), legendfn = function(algs, cols) { legend(par('usr')[2], par('usr')[4], xpd=NA, paste0(1:length(algs),": ",algs), lwd = 1, col = cols, bg = NA, cex=1.4, seg.len=1.1, title="Rank: Alg.") } ) par(op) ``` ## Ranking heatmap *Ranking heatmaps* for visualizing raw assessment data. Each cell $\left( i, A_j \right)$ shows the absolute frequency of cases in which algorithm $A_j$ achieved rank $i$. \bigskip ```{r rankingHeatmap,fig.width=9, fig.height=9,out.width='70%'} rankingHeatmap(object) ``` # Visualization of ranking stability ```{r, child=if (bootstrappingEnabled) system.file("appdir", "visualizationBlobPlots.Rmd", package="challengeR")} ``` ```{r, child=if (bootstrappingEnabled) system.file("appdir", "visualizationViolinPlots.Rmd", package="challengeR")} ``` ## *Significance maps* for visualizing ranking stability based on statistical significance *Significance maps* depict incidence matrices of -pairwise significant test results for the one-sided Wilcoxon signed rank test at a 5\% significance level with adjustment for multiple testing according to Holm. Yellow shading indicates that metric values of the algorithm on the x-axis were significantly superior to those from the algorithm on the y-axis, blue color indicates no significant difference. +pairwise significant test results for the one-sided Wilcoxon signed rank test at a 5\% significance level with adjustment for multiple testing according to Holm. Yellow shading indicates that metric values from the algorithm on the x-axis were significantly superior to those from the algorithm on the y-axis, blue color indicates no significant difference. 
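For reference, the test underlying this map can be reproduced directly outside the toolkit; a minimal sketch (not evaluated here), where `task_data` is a hypothetical long-format data frame for a single task with columns `alg_name`, `case` and `value` and complete data for every algorithm:

```{r, eval=FALSE}
# Sketch: pairwise one-sided Wilcoxon signed rank tests with Holm adjustment
task_data <- task_data[order(task_data$alg_name, task_data$case), ]
vals  <- split(task_data$value, task_data$alg_name)  # one vector per algorithm, cases aligned
pairs <- combn(names(vals), 2)
p_raw <- apply(pairs, 2, function(p)
  wilcox.test(vals[[p[1]]], vals[[p[2]]],
              paired = TRUE, alternative = "greater")$p.value)
p.adjust(p_raw, method = "holm")                     # Holm-adjusted p-values per algorithm pair
```
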
\bigskip ```{r significancemap,fig.width=6, fig.height=6,out.width='200%'} significanceMap(object,alpha=0.05,p.adjust.method="holm") ``` ## Ranking robustness to ranking methods -*Line plots* for visualizing rankings robustness across different ranking methods. Each algorithm is represented by one colored line. For each ranking method encoded on the x-axis, the height of the line represents the corresponding rank. Horizontal lines indicate identical ranks for all methods. +*Line plots* for visualizing ranking robustness across different ranking methods. Each algorithm is represented by one colored line. For each ranking method encoded on the x-axis, the height of the line represents the corresponding rank. Horizontal lines indicate identical ranks for all methods. \bigskip ```{r lineplot,fig.width=8, fig.height=6,out.width='95%'} if (length(object$matlist)<=6 &nrow((object$matlist[[1]]))<=10 ){ methodsplot(challenge_multiple, ordering = ordering_consensus, na.treat=object$call[[1]][[1]]$na.treat) + scale_color_manual(values=cols) }else { x=challenge_multiple for (subt in names(challenge_multiple)){ dd=as.challenge(x[[subt]], value=attr(x,"value"), algorithm=attr(x,"algorithm") , case=attr(x,"case"), annotator = attr(x,"annotator"), by=attr(x,"by"), smallBetter = attr(x,"smallBetter"), na.treat=object$call[[1]][[1]]$na.treat ) print(methodsplot(dd, ordering = ordering_consensus) + ggtitle(subt) + scale_color_manual(values=cols) ) } } ``` ```{r, child=if (isMultiTask) system.file("appdir", "visualizationAcrossTasks.Rmd", package="challengeR")} ``` # References Wiesenfarth, M., Reinke, A., Landman, B.A., Cardoso, M.J., Maier-Hein, L. and Kopp-Schneider, A. (2019). Methods and open-source toolkit for analyzing and visualizing challenge results. *arXiv preprint arXiv:1910.05121* M. J. A. Eugster, T. Hothorn, and F. Leisch, “Exploratory and inferential analysis of benchmark experiments,” Institut fuer Statistik, Ludwig-Maximilians-Universitaet Muenchen, Germany, Technical Report 30, 2008. [Online]. Available: http://epub.ub.uni-muenchen.de/4134/. diff --git a/inst/appdir/visualizationAcrossTasks.Rmd b/inst/appdir/visualizationAcrossTasks.Rmd index f0d9fce..f37b5a1 100644 --- a/inst/appdir/visualizationAcrossTasks.Rmd +++ b/inst/appdir/visualizationAcrossTasks.Rmd @@ -1,115 +1,115 @@ # Visualization of cross-task insights The algorithms are ordered according to consensus ranking. ## Characterization of algorithms ### Ranking stability: Variability of achieved rankings across tasks Algorithms are color-coded, and the area of each blob at position $\left( A_i, \text{rank } j \right)$ is proportional to the relative frequency $A_i$ achieved rank $j$ across multiple tasks. The median rank for each algorithm is indicated by a black cross. This way, the distribution of ranks across tasks can be intuitively visualized. \bigskip ```{r blobplot_raw,fig.width=9, fig.height=9} #stability.ranked.list stability(object,ordering=ordering_consensus,max_size=9,size=8,shape=4)+ scale_color_manual(values=cols) ``` ```{r, child=if (isMultiTask && bootstrappingEnabled) system.file("appdir", "characterizationOfAlgorithmsBootstrapping.Rmd", package="challengeR")} ``` ## Characterization of tasks ```{r, child=if (isMultiTask && bootstrappingEnabled) system.file("appdir", "characterizationOfTasksBootstrapping.Rmd", package="challengeR")} ``` ### Cluster Analysis -Dendrogram from hierarchical cluster analysis} and \textit{network-type graphs} for assessing the similarity of tasks based on challenge rankings. 
+Dendrogram from hierarchical cluster analysis and \textit{network-type graphs} for assessing the similarity of tasks based on challenge rankings. A dendrogram is a visualization approach based on hierarchical clustering. It depicts clusters according to a chosen distance measure (here: Spearman's footrule) as well as a chosen agglomeration method (here: complete and average agglomeration). \bigskip ```{r dendrogram_complete, fig.width=6, fig.height=5,out.width='60%'} if (length(object$matlist)>2) { dendrogram(object, dist = "symdiff", method="complete") } else cat("\nCluster analysis only sensible if there are >2 tasks.\n\n") ``` \bigskip ```{r dendrogram_average, fig.width=6, fig.height=5,out.width='60%'} if (length(object$matlist)>2) dendrogram(object, dist = "symdiff", method="average") ``` - + diff --git a/inst/appdir/visualizationViolinPlots.Rmd b/inst/appdir/visualizationViolinPlots.Rmd index 88e075e..7061880 100644 --- a/inst/appdir/visualizationViolinPlots.Rmd +++ b/inst/appdir/visualizationViolinPlots.Rmd @@ -1,9 +1,9 @@ ## *Violin plot* for visualizing ranking stability based on bootstrapping \label{violin} -The ranking list based on the full assessment data is pairwisely compared with the ranking lists based on the individual bootstrap samples (here $b=$ `r ncol(boot_object$bootsrappedRanks[[1]])` samples). For each pair of rankings, Kendall's $\tau$ correlation is computed. Kendall’s $\tau$ is a scaled index determining the correlation between the lists. It is computed by evaluating the number of pairwise concordances and discordances between ranking lists and produces values between $-1$ (for inverted order) and $1$ (for identical order). A violin plot, which simultaneously depicts a boxplot and a density plot, is generated from the results. +The ranking list based on the full assessment data is pairwise compared with the ranking lists based on the individual bootstrap samples (here $b=$ `r ncol(boot_object$bootsrappedRanks[[1]])` samples). For each pair of rankings, Kendall's $\tau$ correlation is computed. Kendall’s $\tau$ is a scaled index determining the correlation between the lists. It is computed by evaluating the number of pairwise concordances and discordances between ranking lists and produces values between $-1$ (for inverted order) and $1$ (for identical order). A violin plot, which simultaneously depicts a boxplot and a density plot, is generated from the results. \bigskip ```{r violin, results='asis'} violin(boot_object) ```
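To illustrate the statistic underlying this plot, Kendall's $\tau$ between the full-data ranking and a single bootstrap ranking can be computed directly with base R; a minimal sketch with made-up rank vectors:

```{r, eval=FALSE}
# Made-up ranks for five algorithms: full assessment data vs. one bootstrap sample
full_ranks <- c(A1 = 1, A2 = 2, A3 = 3, A4 = 4, A5 = 5)
boot_ranks <- c(A1 = 1, A2 = 3, A3 = 2, A4 = 4, A5 = 5)
cor(full_ranks, boot_ranks, method = "kendall")  # 1 = identical order, -1 = inverted order
```
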