diff --git a/inst/appdir/characterizationOfAlgorithmsBootstrapping.Rmd b/inst/appdir/characterizationOfAlgorithmsBootstrapping.Rmd index 75b0cb6..73e8a3e 100644 --- a/inst/appdir/characterizationOfAlgorithmsBootstrapping.Rmd +++ b/inst/appdir/characterizationOfAlgorithmsBootstrapping.Rmd @@ -1,69 +1,69 @@ ### Ranking stability: Ranking variability via bootstrap approach A blob plot of bootstrap results over the different tasks separated by algorithm allows another perspective on the assessment data. This gives deeper insights into the characteristics of tasks and the ranking uncertainty of the algorithms in each task. \bigskip ```{r blobplot_bootstrap_byAlgorithm,fig.width=7,fig.height = 5} #stabilityByAlgorithm.bootstrap.list -if (length(boot_object$matlist)<=6 &nrow((boot_object$matlist[[1]]))<=10 ){ +if (n.tasks<=6 & n.algorithms<=10 ){ stabilityByAlgorithm(boot_object, ordering=ordering_consensus, max_size = 9, size=4, shape=4, single = F) + scale_color_manual(values=cols) } else { pl=stabilityByAlgorithm(boot_object, ordering=ordering_consensus, max_size = 9, size=4, shape=4, single = T) for (i in 1:length(pl)) print(pl[[i]] + scale_color_manual(values=cols) + guides(size = guide_legend(title="%"),color="none") ) } ``` \newpage An alternative representation is provided by a stacked frequency plot of the observed ranks, separated by algorithm. Observed ranks across bootstrap samples are displayed with coloring according to the task. For algorithms that achieve the same rank in different tasks for the full assessment data set, vertical lines are on top of each other. Vertical lines allow to compare the achieved rank of each algorithm over different tasks. \bigskip ```{r stackedFrequencies_bootstrap_byAlgorithm,fig.width=7,fig.height = 5} -if (length(boot_object$matlist)<=6 &nrow((boot_object$matlist[[1]]))<=10 ){ +if (n.tasks<=6 & n.algorithms<=10 ){ stabilityByAlgorithm(boot_object, ordering=ordering_consensus, stacked = TRUE, single = F) } else { pl=stabilityByAlgorithm(boot_object, ordering=ordering_consensus, stacked = TRUE, single = T) print(pl) } ``` diff --git a/inst/appdir/characterizationOfTasksBootstrapping.Rmd b/inst/appdir/characterizationOfTasksBootstrapping.Rmd index 6177954..b63fa87 100644 --- a/inst/appdir/characterizationOfTasksBootstrapping.Rmd +++ b/inst/appdir/characterizationOfTasksBootstrapping.Rmd @@ -1,49 +1,49 @@ ### Visualizing bootstrap results To investigate which tasks separate algorithms well (i.e., lead to a stable ranking), a blob plot is recommended. Bootstrap results can be shown in a blob plot showing one plot for each task. In this view, the spread of the blobs for each algorithm can be compared across tasks. Deviations from the diagonal indicate deviations from the consensus ranking (over tasks). Specifically, if rank distribution of an algorithm is consistently below the diagonal, the algorithm performed better in this task than on average across tasks, while if the rank distribution of an algorithm is consistently above the diagonal, the algorithm performed worse in this task than on average across tasks. At the bottom of each panel, ranks for each algorithm in the tasks are provided. Same as in Section \ref{blobByTask} but now ordered according to consensus. \bigskip ```{r blobplot_bootstrap_byTask,fig.width=9, fig.height=9, results='hide'} #stabilityByTask.bootstrap.list -if (length(boot_object$matlist)<=6 &nrow((boot_object$matlist[[1]]))<=10 ){ +if (n.tasks<=6 & n.algorithms<=10 ){ stabilityByTask(boot_object, ordering=ordering_consensus, max_size = 9, size=4, shape=4) + scale_color_manual(values=cols) } else { pl=list() for (subt in names(boot_object$bootsrappedRanks)){ a=list(bootsrappedRanks=list(boot_object$bootsrappedRanks[[subt]]), matlist=list(boot_object$matlist[[subt]])) names(a$bootsrappedRanks)=names(a$matlist)=subt class(a)="bootstrap.list" r=boot_object$matlist[[subt]] pl[[subt]]=stabilityByTask(a, max_size = 9, ordering=ordering_consensus, size.ranks=.25*theme_get()$text$size, size=4, - shape=4) + scale_color_manual(values=cols) + ggtitle(subt) + shape=4) + scale_color_manual(values=cols) + ggtitle(subt)+ theme(legend.position = "bottom") } print(pl) } ``` \ No newline at end of file diff --git a/inst/appdir/report.Rmd b/inst/appdir/report.Rmd index 08cf868..b35121f 100644 --- a/inst/appdir/report.Rmd +++ b/inst/appdir/report.Rmd @@ -1,404 +1,406 @@ --- params: object: NA colors: NA name: NULL consensus: NA isMultiTask: NA bootstrappingEnabled: NA fig.format: NULL dpi: NULL title: "Benchmarking report for `r params$name` " author: "created by challengeR v`r packageVersion('challengeR')`" date: "`r Sys.setlocale('LC_TIME', 'English'); format(Sys.time(), '%d %B, %Y')`" editor_options: chunk_output_type: console --- ```{r setup, include=FALSE} options(width=80) #out.format <- knitr::opts_knit$get("out.format") out.format <- knitr::opts_knit$get("rmarkdown.pandoc.to") img_template <- switch( out.format, docx = list("img-params"=list(dpi=150, fig.width=6, fig.height=6, out.width="504px", out.height="504px")), { # default list("img-params"=list( fig.width=7, fig.height = 3, dpi=300)) } ) knitr::opts_template$set( img_template ) knitr::opts_chunk$set(echo = F) # ,#fig.width=7,fig.height = 3,dpi=300, if (out.format != "docx") knitr::opts_chunk$set(fig.align = "center") if (!is.null(params$fig.format)) knitr::opts_chunk$set(dev = params$fig.format) # can be vector, e.g. fig.format=c('jpeg','png', 'pdf') if (!is.null(params$dpi)) knitr::opts_chunk$set(dpi = params$dpi) theme_set(theme_light()) isMultiTask = params$isMultiTask bootstrappingEnabled = params$bootstrappingEnabled ``` ```{r } object = params$object if (isMultiTask) { ordering_consensus=names(params$consensus) } else { ordering_consensus=names(sort(t(object$matlist[[1]][,"rank",drop=F])["rank",])) } color.fun=params$colors ``` ```{r } challenge_multiple=object$data ranking.fun=object$FUN cols_numbered=cols=color.fun(length(ordering_consensus)) names(cols)=ordering_consensus names(cols_numbered)= paste(1:length(cols),names(cols)) if (bootstrappingEnabled) { boot_object = params$object challenge_multiple=boot_object$data ranking.fun=boot_object$FUN object=challenge_multiple%>%ranking.fun object$FUN.list = boot_object$FUN.list object$fulldata=boot_object$fulldata # only not NULL if subset of algorithms used } - +n.tasks <- length(object$matlist) +n.algorithms <- nrow((object$matlist[[1]])) ``` This document presents a systematic report on the benchmark study "`r params$name`". Input data comprises raw metric values for all algorithms and cases. Generated plots are: ```{r, child=if (!isMultiTask && !bootstrappingEnabled) system.file("appdir", "overviewSingleTaskNoBootstrapping.Rmd", package="challengeR")} ``` ```{r, child=if (!isMultiTask && bootstrappingEnabled) system.file("appdir", "overviewSingleTaskBootstrapping.Rmd", package="challengeR")} ``` ```{r, child=if (isMultiTask && !bootstrappingEnabled) system.file("appdir", "overviewMultiTaskNoBootstrapping.Rmd", package="challengeR")} ``` ```{r, child=if (isMultiTask && bootstrappingEnabled) system.file("appdir", "overviewMultiTaskBootstrapping.Rmd", package="challengeR")} ``` Details can be found in Wiesenfarth et al. (2019). ```{r,results='asis'} if (isMultiTask) { cat("# Rankings\n") } else { cat("# Ranking") } ``` Algorithms within a task are ranked according to the following ranking scheme: ```{r,results='asis'} a=( lapply(object$FUN.list[1:2],function(x) { if (class(x)== "standardGeneric") return(paste0("aggregate using function ", x@generic )) else if (!is.character(x)) return(paste0("aggregate using function ", paste(gsub("UseMethod","", deparse(functionBody(x))), collapse=" ") )) else if (x=="rank") return(x) else return(paste0("aggregate using function ",x)) })) cat("    *",paste0(a,collapse=" then "),"*",sep="") if (is.character(object$FUN.list[[1]]) && object$FUN.list[[1]]=="significance") cat("\n\n Column 'prop_significance' is equal to the number of pairwise significant test results for a given algorithm divided by the number of algorithms.") ``` ```{r,results='asis'} if (isMultiTask) { cat("Ranking for each task:\n") - for (t in 1:length(object$matlist)){ + for (t in 1:n.tasks){ cat("\n",names(object$matlist)[t],": ") n.cases=nrow(challenge_multiple[[t]])/length(unique(challenge_multiple[[t]][[attr(challenge_multiple,"algorithm")]])) numberOfAlgorithms <- length(levels(challenge_multiple[[t]][[attr(challenge_multiple, "algorithm")]])) cat("\nThe analysis is based on", numberOfAlgorithms, "algorithms and", n.cases, "cases.", attr(object$data,"n.missing")[[t]], "missing cases have been found in the data set. ") if (nrow(attr(object$data,"missingData")[[t]])>0) { if(attr(object$data,"n.missing")[[t]]==0 ) cat("However, ") else if(attr(object$data,"n.missing")[[t]]>0 ) cat("Additionally, ") cat("performance of not all algorithms has been observed for all cases in task '", names(object$matlist)[t], "'. Therefore, missings have been inserted in the following cases:") print(knitr::kable(as.data.frame(attr(object$data,"missingData")[[t]]))) } if (nrow(attr(object$data,"missingData")[[t]])>0 | attr(object$data,"n.missing")[[t]]>0) { if (is.numeric(attr(object$data,"na.treat"))) cat("All missings have been replaced by values of", attr(object$data,"na.treat"),".\n") else if (is.character(attr(object$data,"na.treat")) && attr(object$data,"na.treat")=="na.rm") cat("All missings have been removed.") else if (is.function(attr(object$data,"na.treat"))) { cat("Missings have been replaced using function ") print(attr(object$data,"na.treat")) } else if (is.character(object$FUN.list[[1]]) && object$FUN.list[[1]]=="rank") cat("Missings lead to the algorithm ranked last for the missing case.") } x=object$matlist[[t]] print(knitr::kable(x[order(x$rank),])) } } else { n.cases=nrow(challenge_multiple[[1]])/length(unique(challenge_multiple[[1]][[attr(challenge_multiple,"algorithm")]])) # Is subset of algorithms used? if (!is.null(object$fulldata[[1]])) { cat("The top ", length(levels(challenge_multiple[[1]][[attr(challenge_multiple, "algorithm")]])), " out of ", length(levels(object$fulldata[[1]][[attr(challenge_multiple, "algorithm")]])), " algorithms are considered.\n") cat("\nThe analysis is based on", n.cases, "cases. ") } else { cat("\nThe analysis is based on", length(levels(challenge_multiple[[1]][[attr(challenge_multiple, "algorithm")]])), "algorithms and", n.cases, "cases. ") } cat(attr(object$data,"n.missing")[[1]], "missing cases have been found in the data set. ") if (nrow(attr(object$data,"missingData")[[1]])>0) { if(attr(object$data,"n.missing")[[1]]==0 ) cat("However, ") else if(attr(object$data,"n.missing")[[1]]>0 ) cat("Additionally, ") cat("performance of not all algorithms has been observed for all cases. Therefore, missings have been inserted in the following cases:") print(knitr::kable(as.data.frame(attr(object$data,"missingData")[[1]]))) } if (nrow(attr(object$data,"missingData")[[1]])>0 | attr(object$data,"n.missing")[[1]]>0) { if (is.numeric(attr(object$data,"na.treat"))) cat("All missings have been replaced by values of", attr(object$data,"na.treat"),".\n") else if (is.character(attr(object$data,"na.treat")) && attr(object$data,"na.treat")=="na.rm") cat("All missings have been removed.") else if (is.function(attr(object$data,"na.treat"))) { cat("Missings have been replaced using function ") print(attr(object$data,"na.treat")) } else if (is.character(object$FUN.list[[1]]) && object$FUN.list[[1]]=="rank") cat("Missings lead to the algorithm ranked last for the missing case.") } cat("\n\nRanking:") x=object$matlist[[1]] print(knitr::kable(x[order(x$rank),])) } ``` \bigskip ```{r, child=if (isMultiTask) system.file("appdir", "consensusRanking.Rmd", package="challengeR")} ``` \newpage # Visualization of raw assessment data ```{r,results='asis'} if (isMultiTask) { cat("The algorithms are ordered according to the computed ranks for each task.") } ``` ## Dot- and boxplot *Dot- and boxplots* for visualizing raw assessment data separately for each algorithm. Boxplots representing descriptive statistics over all cases (median, quartiles and outliers) are combined with horizontally jittered dots representing individual cases. \bigskip ```{r boxplots} boxplot(object, size=.8) ``` \newpage ## Podium plot *Podium plots* (see also Eugster et al., 2008) for visualizing raw assessment data. Upper part (spaghetti plot): Participating algorithms are color-coded, and each colored dot in the plot represents a metric value achieved with the respective algorithm. The actual metric value is encoded by the y-axis. Each podium (here: $p$=`r length(ordering_consensus)`) represents one possible rank, ordered from best (1) to last (here: `r length(ordering_consensus)`). The assignment of metric values (i.e. colored dots) to one of the podiums is based on the rank that the respective algorithm achieved on the corresponding case. Note that the plot part above each podium place is further subdivided into $p$ "columns", where each column represents one participating algorithm (here: $p=$ `r length(ordering_consensus)`). Dots corresponding to identical cases are connected by a line, leading to the shown spaghetti structure. Lower part: Bar charts represent the relative frequency for each algorithm to achieve the rank encoded by the podium place. ```{r, include=FALSE, fig.keep="none",dev=NULL} plot.new() algs=ordering_consensus l=legend("topright", paste0(1:length(algs),": ",algs), lwd = 1, cex=1.4,seg.len=1.1, title="Rank: Alg.", plot=F) w <- grconvertX(l$rect$w, to='ndc') - grconvertX(0, to='ndc') h<- grconvertY(l$rect$h, to='ndc') - grconvertY(0, to='ndc') addy=max(grconvertY(l$rect$h,"user","inches"),6) ``` ```{r podium,eval=T,fig.width=12, fig.height=addy} #c(bottom, left, top, right op<-par(pin=c(par()$pin[1],6), omd=c(0, 1-w, 0, 1), mar=c(par('mar')[1:3], 0)+c(-.5,0.5,-.5,0), cex.axis=1.5, cex.lab=1.5, cex.main=1.7) oh=grconvertY(l$rect$h,"user","lines")-grconvertY(6,"inches","lines") if (oh>0) par(oma=c(oh,0,0,0)) set.seed(38) podium(object, col=cols, lines.show = T, lines.alpha = .4, dots.cex=.9, ylab="Metric value", layout.heights=c(1,.35), legendfn = function(algs, cols) { legend(par('usr')[2], par('usr')[4], xpd=NA, paste0(1:length(algs),": ",algs), lwd = 1, col = cols, bg = NA, cex=1.4, seg.len=1.1, title="Rank: Alg.") } ) par(op) ``` \newpage ## Ranking heatmap *Ranking heatmaps* for visualizing raw assessment data. Each cell $\left( i, A_j \right)$ shows the absolute frequency of cases in which algorithm $A_j$ achieved rank $i$. \bigskip ```{r rankingHeatmap,fig.width=9, fig.height=9,out.width='70%'} rankingHeatmap(object) ``` \newpage # Visualization of ranking stability ```{r, child=if (bootstrappingEnabled) system.file("appdir", "visualizationBlobPlots.Rmd", package="challengeR")} ``` ```{r, child=if (bootstrappingEnabled) system.file("appdir", "visualizationViolinPlots.Rmd", package="challengeR")} ``` \newpage ## *Significance maps* for visualizing ranking stability based on statistical significance *Significance maps* depict incidence matrices of pairwise significant test results for the one-sided Wilcoxon signed rank test at a 5\% significance level with adjustment for multiple testing according to Holm. Yellow shading indicates that metric values from the algorithm on the x-axis were significantly superior to those from the algorithm on the y-axis, blue color indicates no significant difference. \bigskip ```{r significancemap,fig.width=6, fig.height=6,out.width='200%'} significanceMap(object,alpha=0.05,p.adjust.method="holm") ``` \newpage ## Ranking robustness to ranking methods *Line plots* for visualizing ranking robustness across different ranking methods. Each algorithm is represented by one colored line. For each ranking method encoded on the x-axis, the height of the line represents the corresponding rank. Horizontal lines indicate identical ranks for all methods. \bigskip ```{r lineplot,fig.width=8, fig.height=6,out.width='95%'} -if (length(object$matlist)<=6 &nrow((object$matlist[[1]]))<=10 ){ + +if (n.tasks<=6 &n.algorithms<=10 ){ methodsplot(challenge_multiple, ordering = ordering_consensus, na.treat=object$call[[1]][[1]]$na.treat) + scale_color_manual(values=cols) }else { x=challenge_multiple for (subt in names(challenge_multiple)){ dd=as.challenge(x[[subt]], value=attr(x,"value"), algorithm=attr(x,"algorithm") , case=attr(x,"case"), annotator = attr(x,"annotator"), by=attr(x,"by"), smallBetter = attr(x,"smallBetter"), na.treat=object$call[[1]][[1]]$na.treat ) print(methodsplot(dd, ordering = ordering_consensus) + ggtitle(subt) + scale_color_manual(values=cols) ) } } ``` ```{r, child=if (isMultiTask) system.file("appdir", "visualizationAcrossTasks.Rmd", package="challengeR")} ``` \newpage # References Wiesenfarth, M., Reinke, A., Landman, B.A., Cardoso, M.J., Maier-Hein, L. and Kopp-Schneider, A. (2019). Methods and open-source toolkit for analyzing and visualizing challenge results. *arXiv preprint arXiv:1910.05121* M. J. A. Eugster, T. Hothorn, and F. Leisch, “Exploratory and inferential analysis of benchmark experiments,” Institut fuer Statistik, Ludwig-Maximilians-Universitaet Muenchen, Germany, Technical Report 30, 2008. [Online]. Available: http://epub.ub.uni-muenchen.de/4134/. diff --git a/inst/appdir/visualizationAcrossTasks.Rmd b/inst/appdir/visualizationAcrossTasks.Rmd index 5694ba3..4f0e52d 100644 --- a/inst/appdir/visualizationAcrossTasks.Rmd +++ b/inst/appdir/visualizationAcrossTasks.Rmd @@ -1,118 +1,118 @@ \newpage # Visualization of cross-task insights The algorithms are ordered according to consensus ranking. ## Characterization of algorithms ### Ranking stability: Variability of achieved rankings across tasks Algorithms are color-coded, and the area of each blob at position $\left( A_i, \text{rank } j \right)$ is proportional to the relative frequency $A_i$ achieved rank $j$ across multiple tasks. The median rank for each algorithm is indicated by a black cross. This way, the distribution of ranks across tasks can be intuitively visualized. \bigskip ```{r blobplot_raw,fig.width=9, fig.height=9} #stability.ranked.list stability(object,ordering=ordering_consensus,max_size=9,size=8,shape=4)+ scale_color_manual(values=cols) ``` ```{r, child=if (isMultiTask && bootstrappingEnabled) system.file("appdir", "characterizationOfAlgorithmsBootstrapping.Rmd", package="challengeR")} ``` \newpage ## Characterization of tasks ```{r, child=if (isMultiTask && bootstrappingEnabled) system.file("appdir", "characterizationOfTasksBootstrapping.Rmd", package="challengeR")} ``` ### Cluster Analysis Dendrogram from hierarchical cluster analysis and \textit{network-type graphs} for assessing the similarity of tasks based on challenge rankings. A dendrogram is a visualization approach based on hierarchical clustering. It depicts clusters according to a chosen distance measure (here: Spearman's footrule) as well as a chosen agglomeration method (here: complete and average agglomeration). \bigskip ```{r dendrogram_complete, fig.width=6, fig.height=5,out.width='60%'} -if (length(object$matlist)>2) { +if (n.tasks>2) { dendrogram(object, dist = "symdiff", method="complete") } else cat("\nCluster analysis only sensible if there are >2 tasks.\n\n") ``` \bigskip ```{r dendrogram_average, fig.width=6, fig.height=5,out.width='60%'} -if (length(object$matlist)>2) +if (n.tasks>2) dendrogram(object, dist = "symdiff", method="average") ``` - + - +