diff --git a/vignettes/visualizations.Rmd b/vignettes/visualizations.Rmd
index f916ef7..7c0a7f4 100644
--- a/vignettes/visualizations.Rmd
+++ b/vignettes/visualizations.Rmd
@@ -1,115 +1,130 @@
---
title: "Visualizations"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{visualizations}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(challengeR)
```

## Introduction

-The package enables the user to generate a benchmarking report that contains visualizations and respective explanations.
-An overview of all available visualization is provided on this page along with their configurations. You can also look up the names of the functions that generate the visualizations (e.g. if you are interested in generating the plots on your own to apply other styles).
+The package offers an intuitive way to gain important insights into the relative and absolute performance of algorithms. It enables you to generate a benchmarking report that contains visualizations and respective explanations. An overview of all available visualizations is provided on this page, demonstrating the use of their corresponding plot functions. This might be of interest if you want to generate the plots separately (e.g. to apply other styles).
+
## Visualizing assessment data

```{r}
data <- read.csv(system.file("extdata", "data_matrix.csv", package = "challengeR", mustWork = TRUE))
challenge <- as.challenge(data, by = "task", algorithm = "alg_name", case = "case", value = "value", smallBetter = FALSE)
ranking <- challenge %>% aggregateThenRank(FUN = mean, ties.method = "min")
```

### Dot- and boxplots

Dot- and boxplots visualize the assessment data separately for each algorithm. Boxplots representing descriptive statistics for all test cases (median, quartiles and outliers) are combined with horizontally jittered dots representing individual test cases.

```{r}
boxplot(ranking)
```

### Podium plots

Upper part of the podium plot: Algorithms are color-coded, and each colored dot in the plot represents a performance value achieved with the respective algorithm. The actual value is encoded by the y-axis. Each podium (here: $p = 5$) represents one possible rank, ordered from best (1) to worst (here: 5). The assignment of values (i.e. colored dots) to one of the podiums is based on the rank that the respective algorithm achieved on the corresponding test case. Note that the plot part above each podium place is further subdivided into $p$ “columns”, where each column represents one algorithm. Dots corresponding to identical test cases are connected by a line, producing the spaghetti structure shown here. Lower part: Bar charts represent the relative frequency at which each algorithm actually achieves the rank encoded by the podium place.

```{r}
# The podium plot is not available as an encapsulated function yet.
# podium(challenge, xlab = "Podium", ylab = "Metric value")
```

### Ranking heatmaps

In a ranking heatmap, each cell $\left( i, A_j \right)$ shows the absolute frequency of cases in which algorithm $A_j$ achieved rank $i$.

```{r}
rankingHeatmap(ranking)
```

## Visualizing ranking stability

The robustness of the ranking can be analyzed with respect to the ranking method used (see the [paper](https://arxiv.org/abs/1910.05121) for different ranking methods).

### Line plots

Line plots visualize the robustness of the ranking across different ranking methods. Each algorithm is represented by one colored line.
For each ranking method encoded on the x-axis, the height of the line represents the corresponding rank. Horizontal lines indicate identical ranks for all methods.

```{r, fig.width = 7}
methodsplot(challenge)
```

For a specific ranking method, the ranking stability can be investigated via bootstrapping and via a testing approach. A ranking object containing the bootstrap samples has to be created, which then serves as the basis for the plots.

```{r, results = "hide"}
set.seed(1)
rankingBootstrapped <- ranking %>% bootstrap(nboot = 1000)
```

### Blob plots

Blob plots for visualizing ranking stability are based on bootstrap sampling. Algorithms are color-coded, and the area of each blob at position $\left( A_i, \text{rank } j \right)$ is proportional to the relative frequency with which $A_i$ achieved rank $j$ (here across $b = 1000$ bootstrap samples). The median rank for each algorithm is indicated by a black cross. 95% bootstrap intervals across bootstrap samples (ranging from the 2.5th to the 97.5th percentile of the bootstrap distribution) are indicated by black lines.

```{r, fig.width = 7}
stabilityByTask(rankingBootstrapped)
```

### Violin plots

Violin plots provide a more condensed way to analyze bootstrap results. In these plots, the focus is on comparing the ranking list computed on the full assessment data with the ranking lists computed on the individual bootstrap samples. Kendall’s $\tau$ is chosen for this comparison as it has an upper and a lower bound (+1/-1). Kendall’s $\tau$ is computed for each pair of rankings, and a violin plot that simultaneously depicts a boxplot and a density plot is generated from the results.

```{r, results = "hide"}
violin(rankingBootstrapped)
```

### Significance maps

Significance maps visualize ranking stability based on statistical significance. They depict incidence matrices of pairwise significant test results for the one-sided Wilcoxon signed rank test at a 5% significance level, with adjustment for multiple testing according to Holm. Yellow shading indicates that the performance values of the algorithm on the x-axis are significantly superior to those of the algorithm on the y-axis, while blue shading indicates no significant difference.

```{r}
significanceMap(ranking)
```

## Visualizing cross-task insights

For cross-task insights, a consensus ranking (rank aggregation across tasks) additionally has to be provided. Here, the consensus ranking according to mean ranks across tasks is computed.

```{r}
meanRanks <- ranking %>% consensus(method = "euclidean")
```

### Blob plots

Blob plots visualize the distribution of ranks across tasks. All ranks that an algorithm achieved in any task are displayed along the y-axis, with the area of the blob being proportional to the frequency. If all tasks provided the same stable ranking, narrow intervals around the diagonal would be expected. Consensus rankings above the algorithm names highlight the presence of ties.

-```{r}
-# disabled until error is fixed: T27943
-# stability(ranking, ordering = meanRanks)
-```
+```{r, fig.width = 5, fig.height = 4}
+stability(ranking)
+```
+
+#### Blob plots visualizing the ranking variability based on bootstrapping
+
+This variant of the blob plot approach involves replacing the algorithms on the x-axis with the tasks and then generating a separate plot for each algorithm. This allows assessing the variability of rankings for each algorithm across multiple tasks and bootstrap samples.
+Here, color coding is used for the tasks, and the separation by algorithm enables a relatively straightforward strengths and weaknesses analysis for individual methods.
+
+```{r, fig.width = 7, fig.height = 5}
+stabilityByAlgorithm(rankingBootstrapped)
+```
+
+#### Stacked frequency plots visualizing the ranking variability based on bootstrapping
+
+An alternative representation is provided by a stacked frequency plot of the observed ranks, separated by algorithm. Observed ranks across bootstrap samples are displayed with coloring according to the task. Vertical lines indicate the rank achieved by each algorithm on the full assessment data set for each task; for algorithms that achieve the same rank in different tasks, these lines lie on top of each other. The vertical lines thus allow comparing the ranks achieved by each algorithm across tasks.
+
+```{r, fig.width = 7, fig.height = 5}
+stabilityByAlgorithm(rankingBootstrapped, stacked = TRUE)
+```
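+
+## Generating the benchmarking report
+
+All of the visualizations above can also be bundled, together with explanatory text, into the benchmarking report mentioned in the introduction. The following sketch is not evaluated here; it assumes the `report()` interface (including the `consensus` argument for multi-task challenges) as shown in the package README, and the title and file name are placeholders to be adapted.
+
+```{r, eval = FALSE}
+# Not run: renders a report file in the working directory.
+# The call follows the usage shown in the package README; for a multi-task
+# challenge, the previously computed consensus ranking is passed along.
+# "Benchmarking report" and "visualization_report" are placeholder values.
+rankingBootstrapped %>% report(consensus = meanRanks,
+                               title = "Benchmarking report",
+                               file = "visualization_report",
+                               format = "PDF", # "PDF", "HTML" or "Word"
+                               clean = TRUE)
+```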