The DRAGON benchmark comprises 28 tasks for evaluating model performance. See the manuscript for details on every task [PENDING]. The figure and table below give an overview.

Figure: Overview of the tasks in the DRAGON benchmark.

Table: Overview of the tasks in the DRAGON benchmark.

| ID | Name | Task type | Metric | Sample report |
|---|---|---|---|---|
| T1 | Adhesion presence | SL Bin CLF | AUC | Report |
| T2 | Pulmonary nodule presence | SL Bin CLF | AUC | Report |
| T3 | Kidney abnormality identification | SL Bin CLF | AUC | Report |
| T4 | Skin histopathology case selection | SL Bin CLF | AUC | Report |
| T5 | RECIST timeline | SL Bin CLF | AUC | Report |
| T6 | Histopathology cancer origin | SL Bin CLF | AUC | Report |
| T7 | Pulmonary nodule size presence | SL Bin CLF | AUC | Report |
| T8 | Pancreatic ductal adenocarcinoma size presence | SL Bin CLF | AUC | Report |
| T9 | Pancreatic ductal adenocarcinoma diagnosis | SL MC CLF | Unweighted Kappa | Report |
| T10 | Prostate radiology suspicious lesions | SL MC CLF | Linearly Weighted Kappa | Report |
| T11 | Prostate histopathology significant cancers | SL MC CLF | Linearly Weighted Kappa | Report |
| T12 | Histopathology tissue type | SL MC CLF | Unweighted Kappa | Report |
| T13 | Histopathology tissue origin | SL MC CLF | Unweighted Kappa | Report |
| T14 | Entailment diagnostic sentences | SL MC CLF | Linearly Weighted Kappa | Report |
| T15 | Colon histopathology diagnosis | ML Bin CLF | Macro AUC | Report |
| T16 | RECIST lesion size presence | ML Bin CLF | AUC | Report |
| T17 | Pancreatic ductal adenocarcinoma attributes | ML MC CLF | Unweighted Kappa | Report |
| T18 | Hip Kellgren-Lawrence scoring | ML MC CLF | Unweighted Kappa | Report |
| T19 | Prostate volume extraction | SL Reg | RSMAPE (ε = 4 cm³) | Report |
| T20 | Prostate specific antigen extraction | SL Reg | RSMAPE (ε = 0.4 ng/mL) | Report |
| T21 | Prostate specific antigen density extraction | SL Reg | RSMAPE (ε = 0.04 ng/mL²) | Report |
| T22 | Pancreatic ductal adenocarcinoma size measurement | SL Reg | RSMAPE (ε = 4 mm) | Report |
| T23 | Pulmonary nodule size measurement | SL Reg | RSMAPE (ε = 4 mm) | Report |
| T24 | RECIST lesion size measurement | ML Reg | RSMAPE (ε = 4 mm) | Report |
| T25 | Anonymization | SL NER | Macro F1 | Report |
| T26 | Medical terminology recognition | SL NER | F1 | Report |
| T27 | Prostate biopsy sampling | ML NER | Weighted F1 | Report |
| T28 | Skin histopathology diagnosis | ML NER | Weighted F1 | Report |

SL = single-label, ML = multi-label, Bin = binary, MC = multi-class, CLF = classification, Reg = regression, NER = named entity recognition, RSMAPE = Robust Symmetric Mean Absolute Percentage Error, RECIST = Response Evaluation Criteria in Solid Tumors.
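To illustrate the role of the ε values listed for the regression tasks, here is a minimal Python sketch of an RSMAPE-style error. The formula is an assumption based on a common formulation of a robust symmetric percentage error (absolute error divided by the sum of the absolute target, the absolute prediction, and ε); the manuscript remains the reference for the exact definition the benchmark uses. The function name `rsmape` and its signature are hypothetical and are not taken from the benchmark's code.

```python
def rsmape(y_true, y_pred, epsilon):
    """RSMAPE-style error in [0, 1); lower is better.

    Assumed form (not confirmed by the benchmark):
        mean(|y - y_hat| / (|y| + |y_hat| + epsilon))
    epsilon keeps the ratio stable when both values are near zero.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")

    total = 0.0
    for target, prediction in zip(y_true, y_pred):
        denominator = abs(target) + abs(prediction) + epsilon
        total += abs(target - prediction) / denominator
    return total / len(y_true)


# Example with the epsilon listed for the size measurement tasks (4 mm).
sizes_true = [40.0, 60.0]
sizes_pred = [44.0, 60.0]
error = rsmape(sizes_true, sizes_pred, epsilon=4.0)
```

Under this assumed form, ε is expressed in the same unit as the target, so a larger ε makes small absolute errors on small values count less.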