Relationship between performance measures: from statistical evaluation to decision analysis
Ewout Steyerberg, Dept of Public Health, Erasmus MC, Rotterdam, the Netherlands
[email protected]
Chicago, October 23, 2011

General issues
Usefulness / clinical utility: what do we mean exactly?
- Evaluation of predictions
- Evaluation of decisions
Adding a marker to a model
- Statistical significance? Testing β is enough (no need to test the increase in R2, AUC, IDI, ...)
- Clinical relevance: is measurement worth the costs (patient and physician burden, financial costs)?

Overview
- Case study: residual masses in testicular cancer
  - Model development
  - Evaluation approach
- Performance evaluation
  - Statistical: overall; calibration and discrimination
  - Decision-analytic: utility-weighted measures
www.clinicalpredictionmodels.org

Prediction approach
- Outcome: malignant or benign tissue
- Predictors: primary histology, 3 tumor markers, tumor size (postchemotherapy, and reduction)
- Model: logistic regression
- Development: 544 patients, 299 with malignant tissue
- Internal validation by bootstrapping
- External validation in 273 patients, 197 with malignant tissue

Logistic regression results: odds ratios [95% CI]
Characteristic | Without LDH | With LDH
Primary tumor teratoma-positive? | 2.7 [1.8–4.0] | 2.5 [1.6–3.8]
Prechemotherapy AFP elevated? | 2.4 [1.5–3.7] | 2.5 [1.6–3.9]
Prechemotherapy HCG elevated? | 1.7 [1.1–2.7] | 2.2 [1.4–3.4]
Square root of postchemotherapy mass size (mm) | 1.08 [0.95–1.23] | 1.34 [1.14–1.57]
Reduction in mass size, per 10% | 0.77 [0.70–0.85] | 0.85 [0.77–0.95]
Ln of standardized prechemotherapy LDH (LDH / upper limit of local normal value) | - | 0.37 [0.25–0.56]

Evaluation approach: graphical assessment
[Validation graphs: observed frequency versus predicted probability for development (n=544) and validation (n=273) data, with the distribution of predictions shown separately for necrosis and tumor]

Lessons
1. Plot observed versus expected outcomes with the distribution of predictions by outcome ('validation graph')
2.
Performance should be assessed in validation sets, since apparent performance is optimistic (the model is developed in the same data set as used for evaluation)
- Preferably external validation
- At least internal validation, e.g. by bootstrap cross-validation

Performance evaluation
Statistical criteria: are predictions close to the observed outcomes?
- Overall: consider residuals y - ŷ, or y - p
- Discrimination: separate low risk from high risk
- Calibration: e.g. 70% predicted = 70% observed
Clinical usefulness: better decision-making?
- One cut-off, defined by expected utility / the relative weight of errors
- Consecutive cut-offs: decision curve analysis

Predictions close to observed outcomes? Penalty functions
- Logarithmic score (as a penalty): -[(1 - Y)·log(1 - p) + Y·log(p)]
- Quadratic score: Y·(1 - p)² + (1 - Y)·p²
[Figure: behavior of the logarithmic and quadratic error scores over the predicted probability (0-100%), shown separately for y=0 and y=1]

Overall performance measures
- R²: explained variation; for logistic / Cox models: Nagelkerke's R²
- Brier score: mean of Y·(1 - p)² + (1 - Y)·p²
- Brier_scaled = 1 - Brier / Brier_max, with Brier_max = mean(p) · (1 - mean(p))² + (1 - mean(p)) · mean(p)²
- Brier_scaled is very similar to Pearson's R² for binary outcomes

Overall performance in case study
Measure | Development | Internal validation | External validation
R² | 38.9% | 37.6% | 26.7%
Brier | 0.174 | 0.178 | 0.161
Brier_max | 0.248 | 0.248 | 0.201
Brier_scaled | 29.8% | 28.2% | 20.0%

Measures for discrimination
- Concordance statistic, or area under the ROC curve
- Discrimination slope
- Lorenz curve

ROC curves for case study
[Figure: ROC curves (true positive rate versus false positive rate) for development (n=544) and validation (n=273) data, with risk cut-offs of 0%, 20%, 30%, and 40% marked on the curves]

Box plots with discrimination slope for case study
[Figure: box plots of predicted risk by outcome (tumor no/yes); discrimination slope 0.30 without LDH and 0.34 with LDH at development, 0.24 at validation]
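The overall and discrimination measures above (Brier score, scaled Brier, concordance statistic, discrimination slope) are easy to compute from a vector of predicted probabilities and observed binary outcomes. A minimal Python/NumPy sketch, following the definitions on the slides (the toy vectors are illustrative, not the case-study data):

```python
import numpy as np

def brier(y, p):
    """Brier score: mean squared difference between outcome and prediction."""
    return np.mean((y - p) ** 2)

def brier_scaled(y, p):
    """Scaled Brier score: 1 - Brier / Brier_max, with Brier_max from mean(p)."""
    pbar = np.mean(p)
    brier_max = pbar * (1 - pbar) ** 2 + (1 - pbar) * pbar ** 2
    return 1 - brier(y, p) / brier_max

def c_statistic(y, p):
    """Concordance statistic: proportion of event/non-event pairs in which
    the patient with the event received the higher prediction (ties = 1/2)."""
    p1, p0 = p[y == 1], p[y == 0]
    diff = p1[:, None] - p0[None, :]          # all pairwise differences
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(p1) * len(p0))

def discrimination_slope(y, p):
    """Difference in mean predicted risk between outcomes."""
    return p[y == 1].mean() - p[y == 0].mean()
```

For a perfectly useless model (constant p), the discrimination slope is 0 and the c statistic is 0.5; the scaled Brier score behaves like an R² for binary outcomes.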
Lorenz concentration curves: general pattern
[Figure: proportion with the outcome versus cumulative proportion of patients ranked by predicted risk]

Lorenz concentration curves: case study
[Figure: fraction with unresected tumor versus fraction NOT undergoing resection, for development (n=544) and validation (n=273)]

Discriminative ability of testicular cancer model
Measure | Development (n=544, 245 necrosis) | Internal validation | External validation (n=273, 76 necrosis)
C statistic [95% CI] | 0.818 [0.783–0.852] | 0.812 [0.777–0.847] | 0.785 [0.726–0.844]
Discrimination slope [95% CI] | 0.301 [0.235–0.367] | 0.294 [0.228–0.360] | 0.237 [0.178–0.296]
Lorenz curve, p25: tumors missed | 9% | - | 13%
Lorenz curve, p75: tumors missed | 58% | - | 65%

Characteristics of measures for discrimination
Concordance statistic
- Calculation: rank order statistic
- Visualization: ROC curve
- Pros: insensitive to outcome incidence; interpretable for pairs of patients with and without the outcome
- Cons: interpretation is artificial
Discrimination slope
- Calculation: difference in mean predictions between outcomes
- Visualization: box plot
- Pros: easy interpretation, nice visualization
- Cons: depends on the incidence of the outcome
Lorenz curve
- Calculation: concentration of outcomes missed by the cumulative proportion of negative classifications
- Visualization: concentration curve
- Pros: shows the balance between finding true positives and the total number classified as positive
- Cons: depends on the incidence of the outcome

Measures for calibration
- Graphical assessments
- Cox recalibration framework (1958)
- Tests for miscalibration: Cox; Hosmer-Lemeshow; Goeman-Le Cessie

Calibration: general principle
[Figure: fraction with the actual outcome versus predicted probability, with the ideal line, a nonparametric smooth, and grouped observations]

Calibration: case study
[Figure: observed frequency versus predicted probability for development (n=544) and validation (n=273) data, with the distribution of predictions shown separately for necrosis and tumor]
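Calibration-in-the-large and the calibration slope come from refitting a logistic model on the linear predictor (the Cox 1958 recalibration framework mentioned above). A minimal NumPy sketch with synthetic data; the function names are illustrative, not from the talk:

```python
import numpy as np

def recalibration(y, p, n_iter=25):
    """Fit logit(P(y=1)) = a + b * logit(p) by Newton-Raphson (IRLS).
    The slope b is the calibration slope; a is the recalibration intercept."""
    lp = np.log(p / (1 - p))                      # linear predictor, logit scale
    X = np.column_stack([np.ones_like(lp), lp])
    beta = np.zeros(2)
    for _ in range(n_iter):
        mu = 1 / (1 + np.exp(-X @ beta))          # current fitted probabilities
        W = mu * (1 - mu)
        beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - mu))
    return beta                                   # (intercept a, slope b)

def cal_in_large(y, p, n_iter=25):
    """Calibration-in-the-large: intercept a with the slope fixed at 1
    (logit(p) enters as an offset)."""
    lp = np.log(p / (1 - p))
    a = 0.0
    for _ in range(n_iter):
        mu = 1 / (1 + np.exp(-(lp + a)))
        a += np.sum(y - mu) / np.sum(mu * (1 - mu))
    return a

# Synthetic check: outcomes generated from the predictions themselves,
# so the model is well calibrated by construction.
rng = np.random.default_rng(1)
lp = rng.uniform(-2, 2, 5000)
p = 1 / (1 + np.exp(-lp))
y = rng.binomial(1, p)
a, b = recalibration(y, p)     # a should be near 0, b near 1
```

At model development a = 0 and b = 1 by definition (as the slides note); at external validation a slope well below 1, such as the 0.74 in the case study, signals predictions that are too extreme.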
Calibration tests
Test | H0 | H1 | df
Calibration-in-the-large (a) | a = 0, given b_overall = 1 | a ≠ 0, given b_overall = 1 | 1
Calibration slope (b_overall) | b_overall = 1 | b_overall ≠ 1 | 1
Recalibration (joint) | a = 0 and b_overall = 1 | a ≠ 0 or b_overall ≠ 1 | 2

Results in the case study
Test | Development | Internal validation | External validation
Calibration-in-the-large (a) | 0 | 0 | -0.03
Calibration slope (b_overall) | 1 | 0.97 | 0.74
Recalibration test | p = 1 | - | p = 0.13
Hosmer-Lemeshow | p = 0.66 | - | p = 0.42
Goeman-Le Cessie | p = 0.63 | - | p = 0.94

Hosmer-Lemeshow test for testicular cancer model

Development
Decile | P | N | Predicted | Observed
1 | <7.3% | 56 | 2.4 | 1
2 | 7.3-16.5% | 53 | 6.3 | 4
3 | 16.6-26.5% | 55 | 11.6 | 13
4 | 26.6-34.7% | 54 | 16.4 | 15
5 | 34.8-43.6% | 54 | 21.0 | 25
6 | 43.7-54.0% | 58 | 28.5 | 33
7 | 54.1-63.5% | 52 | 31.0 | 31
8 | 63.6-73.8% | 54 | 36.9 | 36
9 | 73.9-85.0% | 54 | 42.8 | 40
10 | >85.0% | 54 | 48.0 | 47
Total | | 544 | 245 | 245
Chi-square = 5.9, df = 8, p = 0.66

Validation
Decile | P | N | Predicted | Observed
1 | <1.8% | 31 | 0.2 | 1
2 | 1.8-7.3% | 25 | 1.1 | 1
3 | 7.4-11.1% | 31 | 2.6 | 4
4 | 11.2-17.5% | 30 | 4.4 | 5
5 | 17.6-24.3% | 27 | 5.6 | 7
6 | 24.4-31.0% | 30 | 8.1 | 6
7 | 31.1-37.2% | 20 | 6.7 | 9
8 | 37.3-54.6% | 38 | 17.2 | 18
9 | 54.7-64.7% | 15 | 8.8 | 8
10 | >64.7% | 26 | 20.3 | 17
Total | | 273 | 74.9 | 76
Chi-square = 9.2, df = 9, p = 0.42

Some calibration and goodness-of-fit tests
Calibration-in-the-large
- Calculation: compare mean(y) versus mean(ŷ)
- Visualization: calibration graph
- Pros: key issue in validation; statistical testing possible
- Cons: by definition OK in the model development setting
Calibration slope
- Calculation: regression slope of the linear predictor
- Visualization: calibration graph
- Pros: key issue in validation; statistical testing possible
- Cons: by definition OK in the model development setting
Calibration test
- Calculation: joint test of calibration-in-the-large and calibration slope
- Visualization: calibration graph
- Pros: efficient test of 2 key issues in calibration
- Cons: insensitive to more subtle miscalibration
Harrell's E statistic
- Calculation: absolute difference between smoothed y versus ŷ
- Visualization: calibration graph
- Pros: conceptually easy; summarizes miscalibration over the whole curve
- Cons: depends on the smoothing algorithm
Hosmer-Lemeshow test
- Calculation: compare observed versus predicted in grouped patients
- Visualization: calibration graph or table
- Pros: conceptually easy
- Cons: interpretation difficult; low power in small samples
Goeman-Le Cessie test
- Calculation: consider the correlation between residuals
- Visualization: -
- Pros: overall statistical test; supplementary to the calibration graph
- Cons: very general
Subgroup calibration
- Calculation: compare observed versus predicted in subgroups
- Visualization: table
- Pros: conceptually easy
- Cons: not sensitive to various miscalibration patterns

Lessons
1. Visual inspection of calibration is important at external validation, combined with tests for calibration-in-the-large and the calibration slope

Clinical usefulness: making decisions
- Diagnostic work-up: test ordering; starting treatment
- Therapeutic decision-making: surgery; intensity of treatment

Decision curve analysis
Andrew Vickers
Departments of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center

How to evaluate predictions?
Prediction models are wonderful!
How do you know that they do more good than harm?

Overview of talk
- Traditional statistical and decision analytic methods for evaluating predictions
- Theory of decision curve analysis

Illustrative example
- Men with raised PSA are referred for prostate biopsy
- In the USA, ~25% of men with raised PSA have a positive biopsy
- ~750,000 unnecessary biopsies / year in the US
- Could a new molecular marker help predict prostate cancer?

Molecular markers for prostate cancer detection
- Assess a marker in men undergoing prostate biopsy for elevated PSA
- Create "base" model: logistic regression with biopsy result as the dependent variable; PSA, free PSA, and age as predictors
- Create "marker" model: add the marker(s) as predictors to the base model
- Compare the "base" and "marker" models

How to evaluate models?
- Biostatistical approach ("ROC'ers"): p values; accuracy (area under the curve, AUC)
- Decision analytic approach ("VOI'ers"): decision tree; preferences / outcomes

PSA velocity
- P value for PSAv in the multivariable model: <0.001
- PSAv is an "independent" predictor
- AUC: base model = 0.609; marker model = 0.626

AUCs and p values
- I have no idea whether to use the model or not
  - Is an AUC of 0.626 high enough?
  - Is an increase in AUC of 0.017 enough to make measuring velocity worth it?

Decision analysis
- Identify every possible decision
- Identify every possible consequence: the probability of each, and the value of each

[Decision tree: apply model → biopsy: cancer (probability p1, value a) or no cancer (p2, value b); no biopsy: cancer (p3, value c) or no cancer (1 - (p1 + p2 + p3), value d). Alternatives: biopsy all → cancer (p1 + p3, value a) or no cancer (1 - (p1 + p3), value b); biopsy none → cancer (p1 + p3, value c) or no cancer (1 - (p1 + p3), value d)]

Optimal decision
- Use model: p1·a + p2·b + p3·c + (1 - p1 - p2 - p3)·d
- Treat all: (p1 + p3)·a + (1 - (p1 + p3))·b
- Treat none: (p1 + p3)·c + (1 - (p1 + p3))·d
- Which gives the highest expected value?

Problems with traditional decision analysis
- The p's require a cut-point to be chosen
- Extra data are needed on the health value of the outcomes (a to d): harms of biopsy; harms of delayed diagnosis; harms may vary between patients

Evaluating values of health outcomes
1.
Obtain data from the literature on:
- Benefit of detecting cancer (compared to a missed / delayed cancer)
- Harms of an unnecessary prostate biopsy (compared to no biopsy): burden (pain and inconvenience); cost of biopsy

Evaluating values of health outcomes
2. Obtain data from the individual patient:
- What are your views on having a biopsy?
- How important is it for you to find a cancer?

Either way
- Investigator: "here is a data set; is my model or marker of value?"
- Analyst: "I can't tell you; you have to go away and do a literature search first. Also, you have to ask each and every patient."

ROC'ers and VOI'ers
- ROC'ers' methods are simple and elegant but useless
- VOI'ers' methods are useful, but complex and difficult to apply

Solving the decision tree
[Decision tree: treatment → disease (probability p, value a) or no disease (1 - p, value b); no treatment → disease (p, value c) or no disease (1 - p, value d)]

Threshold probability
- The probability of disease is p̂
- Define a threshold probability of disease as pt
- The patient accepts treatment if p̂ ≥ pt

Solve the decision tree
- pt: the cut-point for choosing whether to treat or not
- The harm:benefit ratio defines pt
  - Harm: d - b (a false positive)
  - Benefit: a - c (a true positive)
- At p = pt the two arms have equal expected value: pt / (1 - pt) = (d - b) / (a - c) = H:B

Intuitively
- The threshold probability at which a patient will opt for treatment is informative of how the patient weighs the relative harms of false-positive and false-negative results

Nothing new so far
- The equation has been used to set the threshold for a positive diagnostic test
- Work out the true harms and benefits of treatment and disease; e.g. if the disease is 4 times worse than the treatment, treat all patients with a probability of disease >20%

A simple decision analysis
1. Select a pt
2.
Positive test defined as p̂ ≥ pt
3. Count true positives (benefit) and false positives (harm)
4. Calculate the "clinical net benefit" as:
NB = TruePositiveCount / n - (FalsePositiveCount / n) × pt / (1 - pt)

Long history: Peirce 1884

Worked example at pt = 20%, N = 2742
| Biopsy if risk ≥ 20% | Biopsy all men
Negative | 346 | 0
True positive | 653 | 710
False positive | 1743 | 2032
Net benefit calculation | (653 - 1743 × (0.2 ÷ 0.8)) / 2742 | (710 - 2032 × (0.2 ÷ 0.8)) / 2742
Net benefit | 0.079 | 0.074

Net benefit has a simple clinical interpretation
- Net benefit of 0.079 at a pt of 20%
- Using the model is the equivalent of a strategy that identified 7.9 cancers per 100 patients with no unnecessary biopsies

Net benefit has a simple clinical interpretation
- Difference between model and treat all at a pt of 20%: 5/1000 more TPs for an equal number of FPs
- Divide by the weighting: 0.005 / 0.25 = 0.02, i.e. 20/1000 fewer FPs for an equal number of TPs (= 20/1000 fewer unnecessary biopsies with no missed cancers)

Decision curve analysis
1. Select a pt
2. Positive test defined as p̂ ≥ pt
3. Calculate the "clinical net benefit" as NB = TruePositiveCount / n - (FalsePositiveCount / n) × pt / (1 - pt)
4. Vary pt over an appropriate range
Vickers & Elkin, Med Decis Making 2006;26:565-574

Decision curve: theory
[Figure: net benefit versus threshold probability (0-100%) for treat none (NB = 0), treat all (p(outcome) = 50%), and decisions based on the model]

Points in decision curves
- If we treat none, NB = …
- If we treat all and the threshold = 0%, NB = …
- If the cut-off equals the incidence of the end point: NB_treat none = NB_treat all = …
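The net benefit arithmetic in the worked example can be checked directly. A short Python sketch using the counts above (pt = 20%); the function name is illustrative:

```python
def net_benefit(tp, fp, n, pt):
    """Clinical net benefit: TP/n - (FP/n) * pt/(1 - pt).
    The odds of the threshold probability pt weight the harm of each FP."""
    w = pt / (1 - pt)
    return tp / n - w * fp / n

# Counts from the prostate biopsy worked example (n = 2742, pt = 20%)
nb_model = net_benefit(tp=653, fp=1743, n=2742, pt=0.20)  # biopsy if risk >= 20%
nb_all = net_benefit(tp=710, fp=2032, n=2742, pt=0.20)    # biopsy all men
# For a decision curve, recount TP and FP at each threshold pt
# and plot NB against pt; treat-none gives NB = 0 at every threshold.
```

With these counts the model strategy gives a net benefit of about 0.079 versus about 0.074 for biopsying all men, matching the table above.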
Decision curve analysis
- Decision curve analysis tells us about the clinical value of a model where accuracy metrics do not
- Decision curve analysis requires neither additional data nor individualized assessment
- Simple-to-use software to implement decision curve analysis is available at www.decisioncurveanalysis.org

Decision analysis in the medical research literature
- Only a moderate number of papers are devoted to decision analysis
- Many thousands of papers are analyzed without reference to decision making (ROC curves, p values)

Decision curve analysis: with thanks to Elena Elkin, Mike Kattan, Daniel Sargent, Stuart Baker, Barry Kramer, and Ewout Steyerberg

Illustrations

Clinical usefulness of testicular cancer model
- Cut-off: 70% necrosis / 30% malignant, motivated by decision analysis
- Current practice: ≈ 65%

Net benefit calculations
- Resect all: NB = (299 - 3/7 × 245) / 544 = 0.357
- Resect none: NB = (0 - 0) / 544 = 0
- Model: NB = (275 - 3/7 × 143) / 544 = 0.393
- Difference model - resect all: 0.036, i.e. 3.6/100 more resections of tumor at the same number of unnecessary resections of necrosis

Decision curves for testicular cancer model
Comparison of performance measures

Lessons
1. Clinical usefulness may be limited despite reasonable discrimination and calibration

Which performance measure when? It depends ...
- Evaluation of usefulness requires weighting and consideration of the outcome incidence
  (Hilden J. Prevalence-free utility-respecting summary indices of diagnostic power do not exist. Stat Med. 2000;19(4):431-40.)
- Summary indices versus graphs (e.g. area versus ROC curve; validation graphs; decision curves; reclassification table versus predictiveness curve)

Which performance measure when?
1. Discrimination: if poor, usefulness is unlikely, but NB ≥ 0
2.
Calibration: if poor in a new setting, there is a risk of NB < 0

Conclusions
- Statistical evaluations are important, but may be at odds with the evaluation of clinical usefulness; is an ROC area of 0.8 good? Is 0.6 always poor? NO!
- Decision-analytic performance measures, such as decision curves, are important to consider when evaluating the potential of a prediction model to support individualized decision making

References
- Steyerberg EW. Clinical prediction models: a practical approach to development, validation, and updating. New York: Springer, 2009.
- Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making 26:565-74, 2006.
- Steyerberg EW, Vickers AJ. Decision curve analysis: a discussion. Med Decis Making 28:146, 2008.
- Pencina MJ, D'Agostino RB Sr, Steyerberg EW. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat Med 30:11-21, 2011.
- Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, Pencina MJ, Kattan MW. Assessing the performance of prediction models: a framework for some traditional and novel measures. Epidemiology 21:128-38, 2010.
- Steyerberg EW, Pencina MJ, Lingsma HF, Kattan MW, Vickers AJ, Van Calster B. Assessing the incremental value of diagnostic and prognostic markers: a review and illustration. Eur J Clin Invest, 2011.
- Steyerberg EW, Van Calster B, Pencina MJ. Performance measures for prediction models and markers: evaluation of predictions and classifications. Rev Esp Cardiol 64:788-794, 2011.

Evaluation of incremental value of markers

Case study: CVD prediction
- Cohort: 3264 participants in the Framingham Heart Study, aged 30 to 74 years
- 183 developed CHD (10-year risk: 5.6%)
- Data as used in Pencina MJ, D'Agostino RB Sr, D'Agostino RB Jr, Vasan RS. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med 27:157-172, 2008; and Steyerberg EW, Van Calster B, Pencina MJ.
Performance measures for prediction models and markers: evaluation of predictions and classifications. Rev Esp Cardiol 64:788-794, 2011.

Analysis
- Cox proportional hazards models (time-to-event data)
- Reference model
  - Dichotomous predictors: sex, diabetes, smoking
  - Continuous predictors: age, systolic blood pressure (SBP), total cholesterol
  - All hazard ratios statistically significant
- Add high-density lipoprotein (HDL) cholesterol as a continuous predictor: highly significant (hazard ratio = 0.65, p < .001)
- How good are these models?

Performance of reference model; incremental value of HDL

Performance criteria
Steyerberg EW, Van Calster B, Pencina MJ. Medidas del rendimiento de modelos de predicción y marcadores pronósticos: evaluación de las predicciones y clasificaciones. Rev Esp Cardiol. 2011. doi:10.1016/j.recesp.2011.04.017

Case study: quality of predictions
- Discrimination: area 0.762 without HDL versus 0.774 with HDL
- Calibration: internal, quite good; external, more relevant

Performance
- Full range of predictions: ROC, R², ...
- Classifications / decisions: a cut-off to define low versus high risk

Determine a cut-off for classification
Data-driven cut-off: Youden's index = sensitivity + specificity - 1
- E.g. sens 80%, spec 80%: Youden = ...
- E.g. sens 90%, spec 80%: Youden = ...
- E.g. sens 80%, spec 90%: Youden = ...
- E.g. sens 40%, spec 60%: Youden = ...
- E.g. sens 100%, spec 100%: Youden = ...
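The Youden examples above follow directly from the definition; a one-line Python helper makes the pattern easy to check for any sensitivity/specificity pair (illustrative only):

```python
def youden(sens, spec):
    """Youden's index J = sensitivity + specificity - 1."""
    return sens + spec - 1

# An uninformative test (sens + spec = 1) gives J = 0;
# a perfect test (sens = spec = 100%) gives J = 1.
```

Note that J weights a gain in sensitivity and a gain in specificity equally, regardless of the outcome incidence, which is exactly the point the decision-analytic alternative below takes issue with.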
Youden's index is maximized at the upper left corner of the ROC curve
- If predictions are perfectly calibrated, the upper left corner corresponds to cut-off = incidence of the outcome
- Incidence = 183/3264 = 5.6%

Determine a cut-off for classification
- Data-driven: Youden's index = sensitivity + specificity - 1
- Decision-analytic: cut-off determined by the clinical context, i.e. the relative importance ('utility') of the consequences of a true or false classification
  - True-positive classification: correct treatment
  - False-positive classification: overtreatment
  - True-negative classification: no treatment
  - False-negative classification: undertreatment
- Harm: net overtreatment (FP-TN); Benefit: net correct treatment (TP-FN)
- Odds of the cut-off = H:B ratio

Evaluation of performance
- Youden index: "science of the method"
- Net benefit: "utility of the method"
- References: Peirce, Science 1884; Vergouwe, Semin Urol Oncol 2002; Vickers, MDM 2006

Net benefit
- Net benefit = (TP - w × FP) / N
- w = cut-off / (1 - cut-off); e.g. cut-off 50%: w = .5/.5 = 1; cut-off 20%: w = .2/.8 = 1/4
- w = H:B ratio
- "The number of true-positive classifications, penalized for false-positive classifications"

Increase in AUC for binary classifications
- Cut-off 5.6%: AUC 0.696 → 0.719
- Cut-off 20%: AUC 0.550 → 0.579
- Continuous variant: area 0.762 → 0.774

Addition of a marker to a model
- Typically a small improvement in discriminative ability according to the AUC (or c statistic)
- The c statistic is blamed for being insensitive
- Study 'reclassification'

Net Reclassification Index (NRI)
- NRI = improvement in sensitivity + improvement in specificity
  = (move up | event - move down | event) + (move down | non-event - move up | non-event)
- Over all reclassifications: events, 29 up and 7 down: 22/183 = 12%; non-events, 173 down and 174 up: -1/3081 = -0.03%

NRI for the 5.6% cut-off?
- NRI for CHD: 7/183 = 3.8%
- NRI for no CHD: 24/3081 = 0.8%
- NRI = 4.6%

NRI and sens/spec
- NRI = delta(sens) + delta(spec)
- Sens without HDL = 135/183 = 73.8%
- Sens with HDL = 142/183 = 77.6%
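The two-category NRI at the 5.6% cut-off can be reproduced from the classification counts on the slides. A Python sketch (the helper name is illustrative):

```python
def two_category_nri(tp_old, fp_old, tp_new, fp_new, n_event, n_nonevent):
    """Two-category NRI = change in sensitivity + change in specificity.
    'old' = model without the marker, 'new' = model with the marker."""
    d_sens = (tp_new - tp_old) / n_event
    d_spec = (fp_old - fp_new) / n_nonevent  # fewer FPs = higher specificity
    return d_sens + d_spec

# Framingham case study at the 5.6% cut-off: adding HDL to the reference model
nri = two_category_nri(tp_old=135, fp_old=1067, tp_new=142, fp_new=1043,
                       n_event=183, n_nonevent=3081)
```

With these counts the NRI is 7/183 + 24/3081 ≈ 4.6%, as on the slide; note that events and non-events contribute per-group proportions, not per-population ones.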
NRI better than delta AUC?
- NRI = delta(sens) + delta(spec)
- AUC for a binary classification = (sens + spec) / 2
- So delta(AUC) = (delta(sens) + delta(spec)) / 2, and NRI = 2 × delta(AUC)
- Also delta(Youden) = delta(sens) + delta(spec), so NRI = delta(Youden)
- NRI hence has 'absurd' weighting?

Decision-analytic performance: NB
- Net benefit = (TP - w × FP) / N
- Model without HDL: TP = 3 + 132 = 135; FP = 166 + 901 = 1067; w = 0.056/0.944 = 0.059; N = 3264
  NB = (135 - 0.059 × 1067) / 3264 = 2.21%
- Model with HDL: NB = (142 - 0.059 × 1043) / 3264 = 2.47%

Delta(NB)
- Increase in TP: 10 - 3 = 7
- Decrease in FP: 166 - 142 = 24
- Increase in NB: (7 + 0.059 × 24) / 3264 = 0.26%
- Interpretation: "2.6 more true CHD events identified per 1000 subjects, at the same number of FP classifications"
- "HDL has to be measured in 1 / 0.26% = 385 subjects to identify one more TP"

Application to FHS
- Continuous NRI: no categories; all cut-offs considered
- Information similar to the AUC and the decision curve
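The net benefit comparison for the HDL example can likewise be checked in a few lines; a Python sketch using the slide's counts and cut-off (function name illustrative):

```python
def net_benefit(tp, fp, n, pt):
    """Net benefit = (TP - w * FP) / N, with w = pt / (1 - pt) the H:B odds."""
    w = pt / (1 - pt)
    return (tp - w * fp) / n

pt = 0.056                                                # cut-off at the CHD incidence
nb_without = net_benefit(tp=135, fp=1067, n=3264, pt=pt)  # reference model
nb_with = net_benefit(tp=142, fp=1043, n=3264, pt=pt)     # reference model + HDL
delta_nb = nb_with - nb_without
# Delta NB of about 0.26%: roughly 2.6 more true CHD events per 1000
# subjects at the same number of FPs; HDL must be measured in about
# 1/0.26% = 385 subjects to identify one extra true positive.
```

The small differences from the slide's 2.21% / 2.47% arise only from rounding w to 0.059 there; the conclusion is identical.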