The Receiver Operating Characteristic (ROC) Curve

The basic idea of diagnostic test interpretation is to calculate the probability a patient has a disease under consideration given a certain test result. A 2 x 2 table is used as a mnemonic device. The test results are on the left side and the disease status on top as shown here:


                 Disease present        Disease absent
Test positive    True positives TP      False positives FP
Test negative    False negatives FN     True negatives TN

We use the following terminology:

P( T+ | D- )

The above simply reads "the probability of the test being positive, given that the disease is not present". T+ is an abbreviation for "a positive test", and "D-" is similarly shorthand for "the disease isn't present".

P(X) is the probability of the event X and the vertical bar means "given that".

Statement      Translation
P(T+ | D+)     Sensitivity, true positive fraction = TPF
P(T- | D-)     Specificity, true negative fraction = TNF
P(T+ | D-)     False positive fraction = FPF
P(T- | D+)     False negative fraction = FNF

Using similar notation, one can also talk about the prevalence of a disease in a population as "P(D+)". The false negative fraction is the same as one minus the true positive fraction, and similarly, FPF = 1 - TNF.

In our table, TPF represents the fraction of patients with disease who have this corroborated by a "high" test value (above whatever cutoff level was chosen). FPF represents the fraction of false positives: the test has lied to us and told us that healthy patients have the disease. Similarly, TNF is the fraction of true negatives and FNF the fraction of false negatives.

Sensitivity is the proportion of patients with disease who test positive. It is the True Positive Fraction. The sensitivity is how good the test is at picking out patients with disease. Sensitivity gives us the proportion of cases picked out by the test, relative to all cases that actually have the disease:

P(T+ | D+) = TP / (TP + FN).

Specificity is the proportion of patients without disease who test negative. Specificity is the ability of the test to pick out patients who do NOT have the disease. It won't surprise you to see that this is synonymous with the True Negative Fraction:

P(T- | D-) = TN / (TN + FP).

Pretest Probability is the estimated likelihood of disease before the test is done. It is the same thing as prior probability and is often estimated. If a defined population of patients is being evaluated, the pretest probability is equal to the prevalence of disease in the population. It is the proportion of total patients who have the disease:

P(D+) = (TP + FN) / (TP + FP + TN + FN).
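The three formulas above can be sketched in a few lines of Python. The function name and the cell counts are hypothetical, chosen only to illustrate the arithmetic; they do not come from any study.

```python
def diagnostic_fractions(tp, fp, fn, tn):
    """Return (sensitivity, specificity, prevalence) from 2 x 2 cell counts."""
    sensitivity = tp / (tp + fn)                   # P(T+ | D+)
    specificity = tn / (tn + fp)                   # P(T- | D-)
    prevalence = (tp + fn) / (tp + fp + tn + fn)   # P(D+)
    return sensitivity, specificity, prevalence


# Hypothetical counts for 1000 patients, chosen only for illustration.
sens, spec, prev = diagnostic_fractions(tp=90, fp=45, fn=10, tn=855)
print(f"sensitivity = {sens:.3f}")   # FNF is then 1 - sensitivity
print(f"specificity = {spec:.3f}")   # FPF is then 1 - specificity
print(f"prevalence  = {prev:.3f}")
```

With these made-up counts the sensitivity is 0.9, the specificity 0.95 and the prevalence 0.1.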

Sensitivity and specificity describe how well the test discriminates between patients with and without disease. They address a different question than we want answered when evaluating a patient, however. What we usually want to know is: given a certain test result, what is the probability of disease? This is the predictive value of the test.

Predictive value of a positive test is the proportion of patients with positive tests who have disease:

P(D+ | T+) = TP / (TP + FP).

This is the same thing as posttest probability of disease given a positive test. It measures how well the test rules in disease.

Predictive value of a negative test is the proportion of patients with negative tests who do not have disease:

P(D- | T-) = TN / (TN + FN).

It measures how well the test rules out disease. Notice that this is not the same as the posttest probability of disease given a negative test, which is one minus the predictive value of a negative test.
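A minimal sketch of the two predictive values, again using hypothetical cell counts and a hypothetical function name. It also shows the posttest probability of disease after a negative test as one minus the negative predictive value.

```python
def predictive_values(tp, fp, fn, tn):
    """Return (positive predictive value, negative predictive value)."""
    ppv = tp / (tp + fp)   # P(D+ | T+)
    npv = tn / (tn + fn)   # P(D- | T-)
    return ppv, npv


# Hypothetical counts, chosen only for illustration.
ppv, npv = predictive_values(tp=90, fp=45, fn=10, tn=855)
p_disease_given_negative = 1 - npv   # P(D+ | T-)

print(f"P(D+ | T+) = {ppv:.3f}")
print(f"P(D- | T-) = {npv:.3f}")
print(f"P(D+ | T-) = {p_disease_given_negative:.3f}")
```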

Evaluating a 2 by 2 table is simple if you are methodical in your approach.


                 Disease present       Disease absent           
Test positive    TP                    FP                       Total positive
Test negative    FN                    TN                       Total negative
                 Total with disease    Total without disease    Grand total

1. Choose a large number of hypothetical patients and write it in the Grand total cell.
2. Multiply the Grand total by the Pretest probability to get the Total with disease.
3. Compute the Total without disease by subtraction.
4. Multiply the Total with disease by the Sensitivity to get the number of True positives.
5. Multiply the Total without disease by the Specificity to get the number of True negatives.
6. Compute the number of False positives and False negatives by subtraction.
7. Compute the Total positive tests and Total negative tests by addition across the rows.
8. Predictive value of a positive test is True positives divided by Total positive tests.
9. Predictive value of a negative test is True negatives divided by Total negative tests.
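The steps above can be sketched as a short Python function. The function name, the dictionary keys, and the example inputs (pretest probability 0.10, sensitivity 0.90, specificity 0.95, 10,000 hypothetical patients) are assumptions made for illustration only.

```python
def build_table(pretest, sensitivity, specificity, grand_total=10_000):
    """Fill in a 2 x 2 table following the numbered steps in the text."""
    with_disease = grand_total * pretest            # step 2
    without_disease = grand_total - with_disease    # step 3
    tp = with_disease * sensitivity                 # step 4
    tn = without_disease * specificity              # step 5
    fn = with_disease - tp                          # step 6
    fp = without_disease - tn                       # step 6
    total_positive = tp + fp                        # step 7
    total_negative = fn + tn                        # step 7
    return {
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "total_positive": total_positive,
        "total_negative": total_negative,
        "PPV": tp / total_positive,                 # step 8
        "NPV": tn / total_negative,                 # step 9
    }


# Hypothetical inputs, chosen only for illustration.
table = build_table(pretest=0.10, sensitivity=0.90, specificity=0.95)
for name, value in table.items():
    print(f"{name:>15}: {value:.3f}")
```

With these inputs, roughly two thirds of the positive tests belong to patients with disease even though the test itself is fairly accurate, because the pretest probability is low.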

The sensitivity and specificity of a diagnostic test depend on more than just the "quality" of the test. They also depend on the definition of what constitutes an abnormal test. Look at the idealized graph at right showing the number of patients with and without a disease arranged according to the value of a diagnostic test. The two distributions overlap, so the test does not distinguish normal from disease with 100% accuracy. The area of overlap indicates where the test cannot distinguish normal from disease. In practice, we choose a cutoff level (indicated by the vertical black line) above which we consider the test to be abnormal and below which we consider the test to be normal. The position of the cut point determines the number of true positives, true negatives, false positives and false negatives. We may wish to use different cutoff levels for different clinical situations if we wish to minimize one of the erroneous types of test results.

Receiver Operating Characteristic (ROC) curves plot the sensitivity of a test versus its false positive rate at various decision points (from definitely present through probably present to definitely absent) and are especially applicable when test results are interpreted subjectively. ROC analysis has wide applicability in radiology research for comparing observers, modalities, and tests.

The name "Receiver Operating Characteristic" came from "Signal Detection Theory" developed during World War II for the analysis of radar images. Radar operators had to decide whether a blip on the screen represented an enemy target, a friendly ship, or just noise. Signal detection theory measures the ability of radar receiver operators to make these important distinctions. Their ability to do so was called the Receiver Operating Characteristics. ROC curves were developed in the 1950's as a by-product of research into making sense of radio signals contaminated by noise. It was not until the 1970's that signal detection theory was recognized as useful for interpreting medical test results. More recently it's become clear that they are remarkably useful in medical decision-making.

A good one-class classifier will have both a small fraction false negative and a small fraction false positive. Because the error on the target class can be estimated (relatively) well, it is assumed that for all one-class classifiers a threshold can be set beforehand on the target error. By varying this threshold, and measuring the error on the (maybe artificial) outlier objects, a Receiver Operating Characteristics curve (ROC-curve) is obtained. This curve shows how the fraction false positive varies for varying fraction false negative. The smaller these fractions are, the more this one-class classifier is to be preferred. Traditionally the fraction true positive is plotted versus the fraction false positive.

We can use a set of hypothyroidism data to illustrate how sensitivity and specificity change depending on the choice of T4 level that defines hypothyroidism. Recall the data on patients with suspected hypothyroidism reported by Goldstein and Mushlin (J. Gen. Intern. Med. 1987 2 20-24). The data on T4 values in hypothyroid and euthyroid patients are shown graphically (below) and in a simplified tabular form:

T4 value     Hypothyroid    Euthyroid
5 or less    18             1
5.1 - 7      7              17
7.1 - 9      4              36
9 or more    3              39
Totals:      32             93

Suppose that patients with T4 values of 5 or less are considered to be hypothyroid. The data display then reduces to:

T4 value     Hypothyroid    Euthyroid
5 or less    18             1
> 5          14             92
Totals:      32             93

The sensitivity is 0.563 and the specificity is 0.989.

Now, suppose we decide to make the definition of hypothyroidism less stringent and now consider patients with T4 values of 7 or less to be hypothyroid. The data display will now look like this:

T4 value     Hypothyroid    Euthyroid
7 or less    25             18
> 7          7              75
Totals:      32             93

The sensitivity is 0.781 and the specificity is 0.806. Let us move the cutoff level for hypothyroidism one more time:

T4 value     Hypothyroid    Euthyroid
< 9          29             54
9 or more    3              39
Totals:      32             93

The sensitivity is 0.906 and the specificity is 0.419. Now, take the sensitivity and specificity values above and put them into a table:

Cutoff Level    Sensitivity    Specificity
5               0.563          0.989
7               0.781          0.806
9               0.906          0.419

Notice that you can improve the sensitivity by moving the cutoff level to a higher T4 value--that is, you can make the criterion for a positive test less strict. You can improve the specificity by moving the cut point to a lower T4 value--that is, you can make the criterion for a positive test stricter. Thus, there is a tradeoff between sensitivity and specificity.
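The sensitivity and specificity at each cutoff can be recomputed from the banded T4 counts with a running total. This is a sketch only: the constant and function names are hypothetical, and `None` is used here as a marker for the open-ended "9 or more" band.

```python
# Counts from the T4 table: (upper cutoff of the band, hypothyroid, euthyroid).
# None marks the open-ended "9 or more" band, which is never a cutoff.
T4_BANDS = [(5, 18, 1), (7, 7, 17), (9, 4, 36), (None, 3, 39)]


def sensitivity_specificity(bands):
    """Map each cutoff to (sensitivity, specificity), calling values at or below it positive."""
    total_hypo = sum(hypo for _, hypo, _ in bands)
    total_eu = sum(eu for _, _, eu in bands)
    results = {}
    tp = fp = 0
    for cutoff, hypo, eu in bands:
        if cutoff is None:
            break
        tp += hypo   # hypothyroid patients at or below the cutoff
        fp += eu     # euthyroid patients at or below the cutoff
        results[cutoff] = (tp / total_hypo, 1 - fp / total_eu)
    return results


for cutoff, (sens, spec) in sensitivity_specificity(T4_BANDS).items():
    print(f"cutoff {cutoff}: sensitivity {sens:.4f}, specificity {spec:.4f}")
```

Raising the cutoff only ever adds patients to the "positive" group, which is why sensitivity rises and specificity falls together.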

T4 value     Hypothyroid    Euthyroid
5 or less    18             1
5.1 - 7      7              17
7.1 - 9      4              36
9 or more    3              39
Totals:      32             93

We showed that this table can be summarized by the operating characteristics in the table below:

Cut point    Sensitivity    Specificity
5            0.563          0.989
7            0.781          0.806
9            0.906          0.419

The operating characteristics can be reformulated slightly and then presented graphically as shown below:

Cut point    True positive rate    False positive rate
             (sensitivity)         (1 - specificity)
5            0.563                 0.011
7            0.781                 0.194
9            0.906                 0.581

This type of graph is called a Receiver Operating Characteristic curve (or ROC curve). It is a plot of the true positive rate against the false positive rate for the different possible cut points of a diagnostic test. An ROC curve demonstrates several things:

1. It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).

2. The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.

3. The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.

4. The slope of the tangent line at a cut point gives the likelihood ratio (LR) for that value of the test. You can check this out on the graph above. Recall that the LR for T4 < 5 is 52. This corresponds to the far left, steep portion of the curve. The LR for T4 > 9 is 0.2. This corresponds to the far right, nearly horizontal portion of the curve.

5. The area under the curve is a measure of test accuracy.
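Point 4 can be checked numerically. For banded data, the slope of the ROC segment that a band contributes equals that band's likelihood ratio: the fraction of hypothyroid patients in the band divided by the fraction of euthyroid patients in it. The sketch below uses the T4 counts; the variable and function names are hypothetical.

```python
# T4 counts per band, ordered from lowest T4 (most abnormal) to highest.
LABELS = ["5 or less", "5.1 - 7", "7.1 - 9", "9 or more"]
HYPOTHYROID = [18, 7, 4, 3]
EUTHYROID = [1, 17, 36, 39]


def band_likelihood_ratios(diseased, healthy):
    """Likelihood ratio of each band: P(band | D+) / P(band | D-)."""
    total_d = sum(diseased)
    total_h = sum(healthy)
    return [(d / total_d) / (h / total_h) for d, h in zip(diseased, healthy)]


likelihood_ratios = band_likelihood_ratios(HYPOTHYROID, EUTHYROID)
for label, lr in zip(LABELS, likelihood_ratios):
    print(f"{label:>10}: LR = {lr:.2f}")
```

The first band gives an LR of about 52 and the last about 0.2, matching the steep left and nearly flat right portions of the curve described in point 4.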

The graph above shows three ROC curves representing excellent, good, and worthless tests plotted on the same graph. The accuracy of the test depends on how well the test separates the group being tested into those with and without the disease in question. Accuracy is measured by the area under the ROC curve. An area of 1 represents a perfect test; an area of 0.5 represents a worthless test. A rough guide for classifying the accuracy of a diagnostic test is the traditional academic point system:

* 0.90 - 1.00 = excellent

* 0.80 - 0.90 = good

* 0.70 - 0.80 = fair

* 0.60 - 0.70 = poor

* 0.50 - 0.60 = fail

The area measures discrimination, that is, the ability of the test to correctly classify those with and without the disease. Consider the situation in which patients are already correctly classified into two groups. You randomly pick one from the disease group and one from the no-disease group and do the test on both. The patient with the more abnormal test result should be the one from the disease group. The area under the curve is the percentage of randomly drawn pairs for which this is true (that is, the test correctly classifies the two patients in the random pair).
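The pair-counting interpretation can be sketched directly on the banded T4 counts. One assumption is added here that the text does not state: when both patients fall in the same band, the pair is counted as half correct, since the banded data cannot order them. The names are hypothetical.

```python
# T4 counts per band, ordered from lowest T4 (most abnormal) to highest.
HYPOTHYROID = [18, 7, 4, 3]
EUTHYROID = [1, 17, 36, 39]


def concordant_fraction(diseased, healthy):
    """Fraction of (diseased, healthy) pairs in which the diseased patient
    has the more abnormal result; ties within a band count as one half
    (an assumption made for banded data)."""
    score = 0.0
    for i, d in enumerate(diseased):
        for j, h in enumerate(healthy):
            if i < j:          # diseased patient is in a more abnormal band
                score += d * h
            elif i == j:       # same band: cannot be ordered
                score += 0.5 * d * h
    return score / (sum(diseased) * sum(healthy))


pair_auc = concordant_fraction(HYPOTHYROID, EUTHYROID)
print(f"fraction of correctly ordered pairs = {pair_auc:.3f}")
```

For these counts the fraction comes out at roughly 0.85.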

Although the ROC curve gives a very good summary of the performance of a one-class classifier, it is hard to compare two ROC curves. One way to summarize an ROC curve in a single number is to integrate it. In the one-class setting, the quantity integrated is the fraction false positive over varying thresholds (or equivalently, varying fraction false negative). Because this is an integrated error, smaller values indicate a better separation between target and outlier objects; it is the complement of the area under the conventional ROC curve (AUC) described above, for which larger values are better. Note that for the actual application of a one-class classifier a specific threshold (or fraction false negative) has to be chosen. That means that only one point of the ROC curve is used. It can therefore happen that, for a specific threshold, a classifier with a higher (worse) integrated error is preferred over another classifier with a lower one. It just means that for that specific threshold, the fraction false positive is smaller for the first classifier than for the second.

Recall the T4 data. The area under the T4 ROC curve is 0.86. The T4 test would be considered "good" at separating hypothyroid from euthyroid patients.
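As a rough check, the area can be approximated by trapezoids through the three tabulated cut points plus the corners (0, 0) and (1, 1). This is a sketch with hypothetical names, not the method used to obtain the published figure.

```python
# (false positive rate, true positive rate) at cut points 5, 7 and 9,
# plus the two corners of the ROC space.
ROC_POINTS = [
    (0.0, 0.0),
    (1 / 93, 18 / 32),
    (18 / 93, 25 / 32),
    (54 / 93, 29 / 32),
    (1.0, 1.0),
]


def trapezoid_auc(points):
    """Area under a piecewise-linear ROC curve by the trapezoid rule."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


auc = trapezoid_auc(ROC_POINTS)
print(f"trapezoidal AUC = {auc:.3f}")
```

With only three cut points the trapezoids give about 0.85, slightly below the quoted value; the gap may reflect the coarse banding used here or a different estimation method.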

Two methods are commonly used to compute AUC: a non-parametric method based on constructing trapezoids under the curve as an approximation of area and a parametric method using a maximum likelihood estimator to fit a smooth curve to the data points. Both methods are available as computer programs and give an estimate of area and standard error that can be used to compare different tests or the same test in different patient populations. For more on quantitative ROC analysis, see:

Metz CE, Basic principles of ROC analysis, Sem. Nuc. Med. 8 283-298 (1978)
