Diagnostic studies: Some biometric basics

What exactly is diagnostics?

No answer will be given here, as this would require starting with philosophical discussions of the cognitive process... and continuing with the principles of medical practice. The terms illness, diagnosis, diagnostics, test, diagnostic process, etc., would need to be defined. The biostatistician, however, can keep it simple: he views the diagnostic measure as a means of transforming an a priori probability of the correctness of the assumption that a patient suffers from an illness into a (preferably) higher a posteriori probability.

From diagnostics to diagnostic testing

For descriptive purposes, diagnostics is understood as a sequence of individual binary decisions. These decisions are made using diagnostic tests that aim to distinguish between two states: disease present / absent. Accordingly, the test result is also a yes/no statement: sick (= positive) / not sick (= negative). For tests with quantitative results, such as laboratory values, the conversion into such a binary statement is achieved using a cut-off value (cut-off point).

From this, a four-field table can be generated that compares the patient's true condition (reference standard, gold standard) and the test result (test performed = index test):

	Reference standard: disease present (D+)	Reference standard: disease absent (D-)
Index test: positive (T+)	True positive test result: TP	False positive test result: FP
Index test: negative (T-)	False negative test result: FN	True negative test result: TN
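
As a minimal sketch of how such a table arises from quantitative results, the following Python snippet dichotomizes measured values at a cut-off and counts the four cells. The marker values, the reference-standard results, and the convention that a value at or above the cut-off counts as test positive are assumptions made purely for illustration.

```python
from collections import Counter

def four_field_table(values, diseased, cutoff):
    """Count TP, FP, FN, TN; a value >= cutoff counts as test positive (assumption)."""
    counts = Counter()
    for value, has_disease in zip(values, diseased):
        test_positive = value >= cutoff
        if test_positive and has_disease:
            counts["TP"] += 1
        elif test_positive:
            counts["FP"] += 1
        elif has_disease:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return {cell: counts[cell] for cell in ("TP", "FP", "FN", "TN")}

# Hypothetical marker values and reference-standard results (True = disease present).
values = [1.2, 3.8, 5.1, 0.9, 4.4, 2.7, 6.3, 1.8]
diseased = [False, True, True, False, False, True, True, False]

table = four_field_table(values, diseased, cutoff=3.0)
print(table)
```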

Measures of diagnostic accuracy

This table allows us to calculate the ratios of individual cells to the sums, both column-wise and row-wise.

First, we can state the prevalence: D+/([D+] + [D-]) = (TP+FN)/(TP+FP+TN+FN).

Diagnostic accuracy is always reported as a pair of values: {sensitivity, specificity}, or {PPV, NPV} or {DLR+, DLR-}.

Summary measures that use only one value (e.g., the sum of sensitivity and specificity, the Youden index, efficiency, etc.) inadequately reflect the quality of a test and are unsuitable for describing its diagnostic accuracy. If only a single number is reported (e.g., only the negative predictive value, or an "accuracy of 95%"), this information is insufficient to assess the quality of a diagnostic test. Personally, in such cases, I assume that an unfavorable characteristic of the test is being concealed.

Sensitivity is the proportion of correctly identified (test-positive) patients among all diseased patients: TP/D+ = TP/(TP+FN). Specificity is the proportion of correctly identified (test-negative) individuals among all non-diseased individuals: TN/D- = TN/(TN+FP). In statistical terms, these values are conditional probabilities: sensitivity is the conditional probability of a positive test result given the presence of the disease, written P(T+|D+), and specificity is the conditional probability of a negative test result given the absence of the disease, P(T-|D-). Sensitivity and specificity are the metrics that developers and manufacturers use to evaluate their diagnostic tests. Sometimes the true positive fraction (TPF) = sensitivity and the false positive fraction (FPF) = 1 - specificity are reported instead. Since sensitivity and specificity are determined within the columns of the table above, they do not depend on prevalence.
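
The column-wise calculation can be sketched in a few lines of Python. The function name and the cell counts below are hypothetical and serve only to illustrate the formulas TP/(TP+FN) and TN/(TN+FP); they are not taken from any study.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Column-wise measures of the four-field table, plus prevalence."""
    diseased = tp + fn          # column D+
    non_diseased = fp + tn      # column D-
    return {
        "prevalence": diseased / (diseased + non_diseased),
        "sensitivity": tp / diseased,       # P(T+|D+)
        "specificity": tn / non_diseased,   # P(T-|D-)
    }

# Hypothetical counts, chosen only for illustration.
measures = accuracy_measures(tp=45, fp=10, fn=15, tn=130)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Because each of the two measures uses only one column, multiplying all counts in one column by the same factor (that is, changing the prevalence) leaves sensitivity and specificity unchanged.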

In contrast, predictive values (row-wise analysis) consider the probability that the patient actually has the condition indicated by the test (positive predictive value PPV: TP/(TP+FP); negative predictive value NPV: TN/(TN+FN)). These predictive values can also be formulated as conditional probabilities: the PPV is the conditional probability of the disease being present given a positive test result, P(D+|T+) [note the reversed order of T+ and D+ compared to sensitivity], while the NPV is the conditional probability of the disease being absent given a negative test result, P(D-|T-). The predictive values thus describe the perspective of the physician (or the patient) who has the test result: I have a positive test result; what is the probability that I am actually ill? The PPV answers this question. Physicians and patients can use the PPV and NPV, respectively, to assess the relevance of the test result.

Excel tool for calculating diagnostic accuracy measures (free download)

Calculation of diagnostic accuracy: sensitivity, specificity, PPV, NPV, DLR

How can one assess the quality of predictive values? This will be illustrated using the example of the Pap test, a screening test for the presence of cervical lesions (prevalence = 0.8%). The sensitivity is 55%, and the specificity is 97%. From these figures, the Pap test has a (seemingly small) positive predictive value of around 12.9%, yet it is still considered a good test. This assessment is reached by comparing the positive predictive value (PPV) (12.9%) with the prevalence (0.8%), or the negative predictive value (NPV) (99.6%) with 1 - prevalence (99.2%). In the case of a positive test result, the Pap test therefore offers a substantial gain in information, as the PPV is considerably higher than the prevalence (12.9% vs. 0.8%).
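
The Pap test figures can be reproduced with Bayes' theorem. The following is a minimal sketch; the function name is an assumption, while the three input values are the ones quoted above.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """PPV and NPV from prevalence, sensitivity and specificity (Bayes' theorem)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)   # P(D+|T+)
    npv = true_neg / (true_neg + false_neg)   # P(D-|T-)
    return ppv, npv

prevalence = 0.008
ppv, npv = predictive_values(prevalence, sensitivity=0.55, specificity=0.97)
print(f"PPV = {ppv:.1%} versus prevalence = {prevalence:.1%}")
print(f"NPV = {npv:.1%} versus 1 - prevalence = {1 - prevalence:.1%}")
```

Calling the function with a higher prevalence and unchanged sensitivity and specificity shows how strongly the predictive values depend on the population in which the test is used.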

The positive diagnostic likelihood ratio (DLR+), which, together with the negative DLR-, can also be used as a measure of diagnostic accuracy, represents this information gain directly: it is the ratio of post-test odds to pre-test odds. In practice, DLR+ is calculated as sensitivity/(1-specificity) [negative DLR: (1-sensitivity)/specificity]. For the Pap test, this results in a DLR+ of 18.3 and a DLR- of 0.46. The diagnostic likelihood ratio is therefore the measure of diagnostic accuracy that best reflects the information gained by the test. It is also the only measure whose absolute values can be directly evaluated. A rough rule of thumb is that a test is "good" if DLR+ > 3 and DLR- < 0.33. Not many in-vitro diagnostic (IVD) tests achieve this; the tumor marker Cyfra 21-1 is one such example, shown in the figure below left.
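
A short sketch of the odds relationship, using the Pap test values from above: the pre-test odds (from the prevalence) multiplied by the likelihood ratio give the post-test odds, which convert back to a probability. The function names are assumptions chosen for this illustration.

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative diagnostic likelihood ratios."""
    dlr_pos = sensitivity / (1 - specificity)
    dlr_neg = (1 - sensitivity) / specificity
    return dlr_pos, dlr_neg

def post_test_probability(pre_test_probability, dlr):
    """Convert a pre-test probability into a post-test probability via odds."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * dlr
    return post_odds / (1 + post_odds)

dlr_pos, dlr_neg = likelihood_ratios(sensitivity=0.55, specificity=0.97)
p_after_positive = post_test_probability(0.008, dlr_pos)  # equals the PPV
p_after_negative = post_test_probability(0.008, dlr_neg)  # equals 1 - NPV

print(f"DLR+ = {dlr_pos:.1f}, DLR- = {dlr_neg:.2f}")
print(f"P(D+|T+) = {p_after_positive:.3f}, P(D+|T-) = {p_after_negative:.4f}")
```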

The following image shows the calculation of the measures of diagnostic accuracy using the tumor marker Cyfra 21-1 as an example (data from Keller et al, 1998) as well as a summary table of characteristics of the measures of diagnostic accuracy (cited from Pepe 2003).

Objectives of a diagnostic study

The goal of a diagnostic study is generally to determine the diagnostic accuracy of a test (along with its confidence interval). The commonly used measures are the pairs sensitivity/specificity and/or positive/negative predictive value (PPV/NPV). Positive and negative diagnostic likelihood ratios (DLR+/DLR-) are sometimes requested separately by regulatory authorities or reviewers.
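
One possible way to attach a confidence interval to an estimated sensitivity or specificity is the Wilson score interval for a binomial proportion. This is a sketch under my own assumptions: the choice of the Wilson method (rather than, say, an exact Clopper-Pearson interval) and the counts of 45 detected out of 60 diseased patients are illustrative and not prescribed by any guideline cited here.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / n
    denominator = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denominator
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denominator
    return center - half_width, center + half_width

# Hypothetical study: 45 of 60 diseased patients test positive.
lower, upper = wilson_interval(successes=45, n=60)
print(f"sensitivity = {45 / 60:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The same function applies to specificity by passing TN and the number of non-diseased individuals.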

It is important to understand that the study approach for determining (statisticians say: estimating) diagnostic accuracy is exploratory. A confirmatory approach would consist of prospectively demonstrating that the diagnostic accuracy exceeds certain values (e.g., demonstrating that sensitivity > 60%, specificity > 80%). In practice, it is also common to encounter the situation where it is necessary to demonstrate that the sensitivity exceeds a certain value, while the specificity does not fall below a certain value (non-inferiority).
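
A confirmatory claim such as "sensitivity > 60%" can, for example, be examined with a one-sided exact binomial test. The sketch below assumes this particular test and uses hypothetical counts (45 true positives among 60 diseased patients); a real study would fix the hypothesis, the significance level, and the sample size in advance.

```python
from math import comb

def binomial_upper_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0): one-sided exact p-value for H0: p <= p0."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))

# Hypothetical data: 45 true positives among 60 diseased patients.
p_value = binomial_upper_tail(k=45, n=60, p0=0.60)
print(f"one-sided p-value for H0: sensitivity <= 0.60: {p_value:.4f}")
```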

Another objective is to compare the diagnostic accuracy of different diagnostic tests. Strictly speaking, two comparisons must be formulated here, one each for sensitivity and specificity. Here, too, one frequently encounters the situation where only one of the measures needs to be superior to that of the comparator test, while for the other measure, only non-inferiority needs to be demonstrated. Suitable measures for comparative studies include rTPF (ratio of sensitivities) and rFPF (ratio of false-positive fractions [= 1 - specificity]), or rPPV (ratio of PPVs) and rNPV (ratio of NPVs), as these are readily accessible to statistical modeling (they can be estimated directly using generalized linear models with a logarithmic link function, including confidence intervals and, if necessary, adjustment for influencing factors).
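
To illustrate what rTPF means, here is a simplified sketch. Instead of the generalized linear model mentioned above, it uses the closed-form Wald interval on the log scale, and it assumes an unpaired design in which the two tests are evaluated in independent groups of diseased patients. In the common paired design (both tests applied to the same patients) this variance formula does not apply. All counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def ratio_of_sensitivities(tp_a, n_a, tp_b, n_b, confidence=0.95):
    """rTPF = sens(A) / sens(B) with a log-scale Wald interval (unpaired groups)."""
    sens_a, sens_b = tp_a / n_a, tp_b / n_b
    ratio = sens_a / sens_b
    # Delta-method variance of log(ratio) for two independent binomial proportions.
    se_log = sqrt((1 - sens_a) / (n_a * sens_a) + (1 - sens_b) / (n_b * sens_b))
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ratio, ratio * exp(-z * se_log), ratio * exp(z * se_log)

# Hypothetical data: test A detects 48 of 60, test B detects 40 of 60 diseased patients.
rtpf, lower, upper = ratio_of_sensitivities(tp_a=48, n_a=60, tp_b=40, n_b=60)
print(f"rTPF = {rtpf:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

An rFPF can be obtained in the same way by passing the false-positive counts and the numbers of non-diseased individuals.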

Phases of diagnostic studies

How should you proceed when examining and evaluating a diagnostic test? Köbberling et al. (1989) distinguish four phases that still prove to be very practical:

Phase I: A preliminary technical investigation examines the method. This validation of the measurement properties, e.g., accuracy and precision, provides information on the quality of the method. Further information on method validation can be found in the Laboratory section.

Phase II: The measured values are analyzed for differences in distribution between various patient groups. This allows for an assessment of the test's potential. Phase II studies include patients for whom the diagnosis has already been confirmed. The number of cases per group is not based on the prevalence of the disease, but rather on statistical considerations.

Example: For a phase II study on the diagnostic potential of a tumor marker, 150 patients with a histologically confirmed tumor and 100 patients with an inflammatory disease of the affected organ (tumor exclusion already performed) are included. (In this case, it is important to take the blood sample before the start of therapy, as therapy affects the tumor marker level.) This approach does not accurately reflect the spectrum of the future application population, but rather preferentially includes "sicker" patients and "healthier" non-diseased individuals.

This spectrum bias leads to an overestimation of diagnostic accuracy. In my experience, this is the main reason for the failure of many initially promising biomarkers: their diagnostic accuracy was determined in a case-control study. This can be counteracted to some extent by deliberately covering different disease stages, different comorbidities, and different demographic factors when selecting patients.

A phase II study allows statements about the relationship between sensitivity and specificity of the test based on an ROC curve (ROC: receiver operating characteristics), although, as described, an overestimation is to be expected.

To create the ROC curve, the cut-off point is varied across the range of values of the diagnostic test. The counts TP, FP, FN, and TN change accordingly. In practice, every measured value in the study is used as a cut-off in turn, and sensitivity and specificity are calculated for each. This results in a curve like the one shown in the adjacent figure.
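
The construction can be sketched as follows. Each observed value serves once as the cut-off, and a result at or above the cut-off is counted as test positive (an assumption; for markers where low values indicate disease, the comparison would be reversed). The data are hypothetical.

```python
def roc_points(values, diseased):
    """Return (cutoff, sensitivity, specificity) for every observed value as cut-off."""
    n_diseased = sum(diseased)
    n_non_diseased = len(diseased) - n_diseased
    points = []
    for cutoff in sorted(set(values)):
        tp = sum(1 for v, d in zip(values, diseased) if d and v >= cutoff)
        tn = sum(1 for v, d in zip(values, diseased) if not d and v < cutoff)
        points.append((cutoff, tp / n_diseased, tn / n_non_diseased))
    return points

# Hypothetical marker values and reference-standard results (True = disease present).
values = [1.2, 3.8, 5.1, 0.9, 4.4, 2.7, 6.3, 1.8]
diseased = [False, True, True, False, False, True, True, False]

points = roc_points(values, diseased)
for cutoff, sensitivity, specificity in points:
    print(f"cut-off {cutoff}: sensitivity {sensitivity:.2f}, 1 - specificity {1 - specificity:.2f}")
```

Plotting sensitivity against 1 - specificity for these points gives the empirical ROC curve.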

The ROC curve serves for the scientific exploration of the diagnostic accuracy of the test. It can be used to determine the cut-off point. The area under the curve (AUC) is a general measure of diagnostic performance, but unlike the pairs of values described above, it is unsuitable for reporting the diagnostic accuracy of a test.
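
For orientation, the empirical AUC can be computed without drawing the curve: it equals the proportion of all (diseased, non-diseased) pairs in which the diseased patient has the higher value, with ties counted as one half (the Mann-Whitney statistic). The sketch below uses hypothetical data and assumes that higher values indicate disease.

```python
def empirical_auc(values, diseased):
    """AUC as the share of diseased/non-diseased pairs ranked correctly (ties = 0.5)."""
    pos = [v for v, d in zip(values, diseased) if d]
    neg = [v for v, d in zip(values, diseased) if not d]
    score = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                score += 1.0
            elif p == n:
                score += 0.5
    return score / (len(pos) * len(neg))

# Hypothetical marker values and reference-standard results (True = disease present).
values = [1.2, 3.8, 5.1, 0.9, 4.4, 2.7, 6.3, 1.8]
diseased = [False, True, True, False, False, True, True, False]

auc = empirical_auc(values, diseased)
print(f"AUC = {auc:.3f}")
```

The single number says nothing about which sensitivity and specificity are achieved at the cut-off actually used, which is why it cannot replace the pairs of values.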

Illustration:
ROC (Receiver Operating Characteristics) curve, with confidence band and individual cut-off values. Created with the ACOMED Excel tool, see web shop. Based on CYFRA 21-1 for the diagnosis of bronchial carcinoma in patients with suspected bronchial carcinoma [Keller et al. 1998].

The above example leads to the Phase III trial: In a controlled diagnostic study, the test is evaluated in the specific clinical application situation.

In a phase III diagnostic study, all patients with suspected disease are included in the study; their disease status is not yet known. This corresponds exactly to the situation in which the test would be used in routine diagnostics. The diagnostic procedure for confirming or ruling out the disease must be precisely defined and recognized (reference method, gold standard, diagnostic accuracy criterion).

Example: To evaluate a marker for myocardial infarction in general practice, all patients presenting with a certain set of symptoms (e.g., shortness of breath, unexplained chest pain, characteristic ECG abnormalities) must be included in the study. It is to be expected that a test successfully used in cardiac centers, for example, will perform very differently when applied in general practice, since the patient population is entirely different and the prevalence of the target disease differs.

In phase III trials, cut-off values can be established, which is more difficult than commonly assumed. Since there is always an overlap zone ("gray area") where the test yields the same results for both sick and non-sick individuals, trade-offs must be made: Are false positives or false negatives more acceptable? See also: further instructions for setting cut-off values.

Phase IV: These trials investigate the benefit of a therapeutic measure that follows the diagnostic test (efficacy studies) and answer questions such as these:

Does the introduction of a new imaging method that can identify smaller tumor foci lead to an increase in survival time?
Let's consider patients who experience side effects from certain medications. Would a diagnostic test that identifies these patients lead to a reduction in the complication rate?

Phase IV studies are complex and time-consuming to conduct, and further description will be omitted here.

Biases in diagnostic studies

Finally, some information on three systematic errors that can occur alongside other errors in the evaluation of diagnostic tests and lead to bias.

Selection bias/spectrum bias: This is the main bias in clinical diagnostic trials. The bias occurs when the selection of patients studied, or the spectrum of patients included in the study, does not correspond to the clinical application situation. This was already discussed above in the context of phase II diagnostic trials.

Verification bias: A significant bias is to be expected if the reference standard cannot be established with the same quality for all patients. For example, an invasive procedure might only be used for those who test positive, while understandably it would be omitted for those who test negative. An overestimation of sensitivity is therefore to be expected.
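
A small arithmetic sketch makes the mechanism visible. It assumes a hypothetical population with known true counts, in which every test-positive patient receives the reference standard but only 10% of test-negative patients do; the naive estimates are then computed from the verified patients only.

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity and specificity from the four cells."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical complete table (what full verification would show).
tp, fp, fn, tn = 80, 90, 20, 810
true_sens, true_spec = sensitivity_specificity(tp, fp, fn, tn)

# Partial verification: all test positives, but only 10% of test negatives are verified.
verified_fraction_negatives = 0.10
fn_verified = fn * verified_fraction_negatives
tn_verified = tn * verified_fraction_negatives
naive_sens, naive_spec = sensitivity_specificity(tp, fp, fn_verified, tn_verified)

print(f"sensitivity: true {true_sens:.3f}, naive {naive_sens:.3f}")
print(f"specificity: true {true_spec:.3f}, naive {naive_spec:.3f}")
```

In this sketch the false negatives largely disappear from the verified sample, so the naive sensitivity is too high, while the naive specificity is too low because true negatives are lost as well.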

Lack of blinding, information bias: Knowledge of the result of the test under investigation influences the result of the external criterion. This is particularly to be expected in procedures where findings must be interpreted (imaging procedures). A particularly common error that breaks blinding is the retesting (i.e., repeating measurements or performing more extensive testing) of discordant (i.e., false positive or false negative) cases. This is only permissible if a randomly selected subsample of concordant cases is subjected to the same procedure. An FDA guideline (2007) examines this aspect of diagnostic studies in detail.

Literature

For further reading, I particularly recommend the book by MS Pepe. Regarding the phases of diagnostic studies, the two publications by Köbberling et al. are recommended.

Pepe MS (2003): The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press

Zhou XH, Obuchowski NA, McClish DK (2011, 2nd ed). Statistical Methods in Diagnostic Medicine. Wiley Interscience New York.

Köbberling J, Richter K, Trampisch HJ, Windeler J: Methodology of medical diagnostics. Development, evaluation and application of diagnostic procedures in medicine. Springer-Verlag Berlin Heidelberg New-York (1991)

Köbberling J, Trampisch HJ, Windeler J: Memorandum on the evaluation of diagnostic measures. GMDS publication series (1989) 10

Begg CB: Biases in the Assessment of Diagnostic Tests. Stat. Med. (1987) 6, 411-423

Linnet K: A Review on the Methodology for Assessing Diagnostic Tests. Clin. Chem. (1988) 34, 1379-1386