Statistical equivalence tests and non-inferiority tests

The equivalence problem

Demonstrating that two groups are equal with respect to some property, that is, showing that a difference equals 0, is a common task in clinical studies and experiments. For example, it may be necessary to show that a more cost-effective therapy is just as effective as a more expensive one, that a certain physiological characteristic does not differ between two groups, or that in a robustness study the measured values do not change relevantly compared to the undisturbed measurement.

The classic field of application is bioequivalence testing for pharmaceuticals. It examines whether the time curves of the blood concentration, characterized by the area under the curve (AUC) or the peak concentration (Cmax) (see figure), differ between two administration methods of a drug.

Instead of the difference, one can also consider the quotient of two quantities: a ratio equal to 1 expresses equivalence. This is the case, for example, in the bioequivalence test. Another application is demonstrating that two diagnostic tests have equivalent diagnostic accuracy: the equivalence of sensitivity and/or specificity is expressed using the ratios rTPF (ratio of "true positive fractions" TPF, where TPF = sensitivity) and rFPF (ratio of "false positive fractions" FPF, where FPF = 1 - specificity), both of which ideally equal 1.

The non-inferiority problem

A related problem should be briefly mentioned: the demonstration of non-inferiority. It applies when equivalence is only of interest in one direction (no deterioration), while superiority (and thus a deviation from equivalence in the other direction) is not a problem or is even desirable. One example is demonstrating the absence of carry-over in measuring instruments used in laboratory medicine: here, the only concern is that samples with a high concentration must not lead to falsely elevated values in subsequently measured samples. Another example is the comparison of side effects: they should not occur more frequently in the group under investigation than in the comparator group, while a lower incidence is not a problem.
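The side-effect example can be sketched as a one-sided check on the difference of two event rates. This is only an illustration under stated assumptions: the function name, the non-inferiority margin, the example counts, and the use of a simple Wald (normal) approximation are all chosen here and are not taken from the text.

```python
from math import sqrt
from statistics import NormalDist


def noninferiority_risk_difference(events_new, n_new, events_ref, n_ref,
                                   margin, alpha=0.05):
    """One-sided non-inferiority check for side-effect rates.

    Non-inferiority is claimed if the upper one-sided confidence bound of
    (rate_new - rate_ref) stays below `margin`. Wald approximation, so this
    is only reasonable for fairly large groups (assumption).
    """
    p_new = events_new / n_new
    p_ref = events_ref / n_ref
    diff = p_new - p_ref
    se = sqrt(p_new * (1 - p_new) / n_new + p_ref * (1 - p_ref) / n_ref)
    z = NormalDist().inv_cdf(1 - alpha)  # one-sided normal quantile
    upper_bound = diff + z * se
    return diff, upper_bound, upper_bound < margin


# Hypothetical counts: 12/200 side effects (new) vs. 14/200 (reference),
# margin of 5 percentage points.
diff, upper_bound, non_inferior = noninferiority_risk_difference(
    12, 200, 14, 200, margin=0.05)
print(f"difference={diff:.3f}, upper bound={upper_bound:.3f}, "
      f"non-inferior={non_inferior}")
```

Only the upper bound is checked: a markedly lower side-effect rate in the new group never counts against the claim, which is the one-sided character described above.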

Procedure for the equivalence test

To demonstrate equivalence, one would initially assume a difference of 0. However, this is the ideal case; in reality, a certain range is permitted within which a difference is considered irrelevant. Thus, an equivalence region centered around 0 is defined, bounded by equivalence limits.

Statistical proof can be provided by considering the so-called estimator (e.g., a mean) together with its uncertainty range (described by the confidence interval). The confidence interval must lie within the equivalence limits to demonstrate equivalence (top figure).
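The confidence interval approach can be sketched in a few lines for paired differences. The function name, the data, and the limits are hypothetical. The critical value `t_crit` is passed in from a t-table to keep the sketch free of external libraries, and a 90% interval is assumed here because it corresponds to two one-sided tests at the 5% level.

```python
from math import sqrt
from statistics import mean, stdev


def ci_equivalence(differences, lower, upper, t_crit):
    """Check whether the confidence interval of the mean difference lies
    completely inside the equivalence limits (lower, upper).

    t_crit: tabulated one-sided critical value t(1 - alpha, n - 1);
            with alpha = 0.05 this gives a 90% confidence interval.
    """
    n = len(differences)
    centre = mean(differences)
    half_width = t_crit * stdev(differences) / sqrt(n)
    ci = (centre - half_width, centre + half_width)
    return ci, lower < ci[0] and ci[1] < upper


# Hypothetical paired differences (n = 10, so df = 9 and t(0.95, 9) = 1.833)
differences = [0.2, -0.1, 0.4, 0.0, -0.3, 0.1, 0.3, -0.2, 0.1, 0.0]
ci, equivalent = ci_equivalence(differences, lower=-0.5, upper=0.5, t_crit=1.833)
print(f"90% CI = ({ci[0]:.3f}, {ci[1]:.3f}), equivalent = {equivalent}")
```

With narrower limits, for example ±0.15, the same data would no longer demonstrate equivalence, because the upper end of the interval exceeds the limit.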

Alternatively, statistical hypothesis tests can be used. Here, two null hypotheses are formulated: the difference lies below the lower limit, or it lies above the upper limit (see middle of the figure). The alternative hypothesis, on the other hand, states that the difference lies within the equivalence range. If both null hypotheses are rejected, the alternative hypothesis, and thus equivalence, is considered demonstrated.
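For symmetric limits, these hypotheses can be written compactly. The symbols are chosen here for illustration and do not appear in the original text: \delta denotes the true difference and \Delta > 0 the equivalence limit.

```latex
H_{01}\colon\ \delta \le -\Delta
\qquad
H_{02}\colon\ \delta \ge +\Delta
\qquad
H_{1}\colon\ -\Delta < \delta < +\Delta
```

Each null hypothesis is tested one-sided at level \alpha. Equivalence is concluded only if both are rejected.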

When considering means, two one-sided t-tests are used (TOST: two one-sided tests) [Schuirmann DJ. A Comparison of the Two One-Sided Tests Procedure and the Power Approach for Assessing the Equivalence of Average Bioavailability. J of Pharmacokinetics and Biopharmaceutics 1987; 15(6): 657-680]. The term TOST is now often used synonymously with equivalence testing in general, although strictly speaking it refers to t-tests and mean comparisons. TOST is now available in many software programs.
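A minimal sketch of the TOST procedure for two independent groups might look as follows. It assumes equal variances (pooled standard deviation) and takes the one-sided critical value from a t-table. The function name, data, and limits are hypothetical and are not taken from the cited paper.

```python
from math import sqrt
from statistics import mean, variance


def tost_two_groups(a, b, lower, upper, t_crit):
    """Two one-sided t-tests for the difference mean(a) - mean(b).

    t_crit: tabulated one-sided critical value
            t(1 - alpha, len(a) + len(b) - 2).
    Returns both t statistics and whether both null hypotheses are rejected.
    """
    n_a, n_b = len(a), len(b)
    diff = mean(a) - mean(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    se = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_lower = (diff - lower) / se  # tests "difference <= lower limit"
    t_upper = (upper - diff) / se  # tests "difference >= upper limit"
    equivalent = t_lower > t_crit and t_upper > t_crit
    return t_lower, t_upper, equivalent


# Hypothetical measurements, 10 per group (df = 18, t(0.95, 18) = 1.734)
group_a = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7, 10.0, 9.9]
group_b = [10.0, 10.2, 9.9, 10.1, 9.8, 10.3, 10.0, 9.9, 10.1, 10.2]
t_lower, t_upper, equivalent = tost_two_groups(
    group_a, group_b, lower=-0.3, upper=0.3, t_crit=1.734)
print(f"t_lower={t_lower:.2f}, t_upper={t_upper:.2f}, equivalent={equivalent}")
```

Rejecting both one-sided tests at level alpha gives the same decision as checking whether the (1 - 2·alpha) confidence interval lies inside the limits.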

Note: Our website offers two Excel tools for applying the TOST equivalence test, including a tabulation of the required sample sizes (Fig. below).

Determination of equivalence limits

The main problem in conducting equivalence tests is the prospective (!) determination of the equivalence limits. First of all, this is a substantive question, not a statistical one. Nevertheless, the determination of equivalence limits takes up a large part of statistical consulting. On the one hand, wider equivalence ranges result in a smaller required sample size, and proof is easier to obtain. On the other hand, the validity of the proof may then be limited, and it may not be accepted by authorities, for example.
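How strongly the required sample size depends on the width of the equivalence range can be illustrated with a common normal-approximation formula for a two-group TOST. The sketch assumes symmetric limits, a true difference of exactly zero, and a known standard deviation. The function name and the numbers are hypothetical, and exact t-based calculations give somewhat larger values.

```python
from math import ceil
from statistics import NormalDist


def tost_n_per_group(sd, limit, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-group TOST with
    equivalence limits +/- limit, assuming a true difference of zero.

    n = 2 * sd^2 * (z(1 - alpha) + z(1 - beta / 2))^2 / limit^2
    """
    z = NormalDist().inv_cdf
    beta = 1 - power
    n = 2 * sd ** 2 * (z(1 - alpha) + z(1 - beta / 2)) ** 2 / limit ** 2
    return ceil(n)


# Hypothetical standard deviation of 1.0 and three candidate limits
for limit in (0.5, 0.75, 1.0):
    print(f"limit = +/-{limit}: n per group = {tost_n_per_group(1.0, limit)}")
```

Because the limit enters the formula squared, halving the equivalence range roughly quadruples the required sample size.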

This question can be approached through the following considerations:

Which difference is not relevant?
"A difference that makes no difference."
What is the minimum difference of interest (MID)? The equivalence range should be somewhat smaller, e.g. 0.7 times the MID.
How large is the measurement uncertainty or the biological variability? Here too, the equivalence range should be smaller.

In the field of bioequivalence studies, the limits are set by the authorities: "A decision in favor of bioequivalence will be accepted when the parametric confidence intervals do not exceed the limits of 80% and 125% for the ratio of AUC values and for the ratio of Cmax values. The decision procedure is based on 90% confidence intervals."
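The quoted decision rule can be sketched for the ratio of AUC (or Cmax) values. This is a strongly simplified illustration: it treats the data as simple pairs per subject and ignores period and sequence effects of a real crossover design. The function name and the data are hypothetical. The ratio is analysed on the log scale, where the limits 0.80 and 1.25 are symmetric around 0.

```python
from math import exp, log, sqrt
from statistics import mean, stdev


def ratio_ci_within_80_125(test_values, ref_values, t_crit):
    """90% confidence interval for the geometric mean ratio test/reference,
    checked against the limits 0.80 and 1.25.

    t_crit: tabulated one-sided critical value t(0.95, n - 1).
    """
    log_ratios = [log(t / r) for t, r in zip(test_values, ref_values)]
    n = len(log_ratios)
    centre = mean(log_ratios)
    half_width = t_crit * stdev(log_ratios) / sqrt(n)
    ci = (exp(centre - half_width), exp(centre + half_width))
    return ci, 0.80 <= ci[0] and ci[1] <= 1.25


# Hypothetical AUC values of 8 subjects (df = 7, t(0.95, 7) = 1.895)
auc_test = [102, 95, 110, 98, 105, 99, 101, 97]
auc_ref = [100, 97, 104, 100, 101, 102, 98, 99]
ci, bioequivalent = ratio_ci_within_80_125(auc_test, auc_ref, t_crit=1.895)
print(f"90% CI of ratio = ({ci[0]:.3f}, {ci[1]:.3f}), "
      f"within 80%..125% = {bioequivalent}")
```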

Statistical equivalence tests as an important evaluation tool in method validation

Method validation experiments often aim to demonstrate that a target variable equals zero. For example, in a method comparison, the goal might be to show that the bias (systematic error) of the test method is negligible (zero) compared to a reference method. Similarly, in a robustness or stability study, the goal might be to demonstrate that no relevant changes occur.

The common approach of stating, "The test for difference shows no significant difference, therefore the groups are equal with respect to the characteristic under investigation," is statistically incorrect. Such a result provides an indication, but not proof. This is because a significance test can lead to the rejection of the null hypothesis (which states equality), but it cannot confirm it.

Therefore, if the goal of a project is to demonstrate equivalence, the appropriate tests are needed: the equivalence tests.
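The distinction can be made concrete with a small, noisy data set: a test for difference may be non-significant while equivalence is not demonstrated either. The sketch below uses hypothetical paired differences, a hypothetical equivalence limit of ±1.0, and tabulated t critical values. None of these come from the text.

```python
from math import sqrt
from statistics import mean, stdev


def difference_vs_equivalence(differences, limit, t_two_sided, t_one_sided):
    """Compare the conclusion of a test for difference with that of an
    equivalence test on the same paired differences.

    t_two_sided: t(0.975, n - 1) for the usual two-sided test at 5%
    t_one_sided: t(0.95, n - 1) for the 90% confidence interval (TOST at 5%)
    """
    n = len(differences)
    centre = mean(differences)
    se = stdev(differences) / sqrt(n)
    significant_difference = abs(centre / se) > t_two_sided
    ci_low = centre - t_one_sided * se
    ci_high = centre + t_one_sided * se
    equivalent = -limit < ci_low and ci_high < limit
    return significant_difference, equivalent


# Hypothetical data: n = 6, df = 5, t(0.975, 5) = 2.571, t(0.95, 5) = 2.015
differences = [1.5, -2.0, 3.0, -1.0, 2.5, -0.5]
significant, equivalent = difference_vs_equivalence(
    differences, limit=1.0, t_two_sided=2.571, t_one_sided=2.015)
print(f"significant difference: {significant}, equivalence shown: {equivalent}")
```

For these data, neither a difference nor equivalence can be demonstrated. The result is simply inconclusive, which is exactly why "not significant" must not be read as "equal".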

While studies aimed at demonstrating equivalence or non-inferiority are widespread in the pharmaceutical industry and are evaluated appropriately there (the corresponding tests have been referred to as equivalence tests since the 1990s), the laboratory diagnostics community is finding it extremely difficult to adopt this methodology. The first publication known to us [Lung KR, Gorko MA, Llewelyn J, Wiggins N. Statistical method for the determination of equivalence of automated test procedures. J Autom Methods Manag Chem 2003;25:123-7] had no effect.

We have published corresponding procedures for investigating carry-over, for demonstrating commutability, and for comparing methods [Keller T, Brinkmann T (2014). Proposed Guidance for Carryover Studies, Based on Elementary Equivalence Testing. Clin. Lab 7, 1153-61; Keller T, Weber S (2009): Statistical Test for Equivalence in Analysis of Commutability Experiments. CCLM 47, 376-377; Keller T, Faye S, Katzorke T (2011): Statistical Test for Equivalence in Analysis of Method Comparison Experiments. Application in comparison of AMH assays. CCLM 49:806].

This approach is now slowly finding its way into the community [Holland MD, Budd JR, et al. (2017): Improved statistical methods for evaluation of stability of in vitro diagnostic reagents, Stat Biopharm Res, 9:272-278], even though the test is not yet referred to as an equivalence test in the case of commutability [Nilsson G, Budd JR, Greenberg N, Delatour V, Rej R, Panteghini M, Ceriotti F, Schimmel H, Weykamp C, Keller T, Camara JE, Burns C, Vesper HW, MacKenzie F, Miller WG (2018). IFCC Working Group Recommendations for Assessing Commutability Part 2: Using the Difference in Bias Between a Reference Material and Clinical Samples. Clin Chem 64:455–464].

Figure: Carry-over as a non-inferiority problem, Fig. from Keller T, Brinkmann T (2014). Proposed Guidance for Carryover Studies, Based on Elementary Equivalence Testing. Clin. Lab 7, 1153-61