Panel discussion 2

Gordon Lan, Stacy Lindborg, James Neaton

Research output: Contribution to journalArticlepeer-review

Original languageEnglish (US)
Pages (from-to)736-740
Number of pages5
JournalClinical Trials
Issue number6
StatePublished - Dec 2012

Bibliographical note

Funding Information:
736 740 © The Author(s), 2012 2012 The Society for Clinical Trials Gordon Lan (Johnson & Johnson) : I would like to discuss estimation of treatment effect in the face of trial closure. However, I want to modify the topic a little bit in the sense that I will talk about boundary crossing instead of early stopping. For a design with four interim analyses and a final analysis, boundary crossing includes either crossing the boundary at an interim analysis, or going to the final analysis and crossing the boundary then. In other words, we consider those sample paths falling into the so- called rejection region. Dr Wittes tried hard to convince us that in a group sequential design, it is reasonable to consider all sample paths crossing the boundary to be more extreme than those that do not because the sample paths that do not cross any boundary fall into the acceptance region. When we are talking about sample paths crossing the boundary, we might be talking about the p-value being less than or equal to 0.025. So, for group sequential design studies, we are interested in estimation of the treatment effect if the boundary is crossed, or if the p-value is less than 0.025. Let me consider a simple case with a fixed sample size design trial and discuss estimation of the effect size if the p-value is less than 0.025, which is equivalent to a Z-value greater than 1.96. Note that there is no early termination for such a design. For example, I want to compare two means with the traditional Z statistic, and at the end of the study, I want to estimate the standardized treatment effect Δ = (µ A − µ B )/σ. Note that at the end of the study, I expect my Z-value to be EZ = Δ √(N/4), assuming a 1:1 patient allocation ratio. If I call EZ = θ, then θ is proportional to Δ. In the following, I will refer to the estimate of θ instead of Δ. If we have an estimate of θ, then you will have an estimate of Δ, and vice versa. For a fixed design, if we pick a one-sided α = 0.025 and 85% power, how do we determine the sample size N? Well, what do we expect of the final Z-value? For a fixed design, it is easy, that is, EZ = θ = Δ √(N/4). Equate this θ = Δ √(N/4) to z α + z β = 1.96 + 1.04 = 3 and then solve for N for a given Δ. For example, if Δ is 0.3, N = 400. This is what we do when we develop a protocol; however, keep in mind that θ is not exactly 3 unless the treatment effect Δ stated in the protocol is correct. Most of the time, the choice of Δ is not correct and θ is not 3, and we just have no idea what the exact value of Δ is. In other words, when we design such a study, only when the treatment effect Δ stated in the protocol is the correct one, do we expect θ to be 3. In general, the final Z-value provides an unbiased estimate of θ. Then what is the problem? The problem here is that we expect Z to be θ unconditionally. However, if we estimate the treatment effect only under the condition that the p-value is less than 0.025 (Z ≥ 1.96), then Z is NOT unbiased. If that is the case, we overestimate θ or Δ when Z ≥ 1.96 (and underestimate θ when Z < 1.96). Do we want to estimate θ or Δ conditional on Z ≥ 1.96 then? I am making an assumption that we are more interested in this case. How much is the bias? Well, it depends. If θ = 1.0, the bias is 1.49; if θ = 2.0, the bias is 0.7725, and so on. Therefore, the bias depends on the true value of θ. If θ is unknown, how can I adjust for the bias? If θ is known, why do I need this clinical trial? In other words, there is no way we can adjust for the bias, but just like Dr Wittes pointed out earlier, no matter what I say, it cannot be right, and no matter what I say, I cannot be wrong either. Let’s try a naïve approach. Remember that we are interested in the estimation of θ when Z ≥ 1.96 and Z is an overestimator. As an estimator of θ, we may reduce Z to a value closer to 1.96 but not going beyond it since, if Z is reduced to a value less than 1.96, we don’t even have a positive study. A similar argument applies to the estimation of θ when Z is less than 1.96. In other words, we try to adjust our estimate of θ from Z to a value closer to 1.96. I call this value Z, and here is an example with a shrinkage factor of 0.9. When Z is greater than 1.96, Z − 1.96 is the part exceeding 1.96. We may multiply this part by a factor of 0.9 and estimate θ by Z = 1.96 + 0.9(Z − 1.96). For example, if Z is 2.66, our estimate of θ would be 1.96 + 0.9(2.66 − 1.96) = 2.59, which is between 2.66 and 1.96. But this is so simple, almost no mathematics. If you are not happy with this simplicity, then we may also take a more sophisticated approach. Now I consider θ as random with a prior distribution N(1.96, τ 2 ) and rewrite E(Z|θ) = θ and Var(Z|θ) =1. Now the random vector (Z,θ) is bivariate normal. The conditional expectation θ = E(θ|Z) provides an estimator for θ. This corresponds to a shrinkage estimator toward 1.96 with a factor of τ 2 /(1 + τ 2 ). If I take τ 2 to be 9, then the shrinkage factor becomes 9/(1 + 9) = 0.9. This is same as the factor I used earlier, but the mathematics I use here is a lot more sophisticated. Now, can we apply this idea to the estimation of treatment effect in group sequential trials? My gut feeling is yes, but the mathematics is a lot more complicated and I’m still working on it. Dr Wittes said that for estimation of treatment effect in a group sequential trial, we have many choices, including the choice of a shrinkage factor. I used to work for Janet, so I have to agree with her. For a Bayesian approach, we may pick any prior distribution for θ. If we pick a normal prior distribution, then we have to choose a value for the variance τ 2 . While it looks as if we have solved almost all of the problems in sequential estimation, there is still one more problem left. That is: when we have too many choices, we have a problem. Stacy Lindborg (Biogen Idec) : My perspective and what I will be covering today comes from my time in industry. I cannot begin without at least commenting on the backdrop of what the pharmaceutical industry is facing. We are facing unprecedented challenges to our basic business model; ‘business as usual’ is really no longer an option. You’ve heard some of the cost estimates of clinical trials and I saw a few facial expressions in the audience asking ‘Is that really correct?’ The rising level of these costs is being seen in the face of at least three fundamental changes that are occurring. First there are substantial patent losses; it has been estimated that between 2010 and 2014, US$209 billion in revenue is going to be lost across large Pharma. Eli Lilly routinely spends 21% of its revenue in future research and development (R&D). So a substantial amount of its revenue gets funneled back into looking for new treatments. The second change is that in the past, industry trials were more based on regulatory challenges. Now pricing authorities around the world require evidence of both economic value as well as significant therapeutic value. The third change is that we are increasingly seeing higher regulatory hurdles to getting drugs on the market. The industry goal is to get the right answers as quickly as we can. We want to keep the price of drugs as low as we can, but we want to have great certainty about the decisions that we are making; we want to terminate or kill ineffective drugs as early as we can in the pipeline and get more information on drugs that patients will eventually see. We’ve spent quite a bit of time talking about adaptive designs, and I agree that we should be focusing on a broader term. I tend to use the phrase ‘clinical trial optimization’ instead of just purely adaptive designs, to encompass a broader set of solutions that we are bringing to the table. However, as we consider more novel designs that look very different from textbook designs, we see new hurdles. The amount of work required to work out the operational side should not be underestimated. I have been involved in exploring the implementation of adaptive designs, true ‘adaptive by design trials’, at the early stages of drug development. It was clear that the science made sense – you were coming out with more information on which to base decisions. But, often a small operational hurdle led to conversations about reverting to a more traditional design that everyone understood was worse. It was really quite amazing. We now have more tools available to do simulations and make informed decisions about the best designs to bring forward, but using these tools can be complicated. It is not always easy getting these trials accepted, whether by the company, contract, or academic research organizations (AROs) involved in the running the trials, or ultimately by regulators. We are constantly looking for ways to improve the way we are running trials. It is not unusual right now to enhance the contributions of AROs as we think about how we go after the sites that are going to give the highest quality, that really know the patients and the area, and that are going to be able to enroll in a predictable fashion. It’s very interesting as you are having discussions about your protocol and design that what has historically been an internal conversation now becomes a broader conversation. The concepts of risk and obligation take on a very different view if you are in the industry versus if you are a principal investigator managing a trial. Optimization, whether it relates to an adaptive design or some other approach, is not all about money and savings. In many cases, we have actually chosen to keep the trial the same length but gain richer information and therefore have more data on the experimental arm. Placebo response is something we are seeing to a greater degree, even in areas where we didn’t expect it. In schizophrenia trials, for example, even with trials one year in length and with a placebo arm for regulatory purposes, we are seeing placebo response rates as high as in depression trials, so how we are approaching our designs is drastically changing. I would say that what often used to drive the size of a placebo arm was the required exposure by regulators. Now placebo exposure is becoming a focus and a really critical part of our designs. Even when you are looking at dose response studies, there are times where we have more data on placebo patients than we do on any of the doses that we are exploring to ensure that we have an effective interpretive trial at the end. I am picking schizophrenia as an example, but there are many other areas where we are seeing a greatly inflated placebo response. With regard to the idea of substantial evidence , there really is nothing new in the regulatory documents. US Food and Drug Administration (FDA) produced a guideline in 1998 [ 1 ] that provided some explanation of what the FDA thinks about substantial evidence . However, there is an evolving and rising expectation as it relates to addressing regulatory endpoints as well as demonstrating the global health economic benefits, and it is playing out in terms of the design requirements that are required for inclusion in labeling. Missing data is a reality in every large trial. Based on the recent National Academy of Sciences report [ 2 ], we are taking a new tack in approaching this issue. In the past, we would have picked a primary analytical method and might have followed that with exploratory or sensitivity analysis. The trend now is to look at an entire suite of analyses and see what we can learn from the missingness. When I was new to industry, I lived the words that Dr Snapinn shared in terms of the lack of multiplicity control with multiple endpoints. I do think control of Type 1 error is fundamental in phase 3 trials. We can think about drug development as a giant diagnostic testing procedure where we are trying to diagnose whether a molecule works or not. We can learn a lot from the study of diagnostics that we can apply to managing our portfolio. We really need to be thinking about the decisions that we can make and the associated decision errors, thinking wisely about how we are managing our portfolio. James Neaton (University of Minnesota) : I will focus my comments more on Dr Buyse’s presentation because I’m particularly concerned about the large costs associated with site monitoring. There is no question it is expensive, and good evidence to guide how much we should do is lacking. A few years ago, I obtained some data from National Institutes of Health (NIH)-sponsored trials and the numbers were staggering: US : I will focus my comments more on Dr Buyse’s presentation because I’m particularly concerned about the large costs associated with site monitoring. There is no question it is expensive, and good evidence to guide how much we should do is lacking. A few years ago, I obtained some data from National Institutes of Health (NIH)-sponsored trials and the numbers were staggering: US$10,000 for a site visit in the US, which typically lasted a couple of days, and US$18,000 for an international visit. These are very high costs particularly when you begin to consider this was in the context of trials that involved several hundred sites. 0,000 for a site visit in the US, which typically lasted a couple of days, and US : I will focus my comments more on Dr Buyse’s presentation because I’m particularly concerned about the large costs associated with site monitoring. There is no question it is expensive, and good evidence to guide how much we should do is lacking. A few years ago, I obtained some data from National Institutes of Health (NIH)-sponsored trials and the numbers were staggering: US$10,000 for a site visit in the US, which typically lasted a couple of days, and US$18,000 for an international visit. These are very high costs particularly when you begin to consider this was in the context of trials that involved several hundred sites. 8,000 for an international visit. These are very high costs particularly when you begin to consider this was in the context of trials that involved several hundred sites. Dr Shepherd talked about a very logical sampling approach to do audits. It is unclear why it is not done more. Groups responsible for on-site monitoring do not appear receptive to that thinking. Instead, the idea is that we have to look at everything because we can’t let even one serious error get through. Two points concerning Dr Shepherd’s presentation: Dr. Shepherd asked about the applicability of his work to clinical trials. I think in trials you should focus on getting the randomization right, and then once you have done that make certain you follow all the participants and ascertain all the outcomes appropriately for the two treatment groups. That is where the quality assurance plans should be focused. The second point is that the truth may lie somewhere between what an auditor finds in the chart and what is reported. In some trials, we have implemented an adjudication process between the site and auditor because auditors were identifying events incorrectly as endpoints. I want to talk a little bit about how the cost–benefit of on-site monitoring can be experimentally studied, and I will talk briefly about an on-site monitoring we are about to initiate, about a case of data alteration, and other regulatory impediments to doing trials. First, with regard to monitoring, a recent survey of monitoring practices was carried out by the Clinical Trials Transformation Initiative [ 3 ]. The authors concluded that more research was needed to provide an evidence base for monitoring. One way to do this is along the lines that I suggested earlier for studying adherence interventions – embed a randomized study on monitoring into a planned trial. We have done that for a large HIV trial of early therapy. We did not design the study as a non-inferiority trial because it was not clear how to determine the non-inferiority margin. Instead, the trial is assessing the superiority of on-site monitoring at least once a year with local and central monitoring of the database versus local monitoring and central monitoring of the database alone. One of the issues that we discussed at length was the primary and secondary outcomes. In addition to possibly identifying more protocol violations and missed outcomes, training associated with on-site monitoring could lead to better enrollment and follow-up. Thus, these are also outcomes in the planned trial. Several years ago, a case of data alteration was discovered in the Multiple Risk Factor Intervention Trial (MRFIT), and described in the study by Neaton et al . [ 4 ]. The alleged alteration was brought to the attention of NIH by a reporter. We analyzed the data from the site for several weeks, for evidence of data alteration. Ultimately, we went to the site and reviewed individual case report forms for the time period in question, and during other time periods as control. We looked for markouts of the data. Markouts occurred frequently – a mistake was made and a correction was noted on the case report form. These were usually initialed and dated as instructed. Over the time period in question, markouts were much more frequent, and the markouts made 22 participants who were ineligible before the markout, eligible afterwards. It was determined that the data had been inappropriately altered. We concluded that such data alterations would be very difficult to detect with any quality control scheme and that the focus should be on discussing misconduct and fraud with the investigators with the aim of preventing it from occurring. In addition to site monitoring, there are many other costs associated with regulatory guidelines. I will mention a few related to the European Union (EU) guidance document on clinical trials [ 5 ]. NIH has recently been advised by their counsel that they should not sponsor trials in the EU because of insurance and indemnification requirements. This concern delayed the initiation of a large HIV trial [ 6 ] and could jeopardize the participation of investigators in Europe in NIH-funded studies. Trial insurance costs for each EU member state are highly variable for reasons that are not clear. Another area of major costs relates to the labeling and distribution of study drugs. This also requires harmonization and simplification that I believe could substantially reduce the cost of international trials. In thinking about Dr Wittes’ presentation on estimation of effect size in trials stopped early, I begin to wonder just how many clinical trials are stopped early. Montori et al . [ 7 ] indicated that fewer than 1% of trials published in top-impact journals had been stopped early. A recent paper in Clinical Trials [ 8 ] notes that the impact on estimation is going to depend on the information fraction when the trial is stopped. So for some of the 1% of trials that stopped early, the overestimation of the effect size will be small. In many situations, one may be able to estimate the regression effect by the time of publication because following the decision to stop, there may be additional events through the chosen closing date that are reported and/or adjudicated by an event committee. One other related point is that sometimes we greatly mis-estimate the effect of a treatment based on a trial stopped early because we later discover that the benefit of the treatment changes with longer follow-up. This is evident from an overview of early trials of AZT [ 9 ]. Several studies were stopped early for benefit, but we later learned that the benefit waned with follow-up due to drug resistance. AZT had some benefit for about 18 months, and then it had no benefit. Other HIV examples relevant to this discussion were early studies of a new class of drugs for HIV, the protease inhibitors. Two studies were stopped early. In one of the papers, the trial of indinavir stopped early [ 10 ]. The authors addressed the issue of overestimating the effect size, noting that it was likely but they were not sure how to adjust the estimate. In contrast to the AZT example, we later learned that the effects of the protease inhibitors in these studies were likely underestimated. With durable viral load suppression, CD4+ counts continued to rise and risk of AIDS and death was reduced more relative to controls with longer follow-up. With regard to Dr Snapinn’s paper, I think we are stuck with multiple endpoints. He is right that terminology is not good and appears to be getting worse. However, I would be very reluctant to give up the term ‘primary’. It took a while to get investigators to focus on a major outcome for the comparison. Most trials have secondary outcomes, and in the regulatory setting where labeling is a consideration, I would try to reduce the length of the list and, in some cases, it may be reasonable to carry out a global test before performing individual endpoint comparisons to control type 1 error. Participants Gordon Lan : Senior Director of Quantitative Decision Strategy at Janssen, Pharmaceutical Companies of Johnson and Johnson, Raritan, New Jersey Stacy Lindborg : Sr. Director, Biostatistics at Biogen Idec. Cambridge, Massachusetts James Neaton : Professor of Biostatistics, University of Minnesota

Cite this