SUPPLEMENTARY MATERIALS
Statistical Plan:
In a hypothesis testing setting:
Primary Objective
H0: At a threshold corresponding to 98% specificity in non-failures, sensitivity is 15% in failures within 5 years.
H1: Sensitivity is at least 30% for failures within 5 years at the threshold corresponding to 98% specificity.
Secondary Objective
H0: At a threshold corresponding to 98% sensitivity in failures within 5 years, specificity is 10% in non-failures.
H1: Specificity is at least 25% at the threshold corresponding to 98% sensitivity for failures within 5 years.
Justification: Based on preliminary data from the University of Washington and Stanford University, Gleason score at a threshold of 8 or greater exhibits 15% sensitivity at 98% specificity. For this study, we assume that, to be considered clinically useful for the primary objective of identifying recurrent disease, a biomarker must demonstrate sensitivity at 98% specificity double that of the Gleason score, or approximately 30%. Hence, we will estimate the sample size needed to achieve 90% power to detect sensitivity of 30% or greater at 98% specificity. For the secondary objective, we will consider further validation of a candidate biomarker if it can correctly identify at least 25% of men with extremely low risk of recurrence while maintaining 98% sensitivity in detecting failures. Therefore, the primary outcome of this validation study is the sensitivity corresponding to 98% specificity in non-failures at 5 years post-RP. The secondary outcome is the specificity corresponding to 98% sensitivity in failures at 5 years post-RP.
Data analysis: For discovery and model building, Cox regression for the case-cohort design will be used to analyze the time-to-recurrence data. The sampling probabilities for the different sites will be used as weighting factors to make the analysis relevant to the whole RP cohort [27]. The linear score obtained from Cox regression will be treated as a composite biomarker, and a threshold will be identified for the desired sensitivity/specificity. For the validation analysis, we assume that the test, either a single biomarker or a composite biomarker, has been developed in the discovery study or in other studies. For the primary objective, sensitivity at 98% specificity will be estimated from the data and a 95% one-sided confidence interval will be constructed. The sampling probabilities will be used to weight the different groups when calculating sensitivity/specificity for the whole cohort, making the estimates more applicable to the general RP population. If the biomarker is a 4-level ordinal single biomarker, we will calculate sensitivity and its 95% confidence interval at the level corresponding to at least 98% specificity (assuming it is achievable). If the confidence interval covers the null hypothesized sensitivity of 0.15, we will conclude that the candidate biomarker does not meet our criterion of performing significantly better than 15% sensitivity at 98% specificity. Similarly, for the secondary objective, we will construct a 95% one-sided confidence interval for the specificity at 98% sensitivity. If the confidence interval covers the null hypothesized specificity of 0.10, we will conclude that the candidate biomarker does not meet our criterion of performing significantly better than 10% specificity at 98% sensitivity.
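The primary-objective estimate and decision rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the analysis code used in the study: the weights are taken to be inverse sampling probabilities, the threshold is the weighted 98th percentile of scores in non-failures, and the one-sided lower confidence bound is obtained by a percentile bootstrap, which is our choice for the sketch because the plan does not specify how the interval is constructed. All function names are hypothetical.

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share is at least q."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    cum = np.cumsum(weights[order]) / weights.sum()
    idx = min(np.searchsorted(cum, q, side="left"), values.size - 1)
    return values[order][idx]

def sens_at_spec(score, is_case, weight, spec=0.98):
    """Weighted sensitivity at the threshold giving at least `spec` specificity."""
    score = np.asarray(score, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    weight = np.asarray(weight, dtype=float)
    # Threshold: weighted `spec` quantile of scores among non-failures.
    thr = weighted_quantile(score[~is_case], weight[~is_case], spec)
    positive = score > thr
    sens = np.sum(weight[is_case] * positive[is_case]) / np.sum(weight[is_case])
    return sens, thr

def lower_bound(score, is_case, weight, spec=0.98, level=0.95,
                n_boot=2000, seed=0):
    """One-sided lower bound via percentile bootstrap (an assumed method)."""
    score = np.asarray(score, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    weight = np.asarray(weight, dtype=float)
    rng = np.random.default_rng(seed)
    cases = np.flatnonzero(is_case)
    controls = np.flatnonzero(~is_case)
    est = np.empty(n_boot)
    for b in range(n_boot):
        # Resample cases and controls separately, with replacement.
        idx = np.concatenate([rng.choice(cases, cases.size),
                              rng.choice(controls, controls.size)])
        est[b], _ = sens_at_spec(score[idx], is_case[idx], weight[idx], spec)
    return np.quantile(est, 1 - level)

def meets_primary_criterion(score, is_case, weight, null_sens=0.15):
    """True when the one-sided interval excludes the null sensitivity."""
    return lower_bound(score, is_case, weight) > null_sens
```

The secondary objective could be handled by the same functions after swapping the roles of failures and non-failures and negating the score, with a null value of 0.10 in place of 0.15.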
To examine the effect of surgery site on biomarker levels, we will use a Cox regression model that includes the biomarker, a site indicator, and site-by-biomarker interaction terms. If the interaction term is statistically significant, we will first investigate whether site-specific standardization (such as using the percentile in controls within site) will eliminate such interactions and enable us to create a combined ROC curve.
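One possible form of the site-specific standardization mentioned above is sketched below: each biomarker value is replaced by the fraction of controls from the same site whose value is at or below it. The function name and the tie-handling convention (counting ties as "at or below") are assumptions made for this illustration, and the sketch assumes every site contributes at least one control.

```python
import numpy as np

def percentile_in_site_controls(value, site, is_case):
    """Map each value to its percentile among controls from the same site."""
    value = np.asarray(value, dtype=float)
    site = np.asarray(site)
    is_case = np.asarray(is_case, dtype=bool)
    out = np.empty_like(value)
    for s in np.unique(site):
        in_site = site == s
        # Reference distribution: this site's controls, sorted.
        ref = np.sort(value[in_site & ~is_case])
        # Fraction of site controls with a value <= each observation.
        out[in_site] = np.searchsorted(ref, value[in_site], side="right") / ref.size
    return out
```

After this transformation, a value that exceeds three quarters of a site's controls maps to 0.75 regardless of that site's measurement scale, which is what would allow samples from different sites to be pooled into a single ROC curve.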
Sample size: Sample size calculations are performed using methods that optimize the ratio of cases to controls while minimizing the total number of samples, as described by Pepe [28].
Pepe provides the ratio of recurrent samples to non-recurrent samples that minimizes the total number of samples required. The optimal ratio would be k=0.45 for the primary objective and k=2.2 for the secondary objective. However, since both tests are of interest and all of the cases and controls will be used to evaluate both tests, we use k=1, i.e. a 1:1 ratio of cases to controls. Using Pepe's sample size formula and a parallel calculation for the secondary objective, we estimate that 391 and 463 cases are needed to achieve adequate power for the primary and secondary objectives, respectively. Taking the larger value, 463, and inflating it to account for failed assays, the final sample size is 525 recurrent cases and 525 controls. Each of at least five sites will contribute 105 recurrent and 105 non-recurrent samples for this validation study. Each site should strive to contribute equal numbers of cases and controls. In particular, the imbalance between the number of cases and controls from each site should be less than 20%.
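The allocation arithmetic in this paragraph can be checked with a short sketch. The variable names are illustrative, and the definition of imbalance used here (absolute difference divided by the larger of the two counts) is an assumption, since the plan does not define how the 20% is measured.

```python
# Figures taken from the text above.
n_required = 463   # larger of the two per-objective case counts
n_final = 525      # inflated count of cases (and of controls)
n_sites = 5        # minimum number of contributing sites

# Equal split across five sites.
per_site = n_final // n_sites

# Share of assays that could fail while still leaving n_required usable cases.
tolerated_failure_rate = 1 - n_required / n_final

def site_imbalance(n_cases, n_controls):
    """Relative case/control imbalance for one site (assumed definition)."""
    return abs(n_cases - n_controls) / max(n_cases, n_controls)

def site_is_balanced(n_cases, n_controls, limit=0.20):
    """True when the site's imbalance is below the stated 20% limit."""
    return site_imbalance(n_cases, n_controls) < limit
```

Under these figures, each of five sites contributes 105 samples per group, and the inflation from 463 to 525 leaves room for roughly 12% of assays to fail.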
The sample size calculated here is adequate for a confirmatory or validation study on a selected biomarker or panel of biomarkers in which the combination rule is specified during the discovery stage. To enable discovery of new candidate biomarkers, we will create a second set of TMAs composed of 105 recurrent and 105 non-recurrent samples from each site. These discovery TMAs will be split into two rounds of triage. The first round, consisting of a small number of samples, will be used for initial selection. The second round, consisting of the remaining samples, will be reserved for those tissue markers demonstrating promising performance in the first round. Typically, sample size calculations are not performed for triage sets because of the large number of factors that impact power and sample size, e.g. the total number of candidate markers, the possibility of iterative selection, etc. However, we believe a total sample size of 525 recurrent and 525 non-recurrent samples will be sufficient.