Two-phase designs and calibration
Thomas Lumley
UW Biostatistics
Copenhagen | 2008-4-3

Two-phase sampling

- Phase 1: sample people according to some probability design $\pi_{1,i}$, measure variables.
- Phase 2: subsample people from the phase 1 sample using the variables measured at phase 1, $\pi_{2|1,i}$, and measure more variables.

Sampling probability $\pi_i = \pi_{1,i}\,\pi_{2|1,i}$, and use $1/\pi_i$ as sampling weights.

These are not (in general) the marginal probability that $i$ is in the sample, because $\pi_{2|1,i}$ may depend on which other observations are in the phase-one sample.

Two-phase sampling

For all the designs today $\pi_{1,i}$ is constant, so its value doesn't affect anything except estimated population totals.

We don't care about population totals in this setting, so we don't need to know the value of $\pi_{1,i}$ (e.g. can set it to 1).

An alternative view is that we are using model-based inference at phase one, rather than sampling-based inference.

Two-phase sampling

Two sources of uncertainty:
- Phase-1 sample is only part of the population
- Phase 2 observes the full set of variables on only a subset of people.

Uncertainties at the two phases add.

Minimal phase 1

The classic survey example is two-phase sampling for stratification. This is useful when a good stratifying variable is not available for the population but is easy to measure.
- Take a large simple random sample or cluster sample and measure stratum variables
- Take a stratified random sample from phase 1 for the survey

If the gain from stratification is larger than the cost of phase 1 we have won.

The case-control design is probably the most important example, although it isn't analyzed using sampling weights.

Minimal phase 2

Many large cohorts exist in epidemiology. These can be modelled as simple random samples. They have a lot of variables measured.

It is common to want to measure a new variable:
- New assay on stored blood
- Coding of open-text questionnaire
- Re-interview

The classic designs are a simple random sample and a case-control sample.

It is often more useful to sample based on multiple phase-1 variables: outcome, confounders, surrogates for phase-2 variable.

Analysis

The true sampling is two-phase:
population → cohort → subcohort
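As a concrete sketch of the population → cohort → subcohort structure, the snippet below draws a stratified phase-2 subsample from a cohort and computes the weights $1/\pi_i$ with $\pi_{1,i}$ set to 1. The function name `two_phase_weights`, the cohort sizes and the phase-2 sample sizes are hypothetical, chosen only for illustration; this is not code from the slides.

```python
import random

def two_phase_weights(strata, n_per_stratum, pi1=1.0, seed=0):
    """Draw a stratified phase-2 subsample; return {cohort index: weight}.

    strata        : phase-1 stratum label for each cohort member
    n_per_stratum : dict mapping stratum label -> phase-2 sample size
    pi1           : constant phase-1 probability (set to 1 here)
    """
    rng = random.Random(seed)

    # Group cohort members by the stratum variable measured at phase 1.
    members = {}
    for i, h in enumerate(strata):
        members.setdefault(h, []).append(i)

    weights = {}
    for h, idx in members.items():
        n_h = min(n_per_stratum[h], len(idx))
        # pi_{2|1,i}: depends on how many people landed in stratum h at phase 1.
        pi2 = n_h / len(idx)
        for i in rng.sample(idx, n_h):
            weights[i] = 1.0 / (pi1 * pi2)  # sampling weight 1 / pi_i
    return weights

# Hypothetical cohort: 50 cases and 950 controls.
strata = ["case"] * 50 + ["control"] * 950
# Phase 2: keep every case, subsample 100 controls.
w = two_phase_weights(strata, {"case": 50, "control": 100})
```

With these made-up numbers every case gets weight 1 and every sampled control gets weight 9.5, so the weights add back up to the cohort size.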
We can ignore the first phase and pretend that we had an unstratified population sample:
population → subcohort

This is conservative, but sometimes not very. It was used before software for e.g. Cox models in two-phase samples was developed.

Two-phase case-control

The case-control design stratifies on $Y$. We can stratify on $X$ as well.

       X=0   X=1
Y=0     a     b    m0
Y=1     c     d    m1
        n0    n1

The estimated variance of $\hat\beta$, the log odds ratio, is
$\widehat{\mathrm{var}}[\hat\beta] = \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}$

Ideally we want all cells about the same size in this trivial case.

Example: Wilms Tumor

- Wilms Tumor is a rare childhood cancer of the kidney. Prognosis is good in early stage or favourable histology disease.
- Histology is difficult to determine. NWTSG central pathologist is much better than anyone else.
- To reduce cost of followup, consider central histology only for a subset of cases.
- Sample all relapses, all patients with unfavorable histology by local pathologist, 10% of remainder

Breslow [...] t) = 0
where $Z_{\text{event}}$ is the covariate for the person having an event and $E(\beta, t)$ is the expected covariate based on a weighted average over everyone at risk at that time.

Case-cohort design

- Prentice (1986) suggested estimating $E(\beta, t)$ from the current case plus a subcohort selected at random at baseline, saving money and computation time.
- Self [...] t) for mathematical simplicity.
- Barlow used the subcohort and all cases, with the conservative single-phase sampling standard errors
- Lin [...] no hang-ups about predictable weight processes: weight of 1 for any future case.

Computation

Prentice, Self [...]), find calibration weights $g_i$ minimizing
$$\sum_{R_i=1} d(g_i, 1)$$
subject to the calibration constraints
$$\sum_{i=1}^{N} x_i = \sum_{R_i=1} \frac{g_i}{\pi_i}\, x_i$$

A Lagrange multiplier argument shows that $g_i = \eta(x_i\beta)$ for some $\eta()$, $\beta$; and $\beta$ can be computed by iteratively reweighted least squares.

For example, we can choose $d(\cdot,\cdot)$ so that the $g_i$ are bounded below (and above).

Deville et al JASA 1993; JNK Rao et al, Sankhya 2002
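The calibration constraint can be illustrated with the simplest choice of distance. The sketch below assumes a chi-squared distance weighted by $1/\pi_i$, i.e. it minimizes $\sum_{R_i=1} (g_i - 1)^2 / (2\pi_i)$; that choice is an assumption made here for illustration, not something the slides specify. Under it $\eta(u) = 1 + u$, so $g_i = 1 + x_i\lambda$ and the multiplier $\lambda$ solves a single linear system, with no iteration needed. The helper names `solve` and `linear_calibration` and the toy data are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def linear_calibration(x_sample, pi, totals):
    """Return g_i = 1 + x_i' lam with sum_i (g_i / pi_i) x_i == totals.

    x_sample : auxiliary vectors x_i for the phase-2 sample (R_i = 1)
    pi       : sampling probabilities pi_i for the same units
    totals   : sum of x_i over the whole phase-1 sample (i = 1..N)
    """
    p = len(totals)
    # A = sum_i x_i x_i' / pi_i
    A = [[sum(x[j] * x[k] / q for x, q in zip(x_sample, pi))
          for k in range(p)] for j in range(p)]
    # r = totals - sum_i x_i / pi_i (shortfall of the uncalibrated estimate)
    r = [totals[j] - sum(x[j] / q for x, q in zip(x_sample, pi))
         for j in range(p)]
    lam = solve(A, r)
    return [1.0 + sum(l * xj for l, xj in zip(lam, x)) for x in x_sample]

# Toy phase-1 cohort: x_i = (1, z_i) with z = 1..6, so totals = (6, 21).
totals = [6.0, 21.0]
# Phase-2 sample: the units with z = 1, 3, 6, each with pi_i = 0.5.
x_sample = [(1.0, 1.0), (1.0, 3.0), (1.0, 6.0)]
pi = [0.5, 0.5, 0.5]
g = linear_calibration(x_sample, pi, totals)
```

Before calibration the weighted total of $z$ is 20; after multiplying the weights by $g_i$ it matches the phase-1 total of 21. This linear distance does not bound the $g_i$; a bounded choice of $d(\cdot,\cdot)$ would replace the single solve with an iterative one.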