Richard Valliant is a mathematical statistician in the Office of Survey Methods Research, U.S. Bureau of Labor Statistics. He thanks the referees for their useful comments. Any opinions expressed are those of the author and do not reflect policy of the Bureau of Labor Statistics.

Post-Stratification and Conditional Variance Estimation

Richard Valliant
U.S. Bureau of Labor Statistics
Room 2126, 441 G St. NW
Washington DC 20212

February 1992

ABSTRACT

Post-stratification estimation is a technique used in sample surveys to improve efficiency of estimators. Survey weights are adjusted to force the estimated numbers of units in each of a set of estimation cells to be equal to known population totals. The resulting weights are then used in forming estimates of means or totals of variables collected in the survey. For example, in a household survey the estimation cells may be based on age/race/sex categories of individuals and the known totals may come from the most recent population census. Although the variance of a post-stratified estimator can be computed over all possible sample configurations, inferences made conditionally on the achieved sample configuration are desirable. Theory and a simulation study using data from the U.S. Current Population Survey are presented to study both the conditional bias and variance of the post-stratified estimator of a total. The linearization, balanced repeated replication, and jackknife variance estimators are also examined to determine whether they appropriately estimate the conditional variance.

Keywords: Asymptotic properties; Balanced repeated replication; Jackknife variance estimation; Linearization variance estimation; Superpopulation model.


1. INTRODUCTION

In complex large-scale surveys, particularly household surveys, post-stratification is a commonly used technique for improving efficiency of estimators. A clear description of the method and the rationale for its use was given by Holt and Smith (1979) and is paraphrased here. Values of variables for persons may vary by age, race, sex, and other demographic factors that are unavailable for sample design at the individual level. A population census may, however, provide aggregate information on such variables that can be used at the estimation stage. After sample selection, individual units are classified according to the factors and the known total number of units in the $c$th cell, $M_c$, is used as a weight to estimate the cell total for some target variable. The cell estimates are then summed to yield an estimate for the full population. A variety of government-sponsored household surveys in the United States use this technique, including the Current Population Survey, the Consumer Expenditure Survey, the National Health Interview Survey, and the Survey of Income and Program Participation.
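The cell-weighting idea described above can be sketched in a few lines of Python. This is only an illustration under a strong simplifying assumption, namely simple random sampling, so that each cell mean is an unweighted sample mean. The function name `post_stratified_total` and the toy data are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def post_stratified_total(sample, cell_totals):
    """Estimate a population total as the sum over cells of M_c * (cell sample mean).

    sample      -- list of (cell_label, y_value) pairs for the sampled units
    cell_totals -- dict mapping cell_label -> known population count M_c
    Assumes simple random sampling (an assumption of this sketch).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cell, y in sample:
        sums[cell] += y
        counts[cell] += 1

    total = 0.0
    for cell, M_c in cell_totals.items():
        if counts[cell] == 0:
            # An empty cell has no sample mean; cells would need to be collapsed.
            raise ValueError(f"no sample units in cell {cell!r}")
        total += M_c * sums[cell] / counts[cell]
    return total


# Toy data: two cells with known population sizes 100 and 50.
sample = [("A", 2.0), ("A", 4.0), ("B", 10.0)]
cell_totals = {"A": 100, "B": 50}
estimate = post_stratified_total(sample, cell_totals)
```

Here the number of sample units falling in each cell (two in "A", one in "B") is itself an outcome of sampling, which is the source of the conditional versus unconditional question taken up next.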

Because post-stratum identifiers are unavailable at the design stage, the number of sample units selected from each post-stratum is a random variable. Inferences can be made either unconditionally, i.e., across all possible realizations of the post-strata sample sizes, or conditionally given the achieved sample sizes. In a simpler situation than that considered here, Durbin (1969) maintained, on grounds of common sense and the ancillarity of the achieved sample size, that conditioning was appropriate. In the case of post-stratification in conjunction with simple random sampling of units, Holt and Smith (1979) argue strongly that inferences should be conditioned on the achieved post-stratum sample sizes.

Although conditioning is, in principle, a desirable thing to do, a design-based conditional theory for complex surveys may be intractable, as noted by Rao (1985). A useful alternative is the prediction or superpopulation approach, which is applied in this paper to make inferences from post-stratified samples. We will concentrate especially on the properties of several commonly used variance estimators to determine whether they estimate the conditional variance of the post-stratified estimator of a finite population total.

Section 2 introduces notation, a superpopulation model that will be used to study properties of various estimators, and a class of estimators which will be used as the starting point for post-stratification estimation. Section 3 discusses the model bias and variance of estimators of the total, while sections 4 through 6 cover the linearization, balanced repeated replication, and jackknife variance estimators. In section 7 we present the results of a simulation study using data from the U.S. Current Population Survey, and the last section gives concluding remarks.

2. NOTATION AND MODEL

The population of units is divided into $H$ design strata with stratum $h$ containing $N_h$ clusters. Cluster $(hi)$ contains $M_{hi}$ units with the total number of units in stratum $h$ being $M_h = \sum_{i=1}^{N_h} M_{hi}$ and the total in the population being $M = \sum_{h=1}^{H} M_h$. A two-stage sample is selected from each stratum consisting of $n_h \ge 2$ sample clusters and a subsample of $m_{hi}$ sample units within sample cluster $(hi)$. The total number of clusters in the sample is $n = \sum_h n_h$. The set of sample clusters from stratum $h$ is denoted by $s_h$ and the subsample of units within sample cluster $(hi)$ by $s_{hi}$.

Associated with each unit in the population is a random variable $y_{hij}$ whose finite population total is $T = \sum_{h=1}^{H} \sum_{i=1}^{N_h} \sum_{j=1}^{M_{hi}} y_{hij}$. Each unit is also a member of a class or post-stratum indexed by $c$. Each post-stratum can cut across the design strata and the set of all population units in post-stratum $c$ is denoted by $S_c$. The total number of units in post-stratum $c$ is $M_c = \sum_{h=1}^{H} \sum_{i=1}^{N_h} \sum_{j=1}^{M_{hi}} \delta_{hijc}$, where $\delta_{hijc} = 1$ if unit $(hij)$ is in post-stratum $c$ and is 0 if not. We assume that the post-stratum sizes $M_c$ are known. Our goal here will be to study the properties of estimators under the following superpopulation model:

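$$
\begin{aligned}
E(y_{hij}) &= \mu_c, \quad (hij) \in S_c, \\
\operatorname{cov}(y_{hij}, y_{h'i'j'}) &=
\begin{cases}
\sigma^2_{hic} & h = h',\ i = i',\ j = j',\ (hij) \in S_c, \\
\sigma^2_{hic}\,\rho_{hic} & h = h',\ i = i',\ j \ne j',\ (hij) \in S_c,\ (h'i'j') \in S_c, \\
\tau_{hicc'} & h = h',\ i = i',\ j \ne j',\ (hij) \in S_c,\ (h'i'j') \in S_{c'},\ c \ne c', \\
0 & \text{otherwise.}
\end{cases}
\end{aligned}
\tag{1}
$$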

In addition to being uncorrelated, we also assume that the $y$'s associated with units in different clusters are independent. The model assumes that units in a post-stratum have a common mean $\mu_c$ and are correlated within a cluster. The sizes of the covariances $\sigma^2_{hic}\rho_{hic}$ and $\tau_{hicc'}$ are allowed to vary among the clusters and also depend on whether or not units are in the same post-stratum. The variance specification $\sigma^2_{hic}$ is quite general, depending on the design stratum, cluster, and post-stratum associated with the unit. All expectations in the subsequent development are with respect to model (1) unless otherwise specified.
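To make the covariance structure of model (1) concrete, the following Python sketch assembles the covariance matrix of the $y$'s for the units of a single cluster $(hi)$. It is an illustration only. The function name `cluster_covariance`, the dictionary layout for $\sigma^2_{hic}$, $\rho_{hic}$, and $\tau_{hicc'}$, and the numeric values are all hypothetical. The sketch stores $\tau_{hicc'}$ once per unordered pair of post-strata, which is valid because a covariance is symmetric in its arguments.

```python
def cluster_covariance(cells, sigma2, rho, tau):
    """Covariance matrix of the y's for the units in one cluster under model (1).

    cells  -- list giving the post-stratum label c of each unit in the cluster
    sigma2 -- dict c -> sigma^2_{hic} (variance of a unit in post-stratum c)
    rho    -- dict c -> rho_{hic} (correlation, same cluster, same post-stratum)
    tau    -- dict (c, c') with c < c' -> tau_{hicc'} (covariance, same cluster,
              different post-strata); layout is an assumption of this sketch
    """
    n = len(cells)
    V = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            cj, ck = cells[j], cells[k]
            if j == k:
                V[j][k] = sigma2[cj]                 # variance of a single unit
            elif cj == ck:
                V[j][k] = sigma2[cj] * rho[cj]       # same post-stratum
            else:
                V[j][k] = tau[(min(cj, ck), max(cj, ck))]  # different post-strata
    return V


# Toy cluster: two units in post-stratum 0 and one unit in post-stratum 1.
cells = [0, 0, 1]
sigma2 = {0: 4.0, 1: 9.0}
rho = {0: 0.5, 1: 0.2}
tau = {(0, 1): 1.0}
V = cluster_covariance(cells, sigma2, rho, tau)
```

Units in different clusters are independent under the model, so the covariance matrix for the whole sample would be block diagonal with one such block per cluster.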

The general type of estimator of $T$ that we will consider has the form

$$
\hat{T} = \sum_h \sum_{i \in s_h} \gamma_{hi} \hat{T}_{hi} \tag{2}
$$

where $\gamma_{hi}$ is a coefficient that does not depend on the $y$'s, $\hat{T}_{hi} = M_{hi} \bar{y}_{hi}$, and $\bar{y}_{hi} = \sum_{j \in s_{hi}} y_{hij} / m_{hi}$. In common survey practice, the set of $\gamma_{hi}$ is selected to produce a design-unbiased or design-consistent estimator of the total under the particular probability sampling design being used. Alternatively, estimator (2) can be written as

$$
\hat{T} = \sum_h \sum_{i \in s_h} \sum_c K_{hic} \bar{y}_{hic} \tag{3}
$$

where $K_{hic} = \gamma_{hi} M_{hi} m_{hic} / m_{hi}$, $m_{hic}$ is the number of sample units in sample cluster $(hi)$ that are part of post-stratum $c$, and $\bar{y}_{hic} = \sum_{j \in s_{hi}} y_{hij} \delta_{hijc} / m_{hic}$. If $m_{hic} = 0$, then define $\bar{y}_{hic} = 0$. There are a variety of estimators, both from probability sampling theory and superpopulation theory, that fall in this class. Six examples are given in Valliant (1987) and include types of separate ratio and regression estimators with $M_{hi}$ used as the auxiliary variable. For example, the ratio estimator has $\gamma_{hi} = M_h / \sum_{i \in s_h} M_{hi}$, and the regression estimator has

$$
\gamma_{hi} = \frac{N_h}{n_h} \left[ 1 + \frac{n_h \left( \bar{M}_h - \bar{M}_{hs} \right) \left( M_{hi} - \bar{M}_{hs} \right)}{\sum_{i \in s_h} \left( M_{hi} - \bar{M}_{hs} \right)^2} \right]
$$

where $\bar{M}_h$ and $\bar{M}_{hs}$ are population and sample means per cluster of the $M_{hi}$'s. Also included in the class defined by (2) is the Horvitz-Thompson estimator when clusters are selected with probabilities proportional to $M_{hi}$ and units within clusters are selected with equal probability, in which case $\gamma_{hi} = M_h / (n_h M_{hi})$. Note that, as discussed in section 3, the estimators defined by (2) are not necessarily model-unbiased under (1).
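The equivalence of forms (2) and (3) can be checked numerically. The Python sketch below does so for a single stratum using the Horvitz-Thompson coefficients, i.e. each coefficient equals the stratum unit count divided by the product of the number of sample clusters and the cluster size. The data layout (a list of dicts with keys `"M"` and `"units"`), the function names, and the toy numbers are hypothetical and chosen only for illustration.

```python
def ht_gamma(M_h, clusters):
    """Horvitz-Thompson coefficients gamma_hi = M_h / (n_h * M_hi) for one stratum."""
    n_h = len(clusters)
    return [M_h / (n_h * cl["M"]) for cl in clusters]


def estimator_form2(clusters, gamma):
    """Form (2): sum over sample clusters of gamma_hi * M_hi * ybar_hi."""
    total = 0.0
    for cl, g in zip(clusters, gamma):
        m_hi = len(cl["units"])
        ybar_hi = sum(y for _, y in cl["units"]) / m_hi
        total += g * cl["M"] * ybar_hi
    return total


def estimator_form3(clusters, gamma):
    """Form (3): sum over clusters and post-strata of K_hic * ybar_hic."""
    total = 0.0
    for cl, g in zip(clusters, gamma):
        m_hi = len(cl["units"])
        by_cell = {}
        for cell, y in cl["units"]:
            by_cell.setdefault(cell, []).append(y)
        # Post-strata with m_hic = 0 contribute nothing, matching ybar_hic = 0.
        for ys in by_cell.values():
            m_hic = len(ys)
            K_hic = g * cl["M"] * m_hic / m_hi
            total += K_hic * (sum(ys) / m_hic)
    return total


# Toy stratum with M_h = 100 units and two sample clusters.
clusters = [
    {"M": 20, "units": [("a", 1.0), ("b", 3.0)]},
    {"M": 30, "units": [("a", 2.0), ("a", 4.0), ("b", 6.0)]},
]
gamma = ht_gamma(100, clusters)
t2 = estimator_form2(clusters, gamma)
t3 = estimator_form3(clusters, gamma)
```

The two forms agree because the cluster sample mean is the average of the post-stratum means within the cluster, weighted by each post-stratum's share of the cluster's sample units.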

Next, we turn to the definition of the post-stratified estimator of the total. The usual design-based estimator of $M_c$ in class (2) is found by using $\delta_{hijc}$ in place of $y_{hij}$ in (3) and omitting the sum over $c$, which gives $\hat{M}_c = \sum_h \sum_{i \in s_h} K_{hic}$. The post-stratified estimator of the total $T$ is then defined as

$$
\hat{T}_{ps} = \sum_c \hat{R}_c \hat{T}_c \tag{4}
$$

where $\hat{R}_c = M_c / \hat{M}_c$ and $\hat{T}_c = \sum_h \sum_{i \in s_h} K_{hic} \bar{y}_{hic}$. With this notation the general estimator (3) can also be written as $\hat{T} = \sum_c \hat{T}_c$. For subsequent

calculations it will be convenient to w