Survey Sampling
Statistics 4234/5234 — Fall 2018
Take-home final exam
The following problems are due in Room 203 Mathematics between 7:10pm and 8:00pm on
Tuesday, December 18. You can also submit your paper to the course mailbox in Room 904
SSW, any time before 7:00pm on Tuesday, December 18.
You are not to discuss these problems with anyone other than the instructor, nor consult any
published or on-line reference other than Sampling: Design and Analysis, by Sharon L. Lohr.
Please refer to the Homework requirements section of the Course Information document posted
at the beginning of the course. A portion of your score on the final exam will be based on
presentation; any paper that fails to comply with items 2 through 8 of those requirements will
not earn the presentation points.

  1. Consider the population of size N = 50 given in the file FinalPop.csv, in the “Data”
    folder on Courseworks. Verify that the population mean and variance are $\bar{y}_U = 14.882$
    and $S^2 = 0.977457$, respectively.
    (a) Find the design-based bias and MSE of the sample mean for an SRS of size n = 10,
    that is, find $E(\bar{y} - \bar{y}_U)$ and $E[(\bar{y} - \bar{y}_U)^2]$.
    (b) Consider a stratified random sample of 2 units drawn from each of the 5 strata,
    defined as units 1–10, 11–20, 21–30, 31–40, and 41–50. Find the design-based bias
    and MSE of $\bar{y}_{str} = \frac{1}{5} \sum_{h=1}^{5} \bar{y}_h$; that is, find
    $E(\bar{y}_{str} - \bar{y}_U)$ and $E[(\bar{y}_{str} - \bar{y}_U)^2]$.
    Now suppose that the population $Y_1, Y_2, \ldots, Y_N$ is itself a random sample of size N = 50
    from a normally distributed superpopulation with mean $\mu = 15$ and variance $\sigma^2 = 1$.
    (c) Find the model-based bias and MSE of the estimator in part (a), that is, find
    $E_M(\bar{y} - \bar{y}_U)$ and $E_M[(\bar{y} - \bar{y}_U)^2]$.
    (d) Find the model-based bias and MSE of the estimator in part (b), that is, find
    $E_M(\bar{y}_{str} - \bar{y}_U)$ and $E_M[(\bar{y}_{str} - \bar{y}_U)^2]$.
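
As a rough sketch of the design-based calculations in parts (a) and (b): both estimators are design-unbiased, so the MSE equals the design variance. The helper names `srs_var` and `strat_var` are illustrative, and the simulated vector `y_demo` is only a hypothetical stand-in for the values in FinalPop.csv.

```r
# Design-based variance of the sample mean under SRS without replacement.
srs_var <- function(S2, n, N) (1 - n / N) * S2 / n

# Design-based variance of the stratified mean sum_h (N_h / N) * ybar_h,
# with n_h units drawn by SRS from each stratum.
strat_var <- function(y, stratum, n_h) {
  N   <- length(y)
  N_h <- tapply(y, stratum, length)   # stratum sizes
  S2h <- tapply(y, stratum, var)      # within-stratum variances S_h^2
  sum((N_h / N)^2 * (1 - n_h / N_h) * S2h / n_h)
}

# Values quoted in the problem statement: S^2 = 0.977457, n = 10, N = 50.
mse_srs <- srs_var(S2 = 0.977457, n = 10, N = 50)

# Hypothetical stand-in population (NOT the FinalPop.csv values).
set.seed(1)
y_demo  <- rnorm(50, mean = 15, sd = 1)
stratum <- rep(1:5, each = 10)        # units 1-10, 11-20, ..., 41-50
mse_str_demo <- strat_var(y_demo, stratum, n_h = 2)
```

With five equal-sized strata the weights N_h / N all equal 1/5, so `strat_var` matches the estimator defined in part (b). Replacing `y_demo` with the actual population column would give the exam quantity.
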
  2. The data for this problem are in the SDaA package. Enter
    library(SDaA)
    Data <- counties[,c(2,3,5,17)]
    and Data will contain total population and number of veterans for a random sample of
    100 of the 3141 counties in the United States. The total population at the time of this
    data set was 255,077,536.
    (a) Using ratio estimation, find an approximate 95% confidence interval for the total
    number of veterans in the United States. Report your answer in millions of veterans,
    rounded to the nearest 10,000.
    (b) Using regression estimation, find an approximate 95% confidence interval for the total
    number of veterans in the United States. Report your answer in millions of veterans,
    rounded to the nearest 10,000.
    (c) Assume the population values are themselves a random sample from a superpopulation
    in which
    $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
    where $E(\varepsilon_i) = 0$ and $V(\varepsilon_i) = \sigma^2$. Find the model-based standard error of the
    estimate you computed for part (b). How does it compare to the design-based SE?
    (d) Assume the population values are themselves a random sample from a superpopulation
    in which
    $Y_i = \beta_1 x_i + \varepsilon_i$
    where $E(\varepsilon_i) = 0$ and $V(\varepsilon_i) = \sigma^2 x_i$. Find the model-based standard error of the
    estimate you computed for part (a). How does it compare to the design-based SE?
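
A minimal sketch of the design-based ratio estimate for part (a), assuming an SRS and the usual linearization variance based on the residuals e_i = y_i - B x_i. The function name `ratio_total_ci` and the simulated vectors `x_demo` and `y_demo` are hypothetical; they are not the SDaA values.

```r
# Ratio estimate of a population total with an approximate normal-theory CI.
# x, y : sample values of the auxiliary and study variables
# t_x  : known population total of x;  N : population size
ratio_total_ci <- function(x, y, t_x, N, level = 0.95) {
  n     <- length(y)
  B_hat <- sum(y) / sum(x)            # estimated ratio
  e     <- y - B_hat * x              # residuals about the ratio line
  t_hat <- B_hat * t_x                # ratio estimate of the total of y
  se    <- t_x * sqrt((1 - n / N) * var(e) / n) / mean(x)
  z     <- qnorm(1 - (1 - level) / 2)
  c(estimate = t_hat, se = se, lower = t_hat - z * se, upper = t_hat + z * se)
}

# Hypothetical stand-in data (NOT the counties sample).
set.seed(2)
x_demo <- round(rexp(100, rate = 1 / 80000)) + 1000
y_demo <- round(0.1 * x_demo + rnorm(100, sd = 0.01 * x_demo))
ci_demo <- ratio_total_ci(x_demo, y_demo, t_x = 255077536, N = 3141)
round(ci_demo / 1e6, 2)               # millions, to the nearest 10,000
```

With the real data, `x` and `y` would be the total-population and veterans columns of `Data`; the column names are not given in the problem, so they are left unspecified here.
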
  3. A fisherman is interested in N, the number of fish in a certain pond. He catches 100 fish,
    tags them, and throws them back. A few days later, he returns and catches 80 fish, of
    which 18 are tagged.
    (a) Find the maximum likelihood estimate of N, along with its standard error.
    (b) Explain in plain English what your answer to part (a) means, particularly the standard
    error. The fisherman does not know or care about maximum likelihood theory,
    or sampling distributions, or any such things — he just wants to know how many
    fish are in the pond. Help him.
    (c) Find an approximate 90% confidence interval for N by inverting the acceptance region
    of a level .10 Pearson’s chi-square test for independence between the binary variables
    “In first day’s catch” and “In second day’s catch”.
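
One way the calculations in parts (a) and (c) might be sketched, using the closed-form estimate n1 n2 / m and a delta-method standard error that treats the second catch as a binomial sample (an assumption, not necessarily the intended derivation). The search grid's upper limit of 2000 is an arbitrary choice for illustration.

```r
n1 <- 100   # fish tagged on the first day
n2 <- 80    # fish caught on the second day
m  <- 18    # tagged fish among the second catch

# Closed-form estimate and a large-sample (delta-method) standard error.
N_hat  <- n1 * n2 / m
se_hat <- sqrt(n1^2 * n2 * (n2 - m) / m^3)

# Pearson chi-square statistic for the 2x2 table implied by a candidate N.
pearson_x2 <- function(N) {
  obs  <- matrix(c(m, n1 - m, n2 - m, N - n1 - n2 + m), nrow = 2)
  expd <- outer(rowSums(obs), colSums(obs)) / N   # expected counts
  sum((obs - expd)^2 / expd)
}

# Keep every N that the level-0.10 test does not reject.
N_grid <- (n1 + n2 - m):2000
x2     <- sapply(N_grid, pearson_x2)
ci_90  <- range(N_grid[x2 <= qchisq(0.90, df = 1)])
```

The smallest admissible N is n1 + n2 - m, the number of distinct fish actually seen, which is why the grid starts there.
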
  4. The file statepop.csv, available in the “Data” folder on Courseworks, lists the 1992
    population for the 50 states plus the District of Columbia. The file counties.csv contains
    the number of counties for a sample of 12 states drawn with replacement, with probabilities
    proportional to population.
    (a) Estimate the total number of counties in the United States, and find the standard
    error of your estimate.
    (b) With California being sampled three times and New Jersey twice, there were nine
    distinct states in the sample. Writing your estimate in part (a) as
    $\hat{t} = \sum_{i \in R} w_i Q_i t_i$
    with R = {CA, CO, CT, MA, MO, NJ, TN, VA, WI} and $t_i$ = number of counties
    in state i, find $w_i Q_i$ for each of those nine states. What is $\sum_{i \in R} w_i Q_i$
    for this sample? What is its expected value over repeated random sampling?
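
A small sketch of the with-replacement estimator, assuming w_i = 1 / (n psi_i) with psi_i the draw probability and Q_i the number of times unit i is drawn. The four draws below are toy values invented for illustration, not the state data, and the helper name `hh_estimate` is hypothetical.

```r
# Estimate of a total from n with-replacement draws taken with
# probabilities psi_i, with its standard error.
hh_estimate <- function(t_i, psi_i) {
  n <- length(t_i)
  u <- t_i / psi_i                    # one estimate of the total per draw
  c(estimate = mean(u), se = sqrt(var(u) / n))
}

# Toy draws (hypothetical): unit "A" is selected twice.
id  <- c("A", "B", "A", "C")
psi <- c(0.5, 0.3, 0.5, 0.2)
t_i <- c(12, 6, 12, 3)

est <- hh_estimate(t_i, psi)

# w_i * Q_i for each distinct unit: sum 1 / (n * psi_i) over its draws.
wQ         <- tapply(1 / (length(id) * psi), id, sum)
t_distinct <- tapply(t_i, id, mean)   # t_i is constant within a unit
same_total <- sum(wQ * t_distinct)    # equals est[["estimate"]]
```

Summing `wQ` gives the sample's estimate of the number of units in the population, which is the quantity part (b) asks about.
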
  5. Investigators selected a random sample of 200 teenagers from a population of 2000 for a
    survey of screen time on smartphones and other handheld devices; the overall response rate
    was 75%. A follow-up sample was taken of 10 of the 50 nonrespondents, with responses
    obtained from all 10.
    In the data file ScreenTime.csv, the variable Group takes the value 1 for respondents, 2
    for nonrespondents included in the follow-up survey, and 3 for nonrespondents not in the
    follow-up sample; the variable Minutes gives that individual’s average daily screen time
    in minutes.
    Give an approximate 95% confidence interval for the average screen time per day among
    these 2000 teenagers.
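
One possible starting point, treating respondents and nonrespondents as fixed strata of the population (an assumption) and the follow-up as an SRS of the nonrespondents. The symbols $n_R = 150$, $n_M = 50$, $m = 10$, $\bar{y}_R$, $\bar{y}_M$ and $S_M^2$ are notation introduced here, not taken from the problem.

```latex
% Point estimate: weight each group mean by its share of the n = 200 sample
\hat{\bar{y}} = \frac{n_R}{n}\,\bar{y}_R + \frac{n_M}{n}\,\bar{y}_M
              = 0.75\,\bar{y}_R + 0.25\,\bar{y}_M

% Variance: first-phase SRS component plus follow-up subsampling component,
% where S_M^2 is the variance among the n_M first-phase nonrespondents
V(\hat{\bar{y}}) = \left(1 - \frac{n}{N}\right)\frac{S^2}{n}
  + E\!\left[\left(\frac{n_M}{n}\right)^2
      \left(1 - \frac{m}{n_M}\right)\frac{S_M^2}{m}\right]

% Coefficients with n = 200, N = 2000, n_M = 50, m = 10
\left(1 - \frac{200}{2000}\right)\frac{1}{200} = 0.0045,
\qquad
\left(\frac{50}{200}\right)^2\left(1 - \frac{10}{50}\right)\frac{1}{10} = 0.005
```

Both $S^2$ and $S_M^2$ still have to be estimated from the data in ScreenTime.csv before an interval can be formed.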