Sampling (Statistics): Survey Design, Errors, Weights, and Random Generation
Understand how to generate random samples, identify and mitigate survey errors and biases, and apply appropriate weighting.
Summary
Understanding Sample Selection and Survey Quality
Introduction
Whether you're collecting data through a survey or designing an experiment, how you select your sample fundamentally affects the quality of your results. This unit covers the practical mechanics of random sampling, the choices you make in sample design, and—critically—the errors and biases that can creep into your data even with careful planning. Understanding these concepts will help you recognize which estimates are trustworthy and which require careful interpretation.
Generating Random Samples
To select a truly unbiased sample, you need a method that doesn't introduce personal preference or systematic patterns into your selection process.
Random Number Tables provide one straightforward approach. Published tables contain numbers arranged randomly, allowing researchers to use them as a neutral tool for sample selection. For example, if you have a population of 500 people numbered 001 through 500, you could randomly open the table and read down the columns, selecting only those numbers that fall between 001 and 500.
Pseudo-Random Number Generators are mathematical algorithms built into computers that create sequences of numbers that behave like random numbers for practical purposes. These are especially useful for large samples, where hand-selecting elements from a random number table would be tedious.
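As a minimal sketch using Python's standard `random` module, a pseudo-random generator can replace a printed table for the 500-person example above (the seed value here is arbitrary):

```python
import random

# Population of 500 people, numbered 1 through 500
population = list(range(1, 501))

# Seeding the generator makes the pseudo-random sequence reproducible
rng = random.Random(42)

# Draw 10 distinct IDs for the sample
sample_ids = rng.sample(population, k=10)
print(sample_ids)
```

Because the generator is seeded, rerunning this code yields the same sample, which makes the selection process auditable.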
Key Sampling Design Choices
Once you've committed to random selection, two important decisions remain: whether elements can be selected more than once, and how large your sample should be.
With-Replacement versus Without-Replacement Sampling
In with-replacement sampling, once you randomly select an element, you "put it back" before the next selection. This means the same person, household, or object can theoretically appear in your sample multiple times.
In without-replacement sampling, once an element is selected, it's removed from the pool and cannot be chosen again in that particular sample. This is more common in practice because you typically want each sample unit to represent itself once.
The choice between these approaches matters mathematically but is often decided by practical considerations: without-replacement is more efficient for data collection since you're not contacting the same person twice.
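The two designs map directly onto two standard-library functions in Python, shown here as an illustrative sketch with a small made-up population:

```python
import random

population = ["A", "B", "C", "D", "E"]
rng = random.Random(0)

# With replacement: the same element may appear more than once
with_repl = rng.choices(population, k=5)

# Without replacement: every element appears at most once
without_repl = rng.sample(population, k=5)

print(with_repl)
print(without_repl)  # a permutation of the population when k equals its size
```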
Determining Sample Size
You cannot simply guess at an appropriate sample size. Instead, you must specify three things before consulting a sample-size table:
Effect size: How large a difference or relationship do you want to detect?
Significance level (alpha): What is your acceptable probability of incorrectly rejecting a true null hypothesis? (Usually 0.05)
Power (1 – beta): What probability of detecting a true effect, when one exists, do you require? (Usually 0.80 or higher)
Once you've decided on these values, you locate the row corresponding to your desired power level, find the column matching your estimated effect size, and read off the minimum required sample size at the intersection.
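Sample-size tables for comparing two group means are typically built from a formula like the following normal-approximation sketch (exact table values may differ slightly, since printed tables often use the t distribution rather than the normal):

```python
import math
from statistics import NormalDist

def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means, using the normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.80
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# Medium effect (d = 0.5), alpha = 0.05, power = 0.80
print(n_per_group(0.5))  # 63 per group
```

Note how the required sample size grows as the effect size shrinks: detecting a subtle difference demands far more data than detecting a large one.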
Errors and Biases in Sample Surveys
Even with perfect random selection, your results can be wrong. Understanding where errors come from—and which ones you can control—is essential.
Sampling Errors versus Selection Bias
These terms describe related but distinct problems.
Random sampling error is the natural variation that occurs simply because you're working with a sample rather than the entire population. If you randomly selected different people, you'd get slightly different results. This is expected and unavoidable—but it decreases as your sample size increases, which is why larger samples are more reliable.
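This shrinking of random sampling error can be demonstrated with a small simulation (the population parameters below are invented for illustration): the spread of sample means falls roughly in proportion to the square root of the sample size.

```python
import random
from statistics import mean, stdev

rng = random.Random(1)
# Synthetic population: 100,000 values with mean 50, sd 10
population = [rng.gauss(50, 10) for _ in range(100_000)]

# Draw 500 samples at each size and measure how much the sample means vary
se = {}
for n in (25, 100, 400):
    sample_means = [mean(rng.sample(population, n)) for _ in range(500)]
    se[n] = stdev(sample_means)
    print(n, round(se[n], 2))
```

Quadrupling the sample size roughly halves the spread of the sample means, which is why larger samples give more reliable estimates.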
Selection bias is different and more serious: it occurs when your actual selection probabilities differ from those you assumed. For example, if you attempted to use random-digit-dialing to survey households but systematically missed households without landline phones, you've introduced selection bias because those households had zero probability of selection when you assumed they had equal probability to all others.
Non-Sampling Errors
Beyond sampling errors, several other problems can distort your data:
Coverage errors arise from mismatches between your sampling frame (the list you select from) and the actual population you want to study.
Over-coverage means your sampling frame includes people or units outside your target population. For instance, if you try to survey "current employees" but your list includes people who recently left, you have over-coverage.
Under-coverage occurs when your sampling frame omits elements that belong to the target population. Historical examples include survey frames based on telephone directories, which systematically excluded people without listed numbers.
Measurement error happens when the actual responses differ from the truth because respondents misunderstand questions, lack information, or find certain topics uncomfortable to discuss honestly.
Processing error is introduced by your team's mistakes—incorrect data coding, entry errors, or miscalculation—rather than respondent error.
Non-response bias is the distortion that occurs when people who don't respond to your survey differ systematically from those who do. If outgoing, confident people are more likely to respond than introverted people, any question about social comfort will be biased.
Understanding Non-Response
Non-response comes in two varieties, and they require different solutions.
Unit non-response occurs when you never hear from a selected individual at all—they refuse the survey, you can't locate them, or they're unavailable during the survey period. These people contribute no data whatsoever.
Item non-response is more limited: a participant responds to your survey but skips one or more specific questions, perhaps because they find them intrusive or unclear.
Strategies for Mitigating Non-Response
Since non-response bias can seriously distort results, invest in preventing it:
Improve survey design by using clear language, logical question flow, and professional administration. Offer incentives—even modest ones like gift cards or entry into a prize drawing—to increase participation.
Conduct follow-up attempts to reach non-respondents via different methods or times. Additionally, when you do reach non-respondents, collect brief data that lets you compare them to respondents (for example, a single key question such as "Are you generally satisfied with your job?"). This reveals whether non-respondents differ meaningfully.
Use weighting adjustments when you have population benchmarks (the true percentages of different groups in your population). If your final sample has too few young adults, you can statistically inflate young adult responses to match the true population proportion.
Apply imputation methods for item non-response by using statistical techniques to estimate missing answers based on patterns in responses to related questions or similar respondents' answers.
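The weighting adjustment above amounts to a simple ratio: each group's weight is its population share divided by its sample share. A sketch with invented group names and percentages:

```python
# Hypothetical benchmarks: young adults are 30% of the population
# but only 15% of the realized sample.
population_share = {"young": 0.30, "older": 0.70}
sample_share = {"young": 0.15, "older": 0.85}

# Weight = population share / sample share
weights = {g: population_share[g] / sample_share[g] for g in population_share}
print(weights)  # young responses inflated, older responses deflated

# Weighted estimate: each response counts in proportion to its group's weight
responses = [("young", 1), ("young", 0), ("older", 1), ("older", 1)]
num = sum(weights[g] * y for g, y in responses)
den = sum(weights[g] for g, y in responses)
print(num / den)  # weighted proportion of "yes" answers
```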
Survey Weights: Correcting for Unequal Selection
Sometimes different groups in your population have different probabilities of being selected—whether by design or accident. Survey weights are numerical adjustments that restore proper representation.
When Weights Are Necessary
Apply survey weights whenever different units had unequal probabilities of selection. Consider a few scenarios:
Stratified sampling with unequal representation. If you intentionally oversampled a rural population because it's geographically dispersed, those rural responses would otherwise have too much influence on your results. Weighting scales them down to their true population proportion.
For example, in a stratified design with urban and rural strata, a stratum is weighted up (multiplied by a factor greater than 1) if it was undersampled, or weighted down if it was oversampled, so that it reflects its true population proportion.
Household surveys with one respondent per household. When you interview only one person per household, individuals from households with five people had one-fifth the chance of selection compared to individuals in single-person households. Responses from those five-person households must be weighted up to correct this mathematical inequality.
Telephone surveys with multiple lines. Households with two or three telephone lines have a higher probability of being reached by random-digit-dialing. These households have an inherent advantage in selection and must be weighted down to give all households equal influence on the final results.
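All three scenarios reduce to the same rule: a unit's design weight is the inverse of its selection probability. A sketch with invented probabilities:

```python
# Design weight = 1 / (probability of selection)
def design_weight(p_selection: float) -> float:
    return 1.0 / p_selection

# One respondent per household: a person in a 5-person household is
# selected with one-fifth the within-household probability of a person
# living alone, so their response carries 5x the relative weight.
print(design_weight(1 / 5) / design_weight(1 / 1))  # 5.0

# Random-digit-dialing: a household with 2 phone lines is twice as
# likely to be reached, so it is weighted down by half.
print(design_weight(2 / 100) / design_weight(1 / 100))  # 0.5
```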
The Broader Role of Weights
Beyond correcting selection probabilities, survey weights serve an additional critical function: they can compensate for non-response bias by effectively inflating the influence of respondents who share characteristics with non-respondents. If women were underrepresented among respondents, women's responses receive higher weights, so their opinions carry more statistical influence. This doesn't recover the lost information but can reduce distortion if the assumption holds that non-responding and responding women are similar.
<extrainfo>
Additional Details on Random Number Generation
Pseudo-random number generators, despite their name, are entirely deterministic: given the same starting "seed," they produce identical sequences. This reproducibility is actually useful in research—it means your random selection process is verifiable and others can replicate it. True randomness in a philosophical sense is rare in modern statistics; pseudo-random sequences that pass rigorous statistical tests of randomness are typically sufficient for sampling purposes.
</extrainfo>
Flashcards
What occurs to an element after it is selected in a with-replacement sampling design?
It can be selected more than once in the same sample.
What happens to an element once it has been selected in a without-replacement design?
It cannot be chosen again in that sample.
Which three factors must be determined before consulting a sample-size table?
Desired effect size
Significance level ($\alpha$)
Power ($1 - \beta$)
Where is the minimum required sample size located within a sample-size table?
At the intersection of the power row and the effect size column.
When does selection bias occur in a sample survey?
When actual selection probabilities differ from those assumed in the analysis.
What causes the variation in results known as random sampling error?
The random selection of different sample elements.
What happens during over-coverage in a survey?
Data from outside the intended population are included.
When does under-coverage occur in the sampling process?
When the sampling frame omits elements belonging to the target population.
How is processing error typically introduced into survey data?
Through mistakes in data coding or entry.
What is the primary cause of non-response bias?
Systematic differences between those who respond and those who do not.
What is the difference between unit non-response and item non-response?
Unit is a complete lack of participation; Item is missing specific answers from a participant.
Under what general condition should survey weights be applied?
When different units have unequal probabilities of selection.
In random-digit-dialing, why do households with multiple phone lines require weighting?
To adjust for their higher chance of selection.
How do survey weights compensate for non-response bias?
By inflating the influence of respondents who represent non-responding groups.
Quiz
Question 1: What is a primary advantage of using published random number tables when selecting sample units?
- They allow selection without bias (correct)
- They guarantee a larger sample size
- They speed up data collection
- They improve measurement accuracy
Question 2: In a with-replacement sampling design, which statement is true about selecting elements?
- An element may be selected multiple times (correct)
- Each element can be selected at most once
- Elements are never selected more than once
- Selection probabilities drop to zero after the first draw
Key Concepts
Sampling Techniques
Random number tables
Pseudo‑random number generators
With‑replacement sampling
Sample size determination
Bias in Sampling
Selection bias
Non‑response bias
Survey weighting
Data Handling
Imputation methods
Definitions
Random number tables
Published tables of numbers used to select sample units without bias.
Pseudo‑random number generators
Algorithms that produce sequences of numbers approximating true randomness for sampling.
With‑replacement sampling
A design where each selected element can be chosen again in the same sample.
Sample size determination
The process of calculating the minimum number of observations needed based on effect size, significance level, and statistical power.
Selection bias
Systematic error that occurs when the probability of selecting units differs from what is assumed in analysis.
Non‑response bias
Distortion in survey results caused by failure to obtain data from all selected individuals.
Survey weighting
Adjustments applied to sample data to correct for unequal selection probabilities and improve representativeness.
Imputation methods
Statistical techniques used to estimate missing survey responses based on available data.