Introduction to Statistical Sampling

Learn the fundamentals of statistical sampling, major probability and non-probability techniques, and how they underpin confidence intervals and margin of error.

Summary

Sampling: Foundations and Methods

What is Sampling?

Sampling is the process of selecting a smaller subset of observations from a larger group in order to make inferences about the entire group. Rather than examining every single member of a population, which is often impractical, expensive, or impossible, statisticians use carefully designed sampling methods to gather information that represents the whole.

Think of sampling like quality control at a chocolate factory: rather than inspecting every single chocolate produced, an inspector might randomly select a few boxes each hour to verify that the product meets standards. The results from those few boxes inform conclusions about all the chocolate made during that shift.

Population Versus Sample

Two fundamental terms form the foundation of sampling:

Population is the entire group of interest: all members that share a common characteristic you want to study. A population is often theoretical and may be enormous. For example, the population might be all registered voters in a country, all customers who have ever purchased from a company, or all manufactured parts from a production line.

Sample is the subset of the population that you actually observe and measure. The sample is the data you collect.

Continuing our factory example: the population is all chocolate made during the shift, while the sample is the boxes you actually tested.

The key relationship is this: you collect data from the sample and use it to estimate characteristics of the population.

The Principle of Representative Sampling

The central principle of sampling is that a carefully chosen sample tends to have properties that closely resemble those of the population. This resemblance allows sample statistics (like the sample mean or sample proportion) to serve as reliable estimates of population parameters (like the population mean or population proportion).

What Makes a Sample Representative?

A representative sample is one that accurately reflects the composition and characteristics of the population. In other words, it's a microcosm of the population. To judge whether a sample is representative, you need to distinguish two concepts:

Bias is systematic error. It occurs when your sample systematically differs from the population in ways that lead to consistently wrong estimates. For example, if you survey voters only by visiting shopping malls on weekday afternoons, you'll systematically miss most people who work during the day, creating a biased sample that misrepresents the voting population.

Sampling error is random variation. Even with a perfectly designed sampling method, your sample statistics won't exactly match the population parameters; they'll vary randomly from sample to sample. This is natural and expected. The good news: sampling error decreases as your sample size increases.

This distinction is crucial: bias is a problem with your sampling method, while sampling error is an inevitable consequence of sampling rather than surveying everyone. A larger sample shrinks sampling error, but it does not remove bias.

Sample Size Matters

A key insight: larger samples generally reduce sampling error. This is why election polls with 1,000 respondents are more precise than those with 100. However, this doesn't mean you always need enormous samples. A modestly sized sample of 200 to 500 respondents, if selected properly through a random method, can provide surprisingly informative results. What matters is not just size, but the method used to select the sample.

Probability Sampling Methods

Probability sampling methods use randomness to ensure each population member has a known, nonzero (and typically equal) chance of being selected. These methods are preferred because they reduce bias and allow us to calculate sampling error mathematically.
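
As a rough sketch of how the probability methods in this section (simple random, stratified, cluster, and systematic sampling) might look in code, the following Python uses only the standard library. The population, the `age_group` field, the neighbourhood clusters, and all function names are hypothetical, chosen purely for illustration.

```python
import random
from collections import defaultdict

def simple_random_sample(population, n, rng):
    """Draw n distinct members; every subset of size n is equally likely."""
    return rng.sample(population, n)

def stratified_sample(population, stratum_of, n, rng):
    """Proportional allocation: sample each stratum in proportion to its size."""
    strata = defaultdict(list)
    for member in population:
        strata[stratum_of(member)].append(member)
    sample = []
    for members in strata.values():
        # round() may make the total differ slightly from n for awkward proportions
        k = round(n * len(members) / len(population))
        sample.extend(rng.sample(members, k))
    return sample

def cluster_sample(clusters, n_clusters, rng, per_cluster=None):
    """One-stage if per_cluster is None (take everyone); otherwise two-stage."""
    chosen = rng.sample(list(clusters), n_clusters)
    sample = []
    for cluster_id in chosen:
        members = clusters[cluster_id]
        if per_cluster is None:
            sample.extend(members)
        else:
            sample.extend(rng.sample(members, min(per_cluster, len(members))))
    return sample

def systematic_sample(ordered_population, k, rng):
    """Random start among the first k positions, then every k-th element."""
    start = rng.randrange(k)
    return ordered_population[start::k]

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1,000 people, 30% "young" and 70% "older".
population = [
    {"id": i, "age_group": "young" if i < 300 else "older"} for i in range(1000)
]
# Hypothetical clusters: 50 neighbourhoods of 20 people each.
clusters = {c: population[c * 20:(c + 1) * 20] for c in range(50)}

srs = simple_random_sample(population, 100, rng)
strat = stratified_sample(population, lambda p: p["age_group"], 100, rng)
one_stage = cluster_sample(clusters, 5, rng)
two_stage = cluster_sample(clusters, 5, rng, per_cluster=4)
syst = systematic_sample(population, 5, rng)
```

Note that only the stratified sample is guaranteed to contain exactly 30 "young" members out of 100; the simple random sample will be close to 30 on average but varies from draw to draw.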

Simple Random Sampling

In a simple random sample, every individual in the population has an equal chance of being selected, and the selection of one person doesn't affect the selection of another (selections are independent). Strictly speaking, when individuals are drawn without replacement the draws are only approximately independent, but the approximation is very good when the sample is small relative to the population.

This is the most straightforward probability method. Imagine putting all population members' names in a hat, mixing thoroughly, and drawing names without looking. In practice, computers generate random numbers to accomplish this.

Why use it? Simple random sampling is unbiased and straightforward. It guarantees that every possible sample of a given size has an equal probability of being chosen.

Limitation: It requires a complete list of the population (called a "sampling frame"), which is sometimes unavailable.

Stratified Sampling

Stratified sampling works differently. First, you divide the population into subgroups called strata (the singular is "stratum"). Each stratum shares some important characteristic. For example, you might divide voters into age groups, or a company's customers into geographic regions.

Then, you randomly sample from within each stratum. Usually, you sample proportionally to the stratum's size. If young adults represent 30% of the population, they would make up 30% of your total sample.

Why use it? Stratified sampling ensures that important subgroups are represented in your sample, and it typically increases the precision of your estimates compared to simple random sampling. If you're studying average income and income varies dramatically by education level, stratifying by education ensures each education group is represented proportionally.

Cluster Sampling

Cluster sampling divides the population into clusters (geographic areas, schools, or other natural groupings). Unlike stratified sampling, which samples from every stratum, you randomly select only some clusters and include all members from those clusters (or sample within them).

In one-stage cluster sampling, once you select the clusters, you survey everyone in those clusters.
In two-stage cluster sampling, you randomly select clusters, then randomly sample individuals within each selected cluster.

Why use it? Cluster sampling is invaluable when you lack a complete list of population members but do have a list of clusters. For example, to survey households in a country, you might lack a complete list of all households, but you have a list of neighborhoods. You randomly select some neighborhoods, then survey households within those neighborhoods.

Trade-off: Cluster sampling typically has more sampling error than simple random sampling (because people within a cluster tend to be similar to each other), but it's much cheaper and more practical when population lists don't exist.

Systematic Sampling

Systematic sampling provides an easy-to-implement alternative. First, you order the population list. Then, you randomly select a starting point among the first $k$ elements and select every $k$-th element from that point onward (where $k$ is some fixed interval).

For example, from a list of 1,000 people with $k = 5$, you might randomly select a starting point among the first five (say, person #3) and then select every 5th person thereafter: persons 3, 8, 13, 18, and so on, for a sample of 200.

Why use it? It's simple to implement and often works well in practice.

Potential pitfall: Systematic sampling can produce biased results if the population list has a hidden pattern that aligns with your selection interval. For instance, if you're selecting every 7th person from a list where people are ordered by households (with 7 people per household), you'd repeatedly sample the same position within households, potentially biasing your sample. Always check whether the list's ordering might create problems.

Non-Probability Sampling Methods

Non-probability sampling methods do not use randomness to ensure each population member has a known chance of selection. While easier and cheaper, they are prone to bias.

Convenience sampling selects participants who are easy to reach, for example by surveying shoppers at the mall nearest your office.
It's inexpensive but highly prone to bias because those who happen to be convenient to reach may differ systematically from the broader population.

Quota sampling divides the population into groups and fills predetermined numbers (quotas) for each group, for example, "I need 50 men and 50 women." While this ensures group representation, the method is non-random because you choose which men and women to include based on convenience.

Snowball sampling relies on existing participants to recruit additional participants, creating a chain-referral process. It's useful for hard-to-reach populations (homeless individuals, undocumented immigrants) but highly subject to bias because networks tend to be self-similar.

A Critical Limitation

The fundamental problem with non-probability samples is that they lack the randomness needed for formal statistical inference. You cannot reliably construct confidence intervals or conduct hypothesis tests using non-probability samples because you have no way to calculate sampling error. These methods may provide descriptive information about your sample, but not trustworthy estimates of population parameters.

Key Statistical Concepts from Sampling

Sampling Distribution and the Central Limit Theorem

When you repeat a sampling procedure many times, each sample produces a different statistic (like a sample mean). The sampling distribution is the distribution of these statistics across all possible samples of the same size.

Here's the remarkable part: the Central Limit Theorem tells us that, under general conditions, the sampling distribution of the sample mean is approximately normal (bell-shaped) for large enough samples, even if the population itself is not normal.

Why does this matter? Normality allows us to quantify uncertainty using standard statistical tools and formulas. We can calculate how far our sample statistic is likely to be from the true population parameter.
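
The Central Limit Theorem can be checked informally with a small simulation. The sketch below assumes a strongly right-skewed population (exponential with mean 1, an arbitrary choice for illustration) and draws many samples of several sizes; the `sample_means` and `skewness` helpers are hypothetical names, not part of any library. For this population, theory says the sample means should centre on 1 with standard deviation $1/\sqrt{n}$, and their skewness should shrink toward 0 as $n$ grows.

```python
import random
import statistics

def sample_means(n, reps, rng):
    """Means of `reps` independent samples of size n from an Exponential(1) population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

def skewness(values):
    """Standardised third moment: 0 for a symmetric (e.g. normal) distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

rng = random.Random(0)  # fixed seed so the sketch is reproducible
results = {}
for n in (2, 30, 200):
    means = sample_means(n, 5000, rng)
    results[n] = {
        "mean": statistics.fmean(means),
        "sd": statistics.stdev(means),
        "skew": skewness(means),
    }
    print(n, {key: round(value, 3) for key, value in results[n].items()})
```

The printed standard deviations also illustrate sampling error directly: they fall as the sample size rises, even though the population never changes.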

Confidence Intervals

A confidence interval is a range of values constructed from sample data that, with a stated confidence level, is expected to contain the true population parameter.

For example, a political poll might report: "Candidate A will receive 52% of the vote, plus or minus 3 percentage points, with 95% confidence." The interval is 49% to 55%.

Interpreting "95% Confidence"

This is frequently misunderstood. A 95% confidence interval does NOT mean there's a 95% probability that the true parameter lies in that particular interval. Instead, it means: if we repeated the sampling process many times and constructed an interval each time using the same method, approximately 95% of those intervals would contain the true parameter.

Think of it this way: the method is 95% reliable. Any single interval might or might not contain the true parameter, but the method itself is right 95% of the time across many repetitions.

Margin of Error

The margin of error is the "plus or minus" value quoted in poll results. It measures the likely extent of sampling error at the stated confidence level. A margin of error of ±3 percentage points means the true population proportion is probably within 3 points of what the sample showed.

Crucially: the margin of error decreases as the sample size increases. With more data, your estimates become more precise. The margin of error is approximately inversely proportional to the square root of the sample size: doubling your sample size reduces the margin of error by a factor of $\sqrt{2}$ (about 1.41), so halving the margin of error takes roughly four times as many respondents.
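
To make the square-root relationship concrete, here is a minimal sketch using the standard normal-approximation formula for a sample proportion, $z\sqrt{\hat{p}(1-\hat{p})/n}$, with $z = 1.96$ for 95% confidence. The summary above does not state which formula a given poll uses, so treat this as one common choice; the function names and the poll of 1,000 respondents are assumptions for illustration.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Normal-approximation margin of error for a sample proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

def sample_size_for(moe, p_hat=0.5, z=1.96):
    """Smallest n whose margin of error is at most `moe` (p_hat=0.5 is the worst case)."""
    return math.ceil(z**2 * p_hat * (1 - p_hat) / moe**2)

# Hypothetical poll: 52% support among 1,000 respondents.
p_hat, n = 0.52, 1000
moe = margin_of_error(p_hat, n)
interval = (p_hat - moe, p_hat + moe)

# Doubling the sample size shrinks the margin of error by a factor of sqrt(2).
ratio = margin_of_error(p_hat, n) / margin_of_error(p_hat, 2 * n)

print(f"margin of error: {moe:.4f}")
print(f"95% interval: {interval[0]:.3f} to {interval[1]:.3f}")
print(f"shrink factor when n doubles: {ratio:.3f}")
print(f"n needed for a 3-point margin: {sample_size_for(0.03)}")
```

Under these assumptions, a poll of about 1,000 people gives a margin of error close to 3 percentage points, which is consistent with the ±3 figure in the polling example.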

Flashcards

What is the definition of sampling?

The process of selecting a smaller set of observations from a larger group to make inferences about the whole group.

What are the two main components involved in sampling?

Population and sample.

What is the primary goal of sampling?

To estimate population characteristics (averages, proportions, or variances) without measuring every member.

What is the general principle regarding a carefully chosen sample?

It will have properties that closely resemble those of the population.

How is a population defined in the context of sampling?

The entire group of interest.

How is a sample defined in the context of sampling?

The subset of members actually observed from the population.

What defines a representative sample?

It accurately reflects the composition of the population.

What is bias in sampling?

A systematic error occurring when a sample does not accurately represent the population.

What is the definition of sampling error?

Random variation arising because a sample includes only part of the population, so sample statistics differ from sample to sample.

How does increasing the sample size affect sampling error?

It generally reduces sampling error.

What are the two requirements for a simple random sample?

Every individual has an equal chance of being selected. Selections are independent.

How is stratified sampling performed?

Divide the population into subgroups sharing a characteristic and draw a random sample from each.

What is the basic procedure for cluster sampling?

Group the population into clusters and randomly select a set of clusters to survey.

What is the procedure for systematic sampling?

Select every $k$-th element from an ordered list after a random start.

Under what condition can systematic sampling produce biased results?

If the ordering of the list contains a hidden pattern aligning with the selection interval.

What is the definition of convenience sampling?

Selecting participants who are easy to reach without using random selection.

How does quota sampling ensure specific subgroup sizes?

By filling predetermined numbers for certain groups without using random selection.

Why are formal inferences unreliable with non-probability samples?

Because without random selection there is no way to calculate sampling error, so confidence intervals and hypothesis tests lack a valid basis.

What is a sampling distribution?

The distribution of a statistic (e.g., sample mean) over many repeated random samples.

What effect does the Central Limit Theorem have on the sampling distribution of the mean?

For large enough samples, the sampling distribution of the mean is approximately normal, even if the population itself is not.

What is the definition of a confidence interval?

A range constructed from the sample expected to contain the true population parameter at a stated confidence level.

What does a 95% confidence interval signify if the sampling process were repeated many times?

Approximately 95% of the intervals would contain the true population value.

What does the margin of error express?

The likely extent of sampling error, quoted as the "plus or minus" value around an estimate.

What is the relationship between the margin of error and sample size?

The margin of error decreases as the sample size increases, roughly in proportion to one over the square root of the sample size.

Key Concepts

Sampling Methods

Sampling

Representative sample

Sampling bias

Simple random sample

Stratified sampling

Cluster sampling

Systematic sampling

Statistical Concepts

Sampling distribution

Central limit theorem

Confidence interval

Margin of error

Definitions

Sampling

The process of selecting a subset of observations from a larger population to infer characteristics of the whole.

Representative sample

A sample that accurately mirrors the composition and diversity of the target population.

Sampling bias

Systematic error introduced when a sample does not faithfully represent the population, leading to distorted conclusions.

Simple random sample

A sampling method where every individual in the population has an equal and independent chance of selection.

Stratified sampling

A technique that divides the population into homogeneous subgroups (strata) and draws random samples from each stratum.

Cluster sampling

A method that groups the population into clusters, randomly selects whole clusters, and surveys all or a sample of members within them.

Systematic sampling

A procedure that selects every $k$-th element from an ordered list after a random start, simplifying implementation.

Sampling distribution

The probability distribution of a statistic (e.g., sample mean) obtained from repeated random samples of the same size.

Central limit theorem

A statistical principle stating that the sampling distribution of the sample mean approaches a normal distribution as sample size grows, regardless of the population’s shape.

Confidence interval

A range calculated from sample data that, with a specified confidence level, is expected to contain the true population parameter.

Margin of error

A measure of the likely extent of sampling error: the largest difference expected, at a stated confidence level, between a sample estimate and the true population value, typically expressed as “plus or minus” a number of percentage points.
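
The confidence interval and margin of error entries can be tied together with a coverage check: draw many simple random samples from a population whose true proportion is known, build a 95% interval from each, and count how often the interval contains the truth. This is a sketch under assumptions (a true proportion of 0.52, samples of 1,000, and the normal-approximation interval); `proportion_ci` and `coverage` are hypothetical helper names.

```python
import math
import random

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    moe = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - moe, p_hat + moe

def coverage(true_p, n, reps, rng):
    """Fraction of `reps` simulated intervals that contain the true proportion."""
    hits = 0
    for _ in range(reps):
        # Each respondent independently answers "yes" with probability true_p.
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = proportion_ci(successes, n)
        hits += low <= true_p <= high
    return hits / reps

rng = random.Random(1)  # fixed seed so the sketch is reproducible
rate = coverage(true_p=0.52, n=1000, reps=2000, rng=rng)
print(f"share of intervals containing the true value: {rate:.3f}")
```

The share should land near 0.95, which is the sense in which the method, rather than any single interval, is "95% reliable".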