# Techniques for random sampling and avoiding bias | Study design | AP Statistics | Khan Academy

Let’s say that we run a school, and in that school there is a population of students. We want to get a sense of how these students feel about the quality of math instruction at the school, so we construct a survey, and we just need to decide who is actually going to answer it.

One option is to just go to every member of the population, but let’s say it’s a really large school. Let’s say we’re a college and there are 10,000 people in the college. We say, well, we can’t just talk to everyone. So instead, we say, let’s sample this population to get an indication of how the entire school feels.

Now, in order to avoid bias in our responses, and to give the sample the best chance of being indicative of the entire population, we want our sample to be random. So our sample could be either random or not random. And it might seem, at first, pretty straightforward to do a random sample, but when you actually get down to it, it’s not always as straightforward as you would think.

So one type of random sample is just a simple random sample. This is saying, alright, let me assign a number to every person in the school, maybe they already have a student ID number, and I’m going to get a computer, a random number generator, to pick the 100 students that I’m going to give the survey to. That would be a simple random sample. We are just going into this whole population and randomly picking people, let me just draw this.

So this is the population, and we are just randomly picking people out. We know it’s random because a random number generator, or a string of random numbers or something like that, is what picks the students. Now that’s pretty good; it’s unlikely that you’re going to have bias from this sample. But there is some probability that, just by chance, your random number generator happens to select a disproportionate number of boys over girls, or a disproportionate number of freshmen, or a disproportionate number of engineering majors versus English majors. So even though you are taking a simple random sample that is truly random, there is still some probability that it’s not indicative of the entire population.
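To make the simple random sample concrete, here is a minimal Python sketch. The ID numbers, the helper name `simple_random_sample`, and the seed are all assumptions made for illustration; the only idea taken from the discussion above is that a random number generator picks 100 students out of the 10,000.

```python
import random

# Hypothetical population: 10,000 made-up student ID numbers.
student_ids = list(range(1, 10_001))

def simple_random_sample(population, n, seed=None):
    """Pick n distinct members; every member is equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed only makes the draw repeatable
    return rng.sample(population, n)  # sampling without replacement

sample = simple_random_sample(student_ids, 100, seed=42)
print(len(sample), len(set(sample)))  # 100 students, no one picked twice
```

Nothing in this sketch looks at gender, class year, or major, which is why a sample drawn this way can still come out lopsided just by chance.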

And so to mitigate that, there are other techniques at our disposal. One technique is a stratified sample. This is the idea of taking our entire population and essentially stratifying it, that is, splitting it into groups.

So let’s say we take that same population, I’ll draw it as a square here just for convenience, and let’s say we’re concerned that we get an appropriate sample of freshmen, sophomores, juniors, and seniors. So we’ll stratify it by freshmen, sophomores, juniors, and seniors, and then we sample 25 from each of these groups. So these are the strata.

This is freshmen, sophomores, juniors, and seniors, and instead of just sampling 100 out of the entire pool, we randomly sample 25 from each of these. And so that makes sure that you are getting indicative responses from all of the different class years, or levels, within your university.
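The stratified approach can be sketched the same way. This is only an illustration under stated assumptions: the roster is made up (10,000 students split evenly across the four class years), and `stratified_sample` is a hypothetical helper, not a standard library function.

```python
import random
from collections import defaultdict

def stratified_sample(students, per_stratum, seed=None):
    """students is a list of (student_id, class_year) pairs.
    Returns a dict with per_stratum randomly chosen IDs for each class year."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for student_id, year in students:
        strata[year].append(student_id)  # group the population by class year
    # Take a separate simple random sample inside every stratum.
    return {year: rng.sample(ids, per_stratum) for year, ids in strata.items()}

# Hypothetical roster: 2,500 students in each class year.
years = ["freshman", "sophomore", "junior", "senior"]
roster = [(i, years[i % 4]) for i in range(10_000)]

sample = stratified_sample(roster, per_stratum=25, seed=0)
print({year: len(ids) for year, ids in sample.items()})  # 25 from every year
```

The total is still 100 students, but now each class year is guaranteed exactly 25 of them instead of leaving that split to chance.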

Now there might be another issue where you say, well, I’m actually more concerned that we have accurate representation of males and females in the school. If I pick 100 random people, it’s very likely that the split is close to 50/50, but there’s some chance, just due to randomness, that the sample is disproportionately male or disproportionately female. And that’s even possible in the stratified case. And so what you might say is, well, there’s a technique called a clustered sample, and what we do there is we sample groups.
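As a rough sketch of what sampling groups rather than individuals can look like, assume the groups are classrooms. The classroom names, the class size of 25, and the helper `cluster_sample` are all hypothetical. The sketch picks whole classrooms at random and then surveys every student in the chosen ones.

```python
import random

def cluster_sample(classrooms, n_clusters, seed=None):
    """classrooms maps a classroom name to the list of student IDs in it.
    Randomly choose n_clusters classrooms, then take everyone in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(classrooms), n_clusters)  # sample the groups
    surveyed = [s for room in chosen for s in classrooms[room]]  # survey all members
    return chosen, surveyed

# Hypothetical school: 400 classrooms of 25 students each (10,000 students).
classrooms = {f"room_{r}": list(range(r * 25, (r + 1) * 25)) for r in range(400)}

chosen, surveyed = cluster_sample(classrooms, n_clusters=4, seed=1)
print(len(chosen), len(surveyed))  # 4 classrooms, 100 students
```

The randomness here is in which classrooms get picked; once a classroom is chosen, there is no further sampling inside it.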

Each of those groups, we feel confident, has a good balance of males and females. So, for example, instead of sampling individuals from the entire population, and as you can tell this is not a trivial thing to do, let’s just say that we can split our population into groups. Maybe these are classrooms, and each of these classrooms has an even distribution of males and females, or pretty close to an even distribution. And so what we do is we sample the actual classrooms. That’s why it’s called cluster, or cluster technique, or clustered random sample: we’re going to randomly sample our classrooms, each of which has a close or maybe an exact balance of males and females, so we know that we’re gonna get good representation. We are still sampling, we are sampling from the clusters, but then we’re gonna survey every single person in each of the clusters we picked, every single person in each of those classrooms.

So, once again, these are all forms of random surveys, or random samples: you have the simple random sample, you can stratify, or you can cluster, randomly pick the clusters, and then survey everyone in those clusters.

Now if these are all random samples, what are the non-random things like? Well, one case of non-random is a voluntary survey, or voluntary sample, and this might just be you telling every student at the school, “Hey, here’s a web address. If you’re interested, come and fill out this survey.” And that’s likely to introduce bias, because maybe the students who really like the math instruction at their school are more likely to fill it out, maybe the students who really don’t like it are more likely to fill it out, or maybe it’s just the kids who have more time who are more likely to fill it out. So this has a good chance of introducing bias. The students who fill out the survey might be skewed one way or the other because, you know, they volunteered for it.

Another non-random sample is what’s often called a convenience sample, where you’re introducing bias because of convenience. This might say, well, let’s just sample the first 100 students who show up at school.

And that’s just convenient for me because I didn’t have to use random numbers, or do the stratification, or do any of this clustering. But you can understand how this also would introduce bias, because the first 100 students who show up at school, maybe those are the most diligent students, or maybe they all take an early math class that has a very good instructor and they’re all happy about it. Or it might go the other way: the instructor there isn’t the best one, and so it might introduce bias the other way. So if you let people volunteer, or you just say, “Oh, let me do the first N students,” or you say, “Hey, let me just talk to all of the students who happen to be in front of me right now,” they might be in front of you out of convenience, but they might not be a true random sample. Now there are other reasons why you might introduce bias, and it might not be because of the sampling.

You might introduce bias because of the wording of your survey. You could imagine a survey that asks, “Do you consider yourself lucky to get a math education that very few other people in the world have access to?” Well, that might bias you to say, “Well, yeah, I guess I feel lucky.” But what if the wording was, “Do you like the fact that disproportionately more students at your school tend to fail algebra than at our surrounding schools?” Well, that might bias you negatively. So the wording really, really matters in surveys, and there is a lot that goes into it.

And the other one is called response bias. And, once again, this isn’t about how the sample was picked. This is just people not wanting to tell the truth, or maybe not wanting to respond at all.

Maybe they’re afraid that somehow their response is gonna show up in front of their math teacher or the administrators, or that if they’re too negative, it might be taken out on them in some way. And because of that, they might not be truthful, and so they might be overly positive or not fill it out at all. So anyway, this is a very high-level overview of how you could think about sampling.

You want to go random because it lowers the probability of introducing some bias into your sample. And then these are some techniques for doing that. And also think about whether you’re falling into some of these pitfalls that have a good chance of introducing bias.