Chapter 2 The Basics of Statistics

Figure 2.1: A Statypus Shooting Hoops
The platypus is venomous, but only the males. A male platypus has spurs on its hind legs that excrete a toxin which may help it fight off other males during mating season. Don’t worry too much though, this toxin is not lethal to humans.18
Statistics can be roughly defined as the use of data to understand the world around us. Whether it be the desire to know if it is going to rain tomorrow or if a new medication can help patients, data can be used to leverage knowledge to better our lives. There is no escaping the ubiquity of data in the modern world. Some of the largest companies in the world deal the same wares: data. Whether it be the types of advertisements that an internet user is most likely to actually interact with or the amount of traffic on the roads between your home and the park, humans create more data in a day than they can imagine. The most successful companies have learned to leverage the immense amount of this data as well as the unbelievable speed of modern computing to offer technologies and products that have fundamentally changed the world we live in. With the addition of artificial intelligence, you can literally ask a digital assistant how long it will take to get to the park where you plan to meet your friends and it can detect a major traffic jam due to an accident on the primary path and offer the best alternative route nearly instantly. Underlying all of these advancements lie two main players: the advancement of physical technologies (computers, smartphones, etc.) as well as learning to understand the world around us via data… Statistics.
New R Functions Used
All functions listed below have a help file built into RStudio. Most of these functions have options which are not fully covered or not mentioned in this chapter. To access more information about other functionality and uses for these functions, use the ? command. E.g. To see the help file for sample, you can run ?sample or ?sample() in either an R Script or in your Console.
We will see the following functions in Chapter 2.
sample(): Takes a sample of the specified size from the elements of x.
sort(): Sort (or order) a vector or factor (partially) into ascending order.
2.1 Terminology
A Datum is a single piece of information about someone or something. Data is a collection of more than one datum. Data by itself is likely only interesting to the abstract thinkers such as the mathematicians and statisticians who live in an ivory tower.19 For the data to improve our world, data must be attached to someone or something. An Individual is a single element about which we can gather information. An individual need not be a person, but could be an apple, a part produced by a machine, a business, or anything else that can be viewed as a single element.
A single individual, like a single datum, is often all but worthless by itself. It is collections of individuals, like data, that allow us to begin to leverage the power of statistics.
Definition 2.1 A Population is a collection of all individuals that we wish to understand.
A Sample is any subset of a population.
The concept of an individual will get even more confusing when we introduce the idea of a Sampling Distribution in Section 9.1, but it is important to remember that both populations and samples are made out of individuals which are all homogeneous. While both a person and an apple may be able to be individuals, it is unlikely that they are individuals of the same population.
Definition 2.2 A Parameter is a numerical value that describes some characteristic of a population.
A Statistic is a numerical value that describes some characteristic of a sample.
Example 2.1 Let’s say we wanted to know the average height of a college student. This is easily stated and seems fairly simple, but once we begin to try and properly define things, we will see the need for being more specific.
If we are only going to be able to sample students from one university, we should then also adjust the population to be the students at that particular university. For discussion’s sake, we will call this fictional university, SU (Statypus University). Even if we wanted to define the population to be all SU students, even this offers a bit of ambiguity. Will we include graduate students or only undergraduate students? If we do restrict to undergraduate students, what about those students who are non-degree seeking? Some students taking undergraduate courses are only seeking a certificate or may be trying to earn college credit while still in high school. Do we want to include those people as individuals in the population we are considering?
At this point, you may be wondering why we are being so careful in defining our population of SU students for an inquiry as boring as average height and this is a very fair point. However, most readers of this book have likely been curious about the following type of question.
Anyone who has ever applied to a college or university has wanted to know the answer to this question (at least for real colleges and/or universities). If a college were to average in thousands of students who were only taking a single course as part of an online certificate or students not living on campus, this would obviously lower the average cost. As a potential student, you would want to know that the value they offered was a good estimate of the parameter in question.
Using the context of Example 2.1 above, answer the following questions.
1. How would you define the population to make the average cost estimate as useful to potential students as possible?
2. Assuming you have a sample of the population you defined in part 1, what are the statistic and parameter for this situation?
3. Do you think it would be possible to calculate the actual parameter value discussed in part 2?
Example 2.1 shows that the process of explicitly defining a population is essentially identical to explicitly defining an individual. While we framed the question as to what the population was, we actually were defining the characteristics that defined whether a person was an individual of the population or not.20 In practice, the definitions of individual and population are essentially different views of the same underlying concept.
Once we have defined the population we are wanting to research (we will see how to do research in Section 2.3), it comes time to gather information from individuals in the population.
Definition 2.3 A Census is the process of collecting information from an entire population.
A Survey is the process of collecting information from a sample of a population.
A major theme of statistics is trying to understand a parameter for a certain population. However, achieving a true census and actually ascertaining the real value of a parameter is often difficult if not impossible.
A major portion of statistical research involves trying to find a statistic to estimate a given parameter and then offering some level of certainty as to the accuracy of the estimate.
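The relationship between a parameter and a statistic can be sketched in R with a simulated population. The heights below are made-up values generated with rnorm, and the names heights_population and heights_sample are hypothetical, so this is only an illustration under those assumptions and not real data.

```r
# Hypothetical population: simulated heights (in cm) for 5000 students.
# These values are invented purely for illustration.
set.seed( 2024 )
heights_population <- rnorm( 5000, mean = 170, sd = 10 )

# Parameter: describes the whole population (usually unknown in practice).
parameter <- mean( heights_population )

# Statistic: describes a sample of 50 individuals from that population.
heights_sample <- sample( x = heights_population, size = 50 )
statistic <- mean( heights_sample )

# The statistic estimates the parameter, but the two rarely match exactly.
parameter
statistic
```

Here we can compute the parameter only because we built the whole population ourselves. With a real population, we would typically have only the statistic.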
The United States attempts to conduct a census of all persons living in U.S. residential structures. This includes citizens and all other persons who physically live within the United States. However, the response rates among counties in the country ranged from 13.3% to 84.9%.21 Thus, in the technical sense, the decennial United States Census is the government’s best attempt at a census, but in 2020 it only succeeded in gathering information from 67.0% of households, making it, technically, just a very large survey.
Example 2.2 PVC pipe is used for a lot of plumbing purposes these days. It has proven to be a very trustworthy and cost effective solution for situations such as removing water from a sump well to minimize the potential of flooding of basements. This plastic pipe is rated22 for a certain level of pressure, often measured in pounds per square inch or PSI.
However, the testing of PVC pipe requires finding the pressure where the pipe will fail or burst, thus destroying it in the process. Thus, if you were to test the strength of every individual piece of PVC pipe, you would destroy all of the piping. This is not the only type of testing that is destructive in nature, and it is a simple example of how gathering a true census is sometimes impossible.
Can you think of other situations when gathering a census would either be impossible or at least extremely difficult?
2.2 Types of Variables
The data associated with an individual can be broken into different Variables. A dataset may only have a single variable for each individual or it may have several.
Example 2.3 To illustrate this idea, we can look at a dataset called Asylum1849, which we can load with the following code.
Asylum1849 <- read.csv("https://statypus.org/files/Asylum1849.csv")
This dataset contains intake information for the Meerenberg Insane Asylum in Brederodelaan, Santpoort, The Netherlands.23
We can run str on the dataset to learn more about it.
str( Asylum1849 )
## 'data.frame': 241 obs. of 10 variables:
## $ Name : chr "Peets, Phillipus Christoffel" "de Bruin, Jan Adriaan" "Puntier, Willem Frederik" "Beekman, Johannes Theodorus" ...
## $ Diagnosis : chr "Imbecility" "Monomania" "Imbecility" "Monomania" ...
## $ Sex : chr "M" "M" "M" "M" ...
## $ Age : int 32 51 25 27 64 22 53 19 42 21 ...
## $ Profession : chr "none" "Office Worker" "none" "none" ...
## $ intake : chr "27 Jun 1849" "27 Jun 1849" "27 Jun 1849" "27 Jun 1849" ...
## $ residence : chr "Amsterdam" "Amsterdam" "Amsterdam" "Amsterdam" ...
## $ married.single: chr "Single" "Single" "Single" "Single" ...
## $ religion : chr "Reformed" "Reformed" "Lutheran" "Reformed" ...
## $ DiedInstitute : chr "yes" "yes" "yes" "no" ...
This tells us that Asylum1849 is a data frame, which we introduced in Section 1.6, and it contains information about 241 obs. of 10 variables.
This means that Asylum1849 has data on 241 individuals and data for (up to) 10 variables for each individual.24 The variable names are listed as Name, Diagnosis, Sex, Age, etc. and have values such as names, other character strings, dates, and numbers.
There are many different ways to categorize different types of data. We will take a fairly minimalist approach here and break variables into only 2 major groups: Quantitative and Qualitative Variables. This distinction is not based on whether the data is saved numerically, but on whether arithmetic operations offer meaningful results.
2.2.1 Quantitative Variable
Definition 2.4 A Quantitative Variable is a variable that has data for which arithmetic calculations offer meaningful information about the variable.
We acknowledge that the term meaningful in Definition 2.4 is a bit subjective. We will not use definitions in this section as strict rules, but simply as guidelines in an attempt to differentiate between different types of variables. In general, subtraction is the most common arithmetic operation that offers meaningful results.
Example 2.4 In the data frame Asylum1849, even though the variable intake is saved as a character string, it would make sense to say that the value 27 Jun 1849 is 7 days prior to the value 4 Jul 1849. That is, we can make the calculation of 4 Jul 1849 minus 27 Jun 1849 to get 7 days. Therefore we will consider intake a quantitative variable.
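The subtraction of dates in Example 2.4 can be checked in R. The sketch below uses as.Date(), which is not otherwise covered in this chapter, and writes the two dates in year-month-day form so that no format string is needed. The names first_intake and later_intake are hypothetical.

```r
# Dates written as year-month-day are parsed by as.Date() without
# needing a format string or depending on R's language settings.
first_intake <- as.Date( "1849-06-27" )
later_intake <- as.Date( "1849-07-04" )

# Subtracting two Date objects gives a time difference in days.
days_between <- later_intake - first_intake
as.numeric( days_between )
## [1] 7
```

The fact that this subtraction gives an interpretable answer is what makes intake behave like a quantitative variable, even though it is stored as character strings in the data frame.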
2.2.2 Qualitative Variable
Example 2.5 The data frame Asylum1849 had many other variables which were stored in the chr (character string) format like intake. For example we can look at the Diagnosis variable. The first two values of Diagnosis are "Imbecility" and "Monomania".25 While we can suppose it was possible for someone to have both imbecility and monomania, the concept of adding the diagnoses or doing any other arithmetic operation on the values offers no meaningful result.
Definition 2.5 A Qualitative Variable, sometimes called a Categorical Variable, is any variable where arithmetic calculations are impossible or where they would not offer any meaningful information. The values that the variable takes on are called Categories.
Example 2.6 We begin by loading in the file called BabyData1. This file contains data about the births of a sample of babies.
BabyData1 <- read.csv( "https://statypus.org/files/BabyData1.csv" )
We will do a deeper dive into BabyData1 starting in Chapter 3, but for now we will just focus on the variable BabyData1$orig_id.
str( BabyData1$orig_id )
## int [1:200] 1047483 1468100 2260016 3583052 795674 3544316 3726920 2606970 2481971 243759 ...
We see that this variable is made up of 200 integers. However, if we were to subtract the first entry from the second, as shown in the R code below,26 the result has no meaning to us as the individual values are simply the original ID numbers for each of the births.
BabyData1$orig_id[ 2 ] - BabyData1$orig_id[ 1 ]
## [1] 420617
Thus, even though BabyData1$orig_id contains integer data, the variable it is measuring is only qualitative. We will expand on this topic in Chapter 2.
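One way to make the qualitative nature of ID numbers explicit in R is to store them as character strings. The sketch below is only an illustration: it copies the first five values from the str output above into a hypothetical vector named orig_id rather than downloading the file.

```r
# The first five ID values shown in the str() output above.
orig_id <- c( 1047483L, 1468100L, 2260016L, 3583052L, 795674L )

# R will happily compute an average, but the result describes nothing
# about the births because the IDs are only labels.
mean( orig_id )

# Storing the IDs as character strings makes their role as labels explicit
# and prevents accidental arithmetic.
orig_id_labels <- as.character( orig_id )
class( orig_id_labels )
## [1] "character"
```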
2.2.3 Discrete vs. Continuous Variable
If we are looking at a quantitative variable, we can further categorize it based on how “granular” it is. The basic idea is whether or not there is any “space” in between distinct values the variable can take on. For example, you can observe either 1 or 2 people that benefit from a certain treatment, but most people have a height that falls between 1 and 2 meters. It is not possible to observe 1.83 people benefit from a new cancer treatment, but a person can easily be measured at 1.83 meters tall.
Definition 2.6 A Discrete Quantitative Variable or simply a Discrete Variable is a quantitative variable that takes on only integer values.27
Example 2.7 We return to the mtcars dataset we previously saw in Chapter 1. We begin by running str on the dataset.
str( mtcars )
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
We can also run ?mtcars to pull up the help file for this data frame. Doing this we can learn that the variable cyl gives the number of cylinders that a car has. The value of cyl can thus only be a whole number making mtcars$cyl a discrete quantitative variable.
Definition 2.7 A Continuous Quantitative Variable or simply Continuous Variable is a quantitative variable that can take on any value in some interval of the real line.
Example 2.8 Returning to the mtcars data frame, we can now look at the variable qsec. This variable records the time it took for the car to drive a quarter mile from a complete stop. This variable can take on any value and is not restricted to integer values.
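The contrast between cyl and qsec can be seen in the recorded values themselves. The sketch below uses unique(), all(), and round(), which are not otherwise covered in this chapter. Keep in mind that checking the recorded values is only supporting evidence: whether a variable is discrete or continuous depends on what it measures, not just on the values that happen to appear in a dataset.

```r
# mtcars ships with R, so no download is needed.
# cyl: every recorded value is a whole number (4, 6, or 8 cylinders).
sort( unique( mtcars$cyl ) )
## [1] 4 6 8
all( mtcars$cyl == round( mtcars$cyl ) )
## [1] TRUE

# qsec: quarter-mile times are recorded with decimal places,
# so not every value is a whole number.
all( mtcars$qsec == round( mtcars$qsec ) )
## [1] FALSE
```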
2.2.4 Discretized Continuous Data
There is an awkward reality when it comes to measuring continuous variables. Any instrument is only able to measure to a certain level of precision, so every continuous variable is essentially saved as discrete data. The simplest way to make this distinction clear is to consider the following question:
Most people will be able to quickly answer this question, but nearly everyone will respond with an integer. A college freshman will likely say “I am 18” even if it is not their 18th birthday. Our age is definitely a continuous variable, but we often record it (and mentally store it) as a discrete value measured in years. That is, we essentially are interpreting the question as
This may differ for babies and young children whose age may be talked about in terms of weeks, months, or halves of years (or even hours or days for newborns), but most adults view their age as a discrete value despite the fact that our age is by definition continuous data.
We end this remark with the following concept.
A continuous variable is a variable that can take on any value within a range of real numbers. However, nearly all continuous variables are saved as discrete values once we measure them to a certain level of precision. That is, the underlying variable is continuous, but we record the data as a discrete value.
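The age example can be sketched in R. The two dates below are hypothetical and chosen only for illustration, and dividing by 365.25 days per year is an approximation that averages in leap years.

```r
# Hypothetical dates chosen only for illustration.
birth_date <- as.Date( "2005-03-15" )
asked_on   <- as.Date( "2023-09-01" )

# Age as a continuous quantity, measured in years
# (365.25 days per year is an approximation).
age_exact <- as.numeric( asked_on - birth_date ) / 365.25
age_exact

# Age as most adults report it: the number of completed years.
age_reported <- floor( age_exact )
age_reported
## [1] 18
```

The underlying quantity age_exact keeps changing from day to day, while the recorded value age_reported only changes once a year.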
Can you think of another type of variable that we mentally store as discrete but is actually continuous?
2.3 Types of Research
There are numerous books and courses being taught on Research Design at both the undergraduate and graduate level. There are numerous degrees, even at the graduate level, for each of the topics in Sections 2.3, 2.4, and 2.5. As such, these sections should not be seen as exhaustive, but only as a brief introduction to concepts we will essentially take for granted here at Statypus. That is, we will nearly always begin our investigations after a sample has already been found and surveyed.
There are many different ways in which to conduct statistical research. However, it can be broken down into 2 major types. The distinction basically comes down to how “clean” you want to keep your hands. While this metaphor may be confusing, we hope the following sections will clear this up.
2.3.1 Observational Studies
Definition 2.8 An Observational Study is a form of statistical research where an inference is made about a population from a sample where the researcher has no control over the individuals in the study.
The lack of control may occur for logistical or ethical reasons. For example, if you wanted to examine the effects of domestic violence on the intelligence of the children raised in the home, it would be unethical to control the amount of domestic violence that a child was exposed to. The effects of lead or other heavy metals in drinking water are also a concern for childhood development, but controlling water to guarantee that some children were exposed to heavy metals would be highly unethical.
We would be remiss if we didn’t acknowledge the lack of content about ethics within statistics on Statypus. The need for ethical practices is vital to the subject as well as the world that is impacted by it, but this text will not make many statements about ethics. We will, however, refer the reader to this Article by the American Statistical Association for a set of guidelines that all good statypi look to uphold.
Definition 2.9 A Longitudinal Study is a type of observational study that makes repeated measurements of the same variable(s) over a long period of time.
Example 2.9 An example of a Longitudinal Study is a Cohort Study where a cohort (a group of people with a common characteristic) is examined over time. The Nurses’ Health Study gathered its first cohort in 1976, a group of over 121,000 female registered nurses who were married and between the ages of 30 and 55. A second cohort of over 116,000 female nurses was begun in 1989. A third cohort was begun in 2010 and marked the first instance of including male nurses into the study. These cohorts have offered data across wide ranges of human variables and continue to gather further information.
Observational studies can also be Retrospective and look at already existing data. For example, a private high school may use the success of their graduates to provide evidence that the school is a good choice for prospective students and parents.
2.3.2 Experiments
Definition 2.10 A Statistical Experiment or simply an Experiment is a form of statistical research where the researcher attempts to control every variable except for the ones being examined.
As compared to an observational study, in an experiment the researcher is able to get their hands “dirty” and actually affect the individuals. In an observational study, the individuals can be imagined as living within a “glass dome” where they can be observed and measured, but where the researcher cannot affect them in any way. In an experiment, the researcher gets to set the conditions as meticulously as they like to attempt to minimize different forms of bias.
The value of a variable that a researcher controls is called a Treatment. For example, if a researcher wanted to test if a new chemical formula would improve the strength of a belt used in an engine, there would be (at least) two types of treatment: the original formula and the new formula.
One major method for minimizing bias in an experiment is the concept of Blinding. An experiment is said to be Single Blinded if the individual does not know the type of treatment they are being given and an experiment is said to be Double Blinded if neither the individual nor the person administering the treatment knows the type that it is. In order to ensure blinding, researchers will often utilize something called a Placebo. A placebo is a treatment which is meant to have no impact on the individual but which mimics the delivery of a treatment which may have an impact.
Example 2.10 Suppose we wanted to test if a new chemical compound can help a person recover more quickly from strenuous exercise. The researcher would divide their sample of individuals into two groups. After having both groups endure similar levels of strenuous exercise, one group would be given the chemical compound in the form of a pill. The other group would also be given a pill which looks as similar to the new pill as possible but which should not have any impact on their ability to recover. This way the individuals would not know whether or not they had been given the chemical compound, which could remove the influence of the Placebo Effect.
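Example 2.10 does not say how the two groups are formed. One common choice, assumed here rather than taken from the example, is to assign individuals to the groups at random. The sketch below uses sample() (discussed in Section 2.5) along with setdiff() on 20 hypothetical participants numbered 1 to 20.

```r
# 20 hypothetical participants, identified only by number.
participants <- 1:20

# Randomly choose half of them to receive the new compound.
treatment_group <- sort( sample( x = participants, size = 10 ) )

# Everyone not chosen receives the placebo pill instead.
placebo_group <- setdiff( participants, treatment_group )

treatment_group
placebo_group
```

Because the computer chooses the groups, the researcher's own preferences cannot influence who receives which pill.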
2.4 Bias
The major goal of a sample is to offer a collection of individuals that represents a population. However, it is unlikely, if not impossible, that a sample will yield a statistic that exactly aligns with a population parameter. As Critical Consumers of statistics, we must always be on the lookout for cases where a statistical result does not offer a good estimate of a population/parameter.
Definition 2.11 Bias is any systematic tendency for a statistical calculation to give an inaccurate estimate about a population.
It is the view of us here at Statypus that bias can never be completely eliminated as some sources are often unknown. However, it should always be the goal of a researcher to minimize any source of bias as much as is practical.
Before getting into explicit examples of bias, we will break the concept into two major subgroups. There are many other ways we could refine our understanding of bias and our take on it here at Statypus is admittedly simplistic. We encourage all readers to think of other ways that bias can enter statistical research and the impact it has.
Definition 2.12 Sampling Bias is any bias that results from the methods and techniques used to gather a sample.
Example 2.11 One example of sampling bias occurred during the 1969 United States military draft.28 Despite the government’s best efforts, the resulting sample of people drafted into military service was biased. People with birthdays later in the year had a higher chance of being selected than those born earlier in the year.
An over-simplified version of what happened was that all possible birthdays (neglecting the year of birth) were written on pieces of paper and placed into a barrel. The barrel was then mixed and at that point the pieces of paper were pulled from the barrel. However, the pieces of paper (which were in small plastic canisters) were placed into the barrel from 1 January to 31 December. As such, the later dates started on the top of the barrel.
The mixing method was intended to be random, but it did not do an adequate job and left later dates on top of the barrel at a higher proportion than should have existed randomly.
Example 2.12 While it may seem totally foreign to most modern readers, there used to be huge books produced regularly that listed out the home address and telephone number of the head of nearly every United States household. Telephone books used to be a very common resource that was required to interact with the world and now people often don’t even know the actual number of people they call (or message) the most.
The loss of land-line telephones marked a decided shift in the way that a lot of surveys were given. It is much more common now to receive numerous requests to be a member of a sample in your email inbox daily. This barrage of requests has definitely caused a shift in Nonresponse Bias in the 21st century. This type of bias occurs when a certain portion of a population does not respond to requests to partake in a survey which causes the eventual sample to not represent the views of the population as a whole.
In particular, most people are unwilling to fill out surveys about topics about which they do not feel strongly. It is often only those people whose opinions are the most extreme that will elect to participate in surveys. Some people will quickly avoid a conversation or survey about controversial topics such as gun control or abortion while others will eagerly offer their input. This element of Self Selection is unlikely to yield a sample which can be viewed as representative of a population.
Example 2.13 Sampling bias can also occur when the individuals are not people. If we wanted to test the quality of a piece of a toy plane being manufactured, we could randomly pull a sample from the production line. However, if the sample was simply the first 20 pieces off the machine on a given day it may not adequately represent the defects that occur as the machine gets too hot in the afternoon heat of the factory.
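The difference between the two ways of choosing pieces in Example 2.13 can be sketched in R. The setup is hypothetical: we assume 500 pieces are made in a day and are numbered in the order they leave the machine. The sketch uses sample() and sort(), which are discussed in Section 2.5.

```r
# Hypothetical setup: 500 pieces numbered in the order they leave the machine.
pieces <- 1:500

# Taking the first 20 pieces only ever covers the start of the day.
first_twenty <- pieces[ 1:20 ]
range( first_twenty )
## [1]  1 20

# A random sample of 20 can include pieces from any part of the day.
random_twenty <- sort( sample( x = pieces, size = 20 ) )
range( random_twenty )
```

The range of the random sample will change from run to run, but it will typically stretch across much more of the production day than the first 20 pieces do.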
Definition 2.13 Survey Bias is any bias that results from the process that begins when a sample is determined and ends with a calculation being made.
Example 2.14 In comparison to the nonresponse bias we saw just a moment ago, the concept of Response Bias is a type of Survey Bias. Roughly speaking, Response Bias occurs when an individual does not answer accurately, which can happen for any of a number of reasons. Some people try to avoid conflict at all costs while others seek out places to engage in debate. In either extreme, it is possible for people to offer answers that do not accurately represent their actual view. This type of bias can be reduced by ensuring questions are worded in a way that does not ask an individual to agree with any stated opinion, but to simply offer their opinion on a topic. The order of available answers to questions as well as the order of the questions themselves can also have an impact on the data that is collected.
Response bias can also happen if the individual is reluctant to express displeasure about a product or service in order to not be mean towards the questioner. If a baker asks you if you liked their cookie, you may offer them a good response despite not having actually enjoyed the dessert. It has also been found that people will often behave differently when they know they are being recorded for an experiment. They may try to figure out what the experimenter “wants to hear” and respond accordingly.
In any situation, when an individual offers a response that does not correctly reflect their opinion, a survey will suffer Response Bias.
Example 2.15 If, for example, our experiment involves making measurements of a certain plant, the concept of Measurement Bias can occur. This bias can arise if different researchers use different methods or tools to measure individual plants. If not carefully accounted for by training the researchers, it is not unlikely that two different people would measure the height of a given plant differently. One person may measure from the soil level to the top of the plant as they gently stretch the plant to its highest point while another person would simply measure to the top of where the plant fell on its own.
Think of other ways that bias could occur in statistical research. Can you think of ways to remedy or minimize these sources of bias?
2.5 Sampling
We defined sample in Section 2.1. We know that the purpose of a sample is to represent a population when we are unable to gather information from the entire population. Therefore, it is vital to have some basic understanding of the fundamentals of what should be considered when sampling a population to minimize sampling bias.
2.5.1 Simple Random Samples
One of the simplest ways to minimize sampling bias is to ensure that our sample is chosen at random and not selected for any ulterior motives. The classic way of doing this is often described as putting names in a hat and drawing them one at a time. If done correctly, this should give a Simple Random Sample.
Definition 2.14 The process of creating a Simple Random Sample, or SRS, of size \(n\) from a group of \(N\) items means that we choose a subset of \(n\) items such that every subset of size \(n\) is equally likely.
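To get a feel for Definition 2.14, we can count how many subsets are possible. The values N = 25 and n = 3 below are chosen only for illustration, and choose() is a base R function that is not otherwise covered in this chapter.

```r
# Illustrative values: a group of N = 25 items and subsets of size n = 3.
N <- 25
n <- 3

# choose() counts how many different subsets of size n exist.
number_of_subsets <- choose( N, n )
number_of_subsets
## [1] 2300

# In a simple random sample, each of those subsets has the same probability.
1 / number_of_subsets
```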
Regardless of intent, human beings are not really capable of being truly random. If you have any doubt of this, there are many places where you can play the classic “Rock, Paper, Scissors” game online against an AI opponent. If you could play the game truly randomly, then the computer should not be able to beat you more than one third of the time. However, this is rarely the case and the computer will often have a much higher win rate.
Randomness is absolutely essential in statistics and choosing a Simple Random Sample (SRS) is a critical tool to prevent bias from entering into an experiment or observational study. The use of random digit tables was a standard technique for many years, but computers can remove the tedious techniques needed for that process and produce results much quicker. The ability to generate samples quickly and to be able to integrate them directly into our calculations is a technique that can be exploited in many areas of statistics.
2.5.1.1 Using sample for an SRS
To select an SRS with R, we first list the individuals in the population we are studying. Every individual in the population is given a number from 1 up to the number of people in the population. We then use the R function sample() to randomly select n numbers from our list.
The syntax of sample is
sample( x , size ) #Will only run if x and size are given correctly.
where the arguments are:
x: The vector of elements from which you are sampling.
size: The number of individuals you want in your sample.
We will use more arguments in sample in the future, but for an SRS, we need only these 2 arguments.
The way you choose which number goes to each person in the population does not matter, as R will handle the issue of randomness.
For this context, we will often let x be a vector of form 1:N where N is the number of people in the population that we are sampling from.
Let’s pretend we want to randomly select 3 people from a class of 25 students to represent the class at a conference. The code
x <- 1:25
creates a vector named x that contains the numbers \(1, 2, 3, \ldots, 24, 25\). To randomly select one of these numbers we can use the following code:
sample( x = 1:25, size = 1 )
## [1] 20
We randomly found the value of 20. However, if you run the code above, your computer likely gave you a different number, and if you run the code again, you will most likely get yet another number.
Each time we run sample we are likely to get a new result, so the fact that two samples do not match actually indicates that R is using a random process.
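If you need the same "random" result every time, for example so that a classmate can check your work, base R offers set.seed(). This function is not covered in this chapter, so treat the sketch below as an aside. The seed value 42 is arbitrary.

```r
# Setting the seed fixes the starting point of R's random number generator.
set.seed( 42 )
first_draw <- sample( x = 1:25, size = 1 )

# Resetting the same seed and repeating the call reproduces the same draw.
set.seed( 42 )
second_draw <- sample( x = 1:25, size = 1 )

first_draw == second_draw
## [1] TRUE
```

Without the calls to set.seed, the two draws would usually differ.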
The size argument tells sample how many random numbers you want, which is what we called n. So if we wanted 3 people and not just 1, we can use the code:
sample( x = 1:25, size = 3 )
## [1] 20 25 14
Observe that the three numbers chosen are all different. Rerun this code to see you (almost certainly) get a new string of numbers.
We have now found an SRS of size n = 3 for a group of 25 people. The output above has randomly selected persons numbered 20, 25, and 14.
Unless you tell it otherwise, sample will never choose the same outcome twice.
If you ask sample for more values than the size of x, you get an error. For example, if we tried to select 30 people from a population of 25, we would get the following error.
sample( x = 1:25, size = 30 )
## Error in sample.int(length(x), size, replace, prob): cannot take a
## sample larger than the population when 'replace = FALSE'
This error can be remedied with the concept of resampling, which uses a different argument in sample. We will not cover that concept here, but we will return to it in Chapter 6.
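If you want a script to keep running after a request like this fails, one possible approach is to wrap the call in base R's tryCatch(). The sketch below is only an illustration of catching the error, not a fix for it, and the name result is a placeholder.

```r
result <- tryCatch(
  sample( x = 1:25, size = 30 ),
  error = function( e ) {
    # Report the message R produced, then carry on.
    message( "sample() failed: ", conditionMessage( e ) )
    NULL                                # signal that no sample was drawn
  }
)

is.null( result )                       # TRUE, since 30 is larger than 25
```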
For sample and many other functions in R, you are not required to name the arguments with x = ... or size = ... as long as these come first and second in the function. For example:
sample( 1:25, 3 )
## [1] 13 20 17
gives another SRS with the same arguments as above.
However, it is often clearer to explicitly name the arguments to complicated functions like sample. Use your best judgment, and include the argument name if there is any doubt.
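The sketch below compares the two calling styles directly. It assumes base R's set.seed(), which fixes the starting point of R's random number generator so that repeated draws can be compared; the seed value 7 is arbitrary.

```r
set.seed( 7 )                           # arbitrary seed so the draws can be compared
by_position <- sample( 1:25, 3 )

set.seed( 7 )
by_name <- sample( x = 1:25, size = 3 )

set.seed( 7 )
reordered <- sample( size = 3, x = 1:25 )   # named arguments may come in any order

identical( by_position, by_name )       # TRUE
identical( by_name, reordered )         # TRUE
```

Naming the arguments costs a few keystrokes, but it makes a call such as the third one safe even when the order is changed.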
2.5.1.2 Using sort for an SRS
As we will be using this list, it is often helpful to have the values in numerical order. We can do this with the sort() function.
The syntax of sort is
sort( x, decreasing ) #Will only run if x is appropriately defined
where the arguments are:
x: The vector of elements which you are sorting.
decreasing: Should the sort be increasing or decreasing? The default is FALSE, which arranges x in ascending order, while setting decreasing = TRUE will arrange x in descending order.
If you simply run sort( x ), it will return the vector x in ascending order.
If the values of x are not numeric, this can sometimes give puzzling results.
The sort function has many other uses than the one shown here and the curious reader can see the help file at ?sort to see other uses.
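The short sketch below illustrates both points: the decreasing argument, and one way non-numeric values can surprise you. The vector as_text is a made-up example of numbers that have been stored as text.

```r
picks <- c( 20, 25, 14 )                # the sample found earlier in this section

sort( picks )                           # 14 20 25
sort( picks, decreasing = TRUE )        # 25 20 14

# Stored as text, values are ordered character by character, not by size.
as_text <- c( "20", "9", "100" )
sort( as_text )                         # "100" "20" "9"
sort( as.numeric( as_text ) )           # 9 20 100
```

Converting with as.numeric() before sorting is one way to recover the numerical order.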
Returning to our example, we originally got a SRS comprised of the values 20, 25, and 14. While it is trivial in this case, we use it as an example of how to automate the process. We can now use the code:
sort( sample( x = 1:25, size = 3 ) )
## [1] 14 20 25
to get the values 14, 20, and 25.
We summarize the above work below.
To find a SRS of size n from a population of size N, we use the code:
#N and n must be entered or defined to run
sort( sample( x = 1:N, size = n ) )
or more succinctly
#N and n must be entered or defined to run
sort( sample( 1:N, n ) )
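If you expect to repeat this often, the template can be wrapped in a small function. The name srs_numbers below is a hypothetical helper for illustration, not a built-in R function.

```r
# Hypothetical helper, not part of base R.
srs_numbers <- function( N, n ) {
  # Stop early if more individuals are requested than exist.
  stopifnot( n <= N )
  sort( sample( x = 1:N, size = n ) )
}

# A SRS of size 3 from a population of size 25.
srs_numbers( N = 25, n = 3 )
```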
Example 2.16 (Choosing a Random Lineup) Let’s pretend we want to choose a SRS of the 1995-1996 Chicago Bulls basketball team to play the last 2 seconds of a blowout victory. That is, we want a SRS of size n = 5 from the 1996 Bulls roster. First we begin by downloading the roster from the internet.
Bulls1996 <- read.csv( "https://statypus.org/files/ChicagoBulls1996.csv" )
This gives us the following roster:
Bulls1996[ , 1:7 ]
##    No.          Player Pos Ht  Wt         Birth.Date Exp
## 1    0     Randy Brown  PG 74 190       May 22, 1968   4
## 2   30    Jud Buechler  SF 78 220      June 19, 1968   5
## 3   35    Jason Caffey  PF 80 255      June 12, 1973   0
## 4   53   James Edwards   C 84 225  November 22, 1955  18
## 5   54      Jack Haley   C 82 240   January 27, 1964   6
## 6    9      Ron Harper  PG 78 185   January 20, 1964   9
## 7   23  Michael Jordan  SG 78 198  February 17, 1963  10
## 8   25      Steve Kerr  PG 75 175 September 27, 1965   7
## 9    7      Toni Kukoč  SF 82 192 September 18, 1968   2
## 10  13     Luc Longley   C 86 265   January 19, 1969   4
## 11  33  Scottie Pippen  SF 80 210 September 25, 1965   8
## 12  91   Dennis Rodman  PF 79 210       May 13, 1961   9
## 13  22     John Salley  PF 83 230       May 16, 1964   9
## 14   8 Dickey Simpkins  PF 81 248      April 6, 1972   1
## 15  34 Bill Wennington   C 84 245     April 26, 1963   8
We have chosen to omit column 8 here so that the table better fits most screens. Bulls1996[ , 8] is the same as Bulls1996$College and contains the college that each player went to. (Well, except for Toni Kukoč, who turned pro at the age of 17 in his native country of Croatia.)
We then note that N = 15 (the size of the population, that is, the roster size) and n = 5 (the size of the sample we want) and use the code:
SRSBulls <- sort( sample( x = 1:15, size = 5 ) )
SRSBulls
## [1] 5 8 9 10 14
We then return to our enumerated list of the players and select players on the rows listed above to get the following SRS:
Bulls1996[ SRSBulls , ]
##    No.          Player Pos Ht  Wt         Birth.Date Exp    College
## 5   54      Jack Haley   C 82 240   January 27, 1964   6       UCLA
## 8   25      Steve Kerr  PG 75 175 September 27, 1965   7    Arizona
## 9    7      Toni Kukoč  SF 82 192 September 18, 1968   2
## 10  13     Luc Longley   C 86 265   January 19, 1969   4 New Mexico
## 14   8 Dickey Simpkins  PF 81 248      April 6, 1972   1 Providence
This gives all of the information about the rows which were in our sample. If we only wanted the players' names, we could use the following code:
Bulls1996[ SRSBulls , "Player" ]
## [1] "Jack Haley" "Steve Kerr" "Toni Kukoč" "Luc Longley"
## [5] "Dickey Simpkins"
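The same steps can be tried without downloading anything. The sketch below assumes a small hand-typed data frame, mini_roster, holding only the first six players and two of the columns from the roster above; it is a stand-in for illustration, not the full dataset. It also reads N from the data with nrow() rather than typing it in.

```r
# A hand-typed stand-in for the first few rows of the roster shown above.
mini_roster <- data.frame(
  Player = c( "Randy Brown", "Jud Buechler", "Jason Caffey",
              "James Edwards", "Jack Haley", "Ron Harper" ),
  Pos    = c( "PG", "SF", "PF", "C", "C", "PG" )
)

N <- nrow( mini_roster )                # population size, read from the data
rows <- sort( sample( x = 1:N, size = 2 ) )

mini_roster[ rows , ]                   # every column for the sampled rows
mini_roster[ rows , "Player" ]          # just the names
mini_roster$Player[ rows ]              # the same names, using $ instead
```

Using nrow() means the code keeps working if rows are later added to or removed from the data frame.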
2.5.2 Other Sampling Methods
The concept of a simple random sample is absolutely crucial for most modern statistical techniques. However, as it even says in its name, it is a very “simple” process that can leave a researcher wanting to have more control. There are many different advanced sampling methods that each offer their own advantages and requisite disadvantages which can be used in many fields of research.
2.5.2.1 Stratified Sampling
If a researcher wants to be able to draw conclusions from their work about any subgroup of a population, such as females, they must ensure that their sample represents the population with respect to the characteristics that define that subgroup. We could simply restrict our attention to this subgroup and call it our new population, or we could decide to look at a collection of subgroups into which individuals fall.
Definition 2.15 If we break a population into subgroups so that each individual falls into exactly one group, we call the collection of subgroups the Strata and an individual subgroup a Stratum.
Example 2.17 Gender is an obvious set of strata that can be used to break down a population. We could further break down a college student population by the student year (Freshman, Sophomore, etc.) of each student. In this instance, “Female” would not be a single stratum, and the strata would actually be more complex subgroups such as “Female Freshmen” and “Male Sophomores.” This shows that it is up to the researcher to properly define the strata that they want to investigate.
Definition 2.16 Once a population has been broken into the desired strata, a Stratified Sample can be found by finding a simple random sample from each stratum. Typically, the size of the sample from each stratum should be proportional to the share of the overall population that the stratum represents.
One of the benefits of a stratified sample is that it can allow a researcher to gather information about each stratum separately while also gathering information about the entire population.
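A stratified sample can be sketched in R by taking a SRS inside each stratum. Everything in the block below is an assumption made for illustration: the made-up population of 100 students, the stratum sizes, the total sample size of 10, and the use of rounding to set the size of each stratum's sample. It is one possible approach, not the only one.

```r
# Hypothetical population: 100 students, each belonging to one class-year stratum.
population <- data.frame(
  id   = 1:100,
  year = rep( c( "Freshman", "Sophomore", "Junior", "Senior" ),
              times = c( 40, 30, 20, 10 ) )
)
n <- 10                                 # total sample size we want

# Row numbers belonging to each stratum.
rows_by_stratum <- split( 1:nrow( population ), population$year )

# Proportional allocation: each stratum gets its share of the n draws.
sizes <- round( n * lengths( rows_by_stratum ) / nrow( population ) )

# Take a SRS of the chosen size within each stratum.
picked <- unlist( Map(
  function( rows, k ) rows[ sample( x = 1:length( rows ), size = k ) ],
  rows_by_stratum, sizes
) )

stratified <- population[ sort( unname( picked ) ), ]
table( stratified$year )                # 4 Freshman, 2 Junior, 1 Senior, 3 Sophomore
```

With these stratum sizes the rounded allocations add up to exactly 10. For other sizes, rounding may leave the total slightly above or below n, so the allocation would need a small adjustment.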
Surveys/polls are often given to investigate if certain referendums will be passed during an election. If you were trying to gather a sample to determine if a specific referendum will pass during an upcoming election, what strata would you want to break the population into?
Review
Definitions
Section 2.1
Definition 2.1
A Population is a collection of all individuals that we wish to understand.
A Sample is any subset of a population.
Definition 2.2
A Parameter is a numerical value that describes some characteristic of a population.
A Statistic is a numerical value that describes some characteristic of a sample.
Definition 2.3
A Census is the process of collecting information from an entire population.
A Survey is the process of collecting information from the sample of a population.
Section 2.2
Definition 2.4
A Quantitative Variable is a variable that has data for which arithmetic calculations offer meaningful information about the variable.
Definition 2.5
A Qualitative Variable, sometimes called a Categorical Variable, is any variable where arithmetic calculations are impossible or where they would not offer any meaningful information. The values that the variable takes on are called Categories.
Definition 2.6
A Discrete Quantitative Variable or simply a Discrete Variable is a quantitative variable that takes on only integer values.29
Definition 2.7
A Continuous Quantitative Variable or simply Continuous Variable is a quantitative variable that can take on any value in some interval of the real line.
Section 2.3
Definition 2.8
An Observational Study is a form of statistical research where an inference is made about a population from a sample where the researcher has no control over the individuals in the study.
Definition 2.9
A Longitudinal Study is a type of observational study that makes repeated measurements of the same variable(s) over a long period of time.
Definition 2.10
A Statistical Experiment or simply an Experiment is a form of statistical research where the researcher attempts to control every variable except for the ones being examined.
Section 2.4
Definition 2.11
Bias is any systematic tendency for a statistical calculation to give an inaccurate estimate about a population.
Definition 2.12
Sampling Bias is any bias that results from the methods and techniques used to gather a sample.
Definition 2.13
Survey Bias is any bias that results from the process that begins when a sample is determined and ends with a calculation being made.
Section 2.5
Definition 2.14
The process of creating a Simple Random Sample, or SRS, of size \(n\) from a group of \(N\) items means that we choose a subset of \(n\) items such that every subset of size \(n\) is equally likely.
Definition 2.15
If we break a population into subgroups so that each individual falls into exactly one group, we call the collection of subgroups the Strata and an individual subgroup a Stratum.
Definition 2.16
Once a population has been broken into the desired strata, a Stratified Sample can be found by finding a simple random sample from each stratum. Typically, the size of the sample from each stratum should be proportional to the share of the overall population that the stratum represents.
Big Ideas
Section 2.1
A major theme of statistics is trying to understand a parameter for a certain population. However, achieving a true census and actually ascertaining the real value of a parameter is often difficult if not impossible.
A major portion of statistical research involves trying to find a statistic to estimate a given parameter and then offering some level of certainty as to the accuracy of the estimate.
Section 2.4
It is the view of us here at Statypus that bias can never be completely eliminated as some sources are often unknown. However, it should always be the goal of a researcher to minimize any source of bias as much as is practical.
Section 2.5
One of the benefits of a stratified sample is that it can allow a researcher to gather information about each stratum separately while also gathering information about the entire population.
Important Alerts
There are numerous books and courses being taught on Research Design at both the undergraduate and graduate level. There are numerous degrees, even at the graduate level, for each of the topics in Sections 2.3, 2.4, and 2.5. As such, these sections should be seen not as exhaustive but only as a brief introduction to concepts we will essentially take for granted here at Statypus. That is, we will nearly always begin our investigations after a sample has already been found and surveyed.
Important Remarks
Section 2.1
Example 2.1 shows that the process of explicitly defining a population is essentially identical to explicitly defining an individual. While we framed the question as to what the population was, we actually were defining the characteristics that defined whether a person was an individual of the population or not.30 In practice, the definitions of individual and population are essentially different views of the same underlying concept.
Section 2.2
A continuous variable is one whose value can be any number within a range of real numbers. However, nearly all continuous variables are saved as discrete values once we measure them to a certain level of precision. That is, the underlying variable is continuous, but we record the data as a discrete value.
Code Templates
Section 2.5
To find a SRS of size n from a population of size N, we use the code:
#N and n must be entered or defined to run
sort( sample( x = 1:N, size = n ) )
or more succinctly
#N and n must be entered or defined to run
sort( sample( 1:N, n ) )
Exercises
Exercise 2.1 What code would you use to select a simple random sample of size 17 from the United States senators? [Note: At the time of writing, there were 100 US Senators.]
Exercise 2.2 Find a SRS of size 3 from the population of people who have set foot on the moon. Explain all the steps used to find your SRS, including any code used.
Exercise 2.3 Consider the dataset mtcars. Find a SRS of size n = 5 of the individuals in the dataset and display the information for just this sample. The result of one possible sample is shown below. If your sample is identical, please choose another sample.
##                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Duster 360         14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 450SLC        15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Pontiac Firebird   19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Volvo 142E         21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
Exercise 2.4 Using the dataset mtcars, find a SRS of size n = 5 of the vehicles that had an automatic transmission and display the information for just this sample.
De Plater, G.; Martin, R. L.; Milburn, P. J. (1995). “A pharmacological and biochemical investigation of the venom from the platypus (Ornithorhynchus anatinus)”. Toxicon. 33 (2): 157–69.↩︎
For the more mathematically curious reader, this is the same as defining a set based on a rule that dictates whether an item is an element of the set or not.↩︎
See this article for a further dive into the response rates for the 2000, 2010, and 2020 censuses.↩︎
We say up to 10 variables for each individual because as we mentioned in Section 1.5.3, datasets often have missing values.↩︎
The terms Imbecility and Monomania are likely not familiar to the modern reader. They are obsolete terms and no longer used in modern psychiatry.↩︎
If the notation in the R code looks confusing, see Example 1.10 for an example of accessing individual values within a data frame.↩︎
In reality, discrete variables can actually take on more than integer values, but in this book, we will restrict our attention to discrete variables that take on integer and typically whole number values.↩︎
You can see this article if you want to investigate this curious piece of history further.↩︎
In reality, discrete variables can actually take on more than integer values, but in this book, we will restrict our attention to discrete variables that take on integer and typically whole number values.↩︎
For the more mathematically curious reader, this is the same as defining a set based on a rule that dictates whether an item is an element of the set or not.↩︎