Preface

Figure 0.2: A Statypus Reading its Favorite Book in Nature
A platypus may enter a state of something called “torpor” during the colder months, which is similar to hibernation, for periods of up to six days at a time.4
0.1 Philosophy
Students should be critical consumers of statistics. This means that they are able to discuss possible issues with statistical experiments/studies and question the validity of data.
Students should be thoughtful producers of statistics. This means that they are able to make calculations and produce inferences that they can explain clearly to others.
Instruction should focus on the interpretation of statistical calculations rather than testing a student’s ability to memorize equations and formulas.
We learn by making mistakes. I have taught children that mistakes are an integral part of the learning process and not something to be seen as detrimental. Students should Dare to be Wrong and interact in a way that they do not feel encumbered by the fear of making mistakes. Too many students sit quietly unable to bring themselves to ask the many questions they do have. My classroom and this book are meant to be safe places where all ideas are entertained and where no one is ever ridiculed for making a mistake.
Computers should be used as an integral tool of instruction and concepts should be taught through the lens of using a computer.
0.2 AI Use in Elementary Undergraduate Statistics
The world of statistics education is changing fast as AI tools that can generate computer code have become incredibly tempting to the modern student. These powerful assistants churn out code scripts at frightening speed, making many wonder if learning to code manually is even necessary anymore. But for elementary undergraduate students, relying on AI for coding can actually block them from building crucial foundational skills.
We should apply the same pedagogical philosophy to coding in elementary undergraduate statistics. When students are first grappling with data manipulation, statistical analysis, and visualization using programming languages such as R, writing code by hand is absolutely essential. This hands-on path of discovery compels them to:
Grasp Core Logic: When they manually tackle datasets, interact with the command line, and improve their syntax, students are forced to think critically about every step. They have to learn how the computer processes instructions and (hopefully) move beyond just searching for “the answer.”
Learn From Mistakes: In coding, errors are simply part of the process. When students write their own code, they have to debug it, scouring over their work to pinpoint and fix problems. This back-and-forth process isn’t just about coding. It is invaluable for building problem-solving abilities and confidence that reach far beyond the screen.
Build Foundational Understanding: When students truly understand why a piece of code functions (or fails), it deepens their appreciation for statistical concepts. Manually walking through the code for a confidence interval solidifies the idea of parameter estimation in a way that simply asking an AI for the code never could.
Avoid the “Black Box” Mentality: If AI constantly churns out code, students might start seeing the programming environment as a “black box” – a mysterious device that just spits out answers without the user needing to understand how. This can stunt genuine learning and the ability to critically evaluate results.
Relying on AI code generation too early can effectively short-circuit this crucial learning process. Students might get proficient at prompting AI to get the output they want, but then struggle to interpret, tweak, or debug the generated code when new problems or surprises pop up. They may find it hard to adapt their skills to different situations or fresh coding challenges because they simply don’t have the core skills. Plus, they might not even realize if the “correct” answer from the AI truly matches what they were asking, or if there’s a subtle misunderstanding between their thoughts and the AI’s output.
AI is certainly a powerful tool in advanced computational statistics and data science. For experienced users, AI can accelerate development, automate repetitive tasks, and even suggest novel approaches. However, just as a seasoned engineer uses advanced simulation software only after learning the principles of physics, students should leverage AI only after they’ve built a solid foundation in basic programming logic and statistical computation through hands-on practice.
To wrap things up, the way children learn arithmetic offers a cautionary tale for elementary undergraduate statistics students diving into coding. Many people struggle with math concepts in algebra and beyond because of their lack of foundational understanding of arithmetic. When elementary statistics students truly grapple with the syntax, logic, and debugging hurdles of basic programming, they build a deep, resilient understanding that goes way beyond just generating output. This foundational experience is absolutely critical for sharpening their critical thinking, developing genuine problem-solving skills, and ultimately, for becoming both critical consumers and thoughtful producers in our increasingly data-driven world.
AI unarguably offers an incredible and ever-improving set of tools, but it is most beneficial in the hands of free thinkers and shouldn’t be allowed to stunt anyone’s learning of statistics.
0.3 Acknowledgements
The process of writing a book is a very humbling experience, especially when it is someone’s first book. My own personal experience was a winding walk through periods of absolute confidence and determination as well as darker periods of self-doubt and lack of belief that I could accomplish this goal. There is absolutely no way that this book would exist without the help of so many people.
I am forever indebted to Darrin Speegle and Bryan Clair whose book Probability, Statistics, and Data was a very early inspiration to develop my own materials which eventually turned into this book. The first part of this book, Section 6.3, is a modified version of Section 2.2 of Darrin and Bryan’s book. In addition, Chapter 1 of Statypus is a modified version of their Chapter 1. I have learned so many techniques and been pulled away from so many bad habits by learning from both of them. From simply allowing me to see how they coded certain parts of their book to countless discussions about how certain topics should be taught, this book would not exist without them. Darrin is definitely responsible for pushing me to rethink introductory statistics within a mindset of using the technology correctly. I am sure Darrin sometimes dreaded seeing me walk towards his office door with another question that was painfully simple for him, but I could not have done this without his help. Bryan was the person who literally told me to turn my early PDF documents into a bookdown. Without that nudge, this project would never have gotten to this point. He has also been there for numerous impromptu conversations about statistical issues and concepts as well as many technical explanations to allow me to create this book. I would have no idea what a “cascading style sheet” was without him, but Bryan offered an example of his own and helped me understand how to use it.
I would also like to thank Anneke Bart. She was the chair of the Mathematics and Statistics Department during the time this book was written and has always been a true friend. She offered countless hours within her busy schedule to allow me to discuss this project and also used some of my early PDF documents in her own course.
While I am absolutely appreciative of the efforts and time from each of these three amazing professors, I wish to thank them more for offering me respect and treating me as a friend and colleague regardless of how lost in the weeds I would often become.
I am also extremely thankful to my family. My focus was definitely pulled away from them for the past year to a degree which I have often regretted. Even when I was able to put the laptop down and be a husband or father, I know I did not always offer them the focus that they rightfully deserve. My children each make an appearance in this book and I hope that they can accept that as my apologies for not always actively listening to them like I should have.
My wife, Alie, has been the absolute bedrock from which I have drawn strength throughout the process of writing this book. Alie is one of the kindest individuals I have ever met and absolutely brilliant (other than in her choice of spouse). There has never been a single moment where she has not shown absolute belief in me and I can say unequivocally that I could not have done this without her. Alie, I thank you for putting up with the subpar husband I have been as I have worked on this book. I don’t think any man is deserving of you, but I am so grateful to be the person you have chosen to spend your life with.
0.4 Parts of this Book
As you navigate through the numerous webpages that make up Statypus, you will encounter many different colored boxes which set aside a portion of the screen for a specific task. Knowing the purpose of these different boxes can assist you in understanding what is being talked about and expedite your ability to find something you are looking for within Statypus. The following gives examples of the different types of boxes you will encounter here.
0.4.1 Alert
As mentioned earlier, this book accepts mistakes as an important and unavoidable part of the learning process. That being said, the purpose of making mistakes is to be able to avoid making similar mistakes in the future, and where possible a good teacher tries to offer a cautionary tale about common mistakes students make. Red Alert boxes, like the one shown below from Section 9.3.2, do just this. They offer a cautionary warning to be aware of easy mistakes that can be made while covering certain topics.
It is important to not confuse the different uses of \(p\) here. We have a population proportion which we denoted \(p\) and now a measure of how strong the evidence from a sample is, which we call the \(p\)-value. We also use \(P\) for probability calculations. We will always write out \(p\)-value to avoid confusion as much as possible.
0.4.2 Big Idea
There are a lot of students who can memorize nearly any equation given to them, and many can use them effortlessly and without error. However, a lot of students struggle to answer questions of the form “What does that mean?” when asked to put things into context. A good instructor should facilitate a student’s deeper understanding of what something means and not simply recite notes from a piece of paper, which a student then transcribes to their own paper (or iPad?) and later finds to differ very little from the textbook. This book attempts to do just that with the green Big Idea boxes like the one below from Section 9.3.2. Big Idea boxes try to offer a heuristic view of a complex concept.
Loosely, we can take the \(p\)-value to represent how much you can still believe the assumption made in \(H_0\) after examining the evidence against it. If we believe in the statement(s) made in \(H_0\) prior to running the test, then we can view the \(p\)-value as how much belief we still have in \(H_0\) after examining the evidence provided by our sample. We will soon discuss how little belief is acceptable before we are forced to reject \(H_0\). However, it is important to also remember that this is just a loose way to make sense of it and that the technical meaning is given in Definition 9.10.
0.4.3 Code Template
Coding is hard and computers don’t care about what you “meant.” They only care what you explicitly tell them to do. For example, the code view( mtcars ) will cause an error in R because it should be View( mtcars ), with an upper case V. Human beings (even lowly textbook authors) are not perfect and typos are just a matter of “when” and not “if.” To minimize the number of simple typographical errors students encounter, it is helpful if they can “borrow” code that they know will work and be able to adapt it to their own needs rather than asking them to write new code on their own. Green Code Template boxes such as the one below from Section 3.3.3 do just this. The user should be able to automatically copy the contents of the lighter colored boxes by placing their cursor over the upper right hand portion of the box. This allows students to easily migrate code from Statypus directly into their own work with minimal concerns about typing errors.
To make a stem and leaf plot of a vector x, you use the code:
stem( x ) #Will only run if x is defined
or if we have a column of a dataframe, we would use:
stem( df$Col ) #Will only run if df is appropriately defined
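As a hedged illustration of adapting this template, the sketch below defines x and df from the built-in mtcars data so that both lines can run. The choice of mtcars and its mpg column is an assumption made for this example, not something the template specifies. It also shows what happens when the capitalization of a function name is changed: Stem is a deliberately wrong name used only to trigger the error.

```r
# Example data: mtcars ships with R; using its mpg column is an illustrative assumption
x <- mtcars$mpg
df <- mtcars

stem( x )        # Runs because x is now defined
stem( df$mpg )   # Same plot, taken from a column of a data frame

# R is case sensitive, so a capitalized name refers to a different (undefined) function.
# Stem is intentionally wrong; tryCatch captures the error message instead of stopping.
msg <- tryCatch( Stem( x ), error = function( e ) conditionMessage( e ) )
msg
```

Only the argument inside the parentheses had to change to adapt the template; the function name stays exactly as written, including its lower case s.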
0.4.4 Data Download
Statistics can be simply thought of as the science of working with data to understand our world. For most people, there is no need to discuss the concepts in a statistics course unless it relates to some sort of data. Getting that data in a way that is easily used can sometimes be tricky and with a myriad of formats out there, modern computing has made this even trickier in some ways. However, we try to minimize this with purple Data Download boxes such as the one below from Example 4.1. These offer code that you can copy and paste which will automatically download the data from the Statypus servers and move it into your RStudio environment.
BabyData1 <- read.csv( "https://statypus.org/files/BabyData1.csv" )
0.4.5 Example
If mistakes are an important part of learning, then examples are even more important. Blue Example boxes like the one below, Example 4.4, offer us a way to keep track of when we leave abstract concepts and begin to work on a specific application.
Example 0.1 If we wanted to find the range of birth masses of babies in our sample, we can do this with the following code.
range( BabyData1$weight )
## [1] 907 4825
This shows us that the smallest baby in our sample had a mass of 907 grams while the largest had a mass of 4825 grams.
The range length is thus \(4825 - 907 = 3918\) grams.
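The subtraction can also be left to R. The sketch below uses a small made-up vector, masses, as a stand-in so that it runs without downloading anything. Only its smallest and largest values are taken from the output above; the values in between are invented for illustration.

```r
# Hypothetical stand-in for BabyData1$weight: only 907 and 4825 come from the example
masses <- c( 907, 2450, 3310, 3675, 4825 )

range( masses )           # Smallest and largest values
diff( range( masses ) )   # Largest minus smallest, the range length
```

With the real data loaded, the same idea would be diff( range( BabyData1$weight ) ).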
0.4.6 Let’s Explore
Most good mathematicians can “see” math happen in their heads. For example, envision two vertical poles situated a certain distance apart. Further imagine that a wire is connected from the top of each pole to the base of the other. The two wires would obviously cross and a simple (at least simple to ask) question would be: “How high is the intersection of the two wires in terms of the distances and lengths of the poles and wires?” If you had a Ph.D. in geometry, you might be able to see this entire image in your head, but most people would need to make a sketch of the figure to understand what is going on. However, this problem requires us to consider the figure where we do not know any of the distances or lengths. Light blue Let’s Explore boxes attempt to offer just such a tool. The exploration below gives an interactive visualization of exactly the problem we just set up here. The answer, however, is left for you to figure out!5
0.4.7 New Function
Using computer software such as R requires us to use functions that are built into its system. Pink New Function boxes like the one below from Section 5.1 give a place to begin your understanding of how these software functions work. They are meant to offer the important information about the function and how to use it before we begin to actually enter values or data into them.
The syntax of plot is
plot( x, y, type )
where the parameters are:
type: Sets what type of plot should be drawn. See ?plot for a full list of options.
The first vector entered, x, is graphed on the horizontal axis and the second vector entered, y, is graphed on the vertical axis.
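A minimal sketch of calls that follow this syntax, assuming the built-in mtcars data as example input (the choice of the wt and mpg columns is ours, made only for illustration):

```r
x <- mtcars$wt    # Car weight, drawn on the horizontal axis
y <- mtcars$mpg   # Miles per gallon, drawn on the vertical axis

plot( x, y, type = "p" )   # "p" draws points, which is also the default
plot( x, y, type = "l" )   # "l" joins the points with line segments in the order given
```

Because the points in mtcars are not sorted by weight, the second plot zigzags, which is one way to see that type changes only how the same pairs are drawn.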
0.4.8 New Functions
Each chapter begins with a list of the new functions it will introduce with a pink New Functions box like the one found at the beginning of Chapter 3. This offers students (and instructors) a quick place to reference where different functions were introduced. Functions in these boxes should be in the order that they appear within the text.
table(): Uses cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
proportions(): Returns conditional proportions given entries of x divided by the appropriate sum(s).
barplot(): Creates a bar plot with vertical or horizontal bars.
hist(): Computes a histogram of the given data values.
stem(): Produces a stem-and-leaf plot of the values in x.
plot(): Draw a scatter plot with decorations such as axes and titles in the active graphics window.
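For a rough sense of how a few of these functions look in use, here is a short sketch. It assumes the built-in mtcars data and its cyl and mpg columns, which are our choice of example and not necessarily what Chapter 3 uses. Note that proportions() requires a reasonably recent version of R (4.0.1 or later); prop.table() is the older name for the same calculation.

```r
counts <- table( mtcars$cyl )   # Counts of cars with 4, 6, and 8 cylinders
counts
proportions( counts )           # Each count divided by the total number of cars

barplot( counts )               # One bar per number of cylinders
hist( mtcars$mpg )              # Histogram of miles per gallon
```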
0.4.9 Now It’s Your Turn!
If you read an entire book on the theory of how to properly shoot a basketball, would that improve your ability to make a free throw? The answer is probably not, unless you actually took time to also practice the concepts you are learning. Now there are exercises at the end of every chapter (everyone loves homework), but the yellow Now It’s Your Turn! boxes, like the one below from Section 3.2.2, offer a low stakes way for students to check if they are grasping the material as they go.
Make a comparative bar chart of the number of gears a car had based on the shape of the engine. The two variables are gear, which gives the number of forward gears a car had, and vs, which tells whether an engine was V-shaped (value of 0) or straight (value of 1). Try changing the order of the variables and playing with the beside parameter. Can you see a relationship between the variables?
0.4.10 Platypus Oddity
The platypus is weird… there’s no way to get around that. However, so are most mathematicians. We celebrate the uniqueness of the platypus with a gray Platypus Oddity box at the beginning of each chapter, like the one found below and at the beginning of this chapter, immediately after sharing a glimpse of a Statypus (a statistics loving platypus) entirely to bring humor and happiness to the reader!
A platypus may enter a state of something called “torpor” during the colder months, which is similar to hibernation, for periods of up to six days at a time.
0.4.11 Definition
It is “turtles all the way down” as the saying goes. To make any headway in mathematics or statistics, we must begin with defining certain things and the green Definition boxes, like the one below for Definition 4.1, do just that. While not as fun as a fun fact about an egg-laying mammal, definitions cannot be left out. Some definitions here may not match ones you may have learned in the past and that is fine. It is up to the author of each book to define what terms mean within the pages (webpages, I guess) of their book.
Definition 0.1 Given a vector \({\bf x} = ( x_1, x_2, \ldots, x_n )\) having \(n\) values, we can define the arithmetic mean or simply mean of \({\bf x}\), which we denote as \(\bar{x}\) if \({\bf x}\) is a sample or \(\mu\) if \({\bf x}\) is the entire population, as follows. \[\bar{x} \text{ or }\mu= \frac{1}{n} \sum_{i = 1}^n x_i = \frac{1}{n} \left( x_1 + x_2 + \cdots + x_n \right) = \frac{ x_1 + x_2 + \cdots + x_n }{n}.\]
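The formula can be checked directly in R. The small sketch below uses a made-up vector x (an assumption for illustration only) and compares the sum divided by \(n\) with R's built-in mean() function.

```r
x <- c( 2, 4, 4, 7, 13 )   # Hypothetical values, chosen only for illustration
n <- length( x )

by_hand <- sum( x ) / n    # (x_1 + x_2 + ... + x_n) / n
built_in <- mean( x )      # R's own calculation of the arithmetic mean

all.equal( by_hand, built_in )   # TRUE when the two calculations agree
```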
0.4.12 Remark
Sometimes a point needs to be made and stand out, but it’s not a potential mistake nor does it fit into any of the other categories of boxes we have here at Statypus. Orange Remark boxes like the one below from Section 9.3.2 fill this gap. They will contain a wide array of ideas and concepts that students should pay attention to.
Remark. The following definition is an interpretation of the “informal” definition of a \(p\)-value as given by the American Statistical Association6. Unfortunately, a rigorous definition is not easily given, nor is its interpretation fully agreed upon.
0.4.13 Theorem
“Mathematicians turn coffee into theorems” is an old adage and isn’t necessarily untrue, although the author of this book prefers tea! The theorems of this book will appear in bright green Theorem boxes like the one below for Theorem 5.1. If it’s a theorem, it’s probably important.
Theorem 0.1 The correlation, \(r\), between two vectors, \({\bf x}\) and \({\bf y}\), satisfies the following:
\(-1 \leq r \leq 1\)
\(|r| =1\) means that the ordered pairs \(\big\{ (x_i, y_i)\big\}\) are collinear.
Correlation is a symmetric operation. That is, the correlation of \({\bf x}\) and \({\bf y}\) is the same as the correlation of \({\bf y}\) and \({\bf x}\), i.e. the order of the vectors does not matter.
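These properties can be checked numerically, which is not a proof but can make the statements more concrete. The sketch below assumes the built-in mtcars data and a constructed vector y_line; both are chosen only for illustration.

```r
x <- mtcars$wt
y <- mtcars$mpg

r <- cor( x, y )
abs( r ) <= 1                            # r lies between -1 and 1
isTRUE( all.equal( r, cor( y, x ) ) )    # Swapping the vectors gives the same correlation

# Points built to lie exactly on a line with negative slope
y_line <- 3 - 2 * x
cor( x, y_line )                         # Equals -1 up to floating point rounding
```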
5. If you really must know the solution to the above problem, you can look here.
6. See this article for more information and further discussion about the difficulty in using if not just defining a \(p\)-value.