Preface

Figure 0.2: A Statypus Reading its Favorite Book in Nature

A platypus may enter a state of something called “torpor” during the colder months, which is similar to hibernation, for periods of up to six days at a time.⁴

0.1 Philosophy

This book has been written for an introductory course in computer-aided statistics that does not rely on any calculus. All code has been written so as to be executable in a “fresh” R environment by relying on functions and techniques that do not require any additional packages. The few exceptions are the original functions introduced in Chapters 5 and 10 as well as a few others in the later chapters. However, care has been given to make the code as readable as possible so that the enthusiastic reader can follow as much as is possible. Data manipulation is done with “base” R techniques which may appear cumbersome to advanced users who are familiar with packages such as dplyr or ggplot2.

Here at Statypus, we strive to offer a supportive learning environment which is anchored on the following principles:

Students should be critical consumers of statistics. This means that they are able to discuss possible issues with statistical experiments/studies and question the validity of data.
Students should be thoughtful producers of statistics. This means that they are able to make calculations and produce inferences that they can explain clearly to others.
Instruction should focus on the interpretation of statistical calculations rather than testing a student’s ability to memorize equations and formulas.
We learn by making mistakes. I have taught children that mistakes are an integral part of the learning process and not something to be seen as detrimental. Students should Dare to be Wrong and interact in a way that they do not feel encumbered by the fear of making mistakes. Too many students sit quietly unable to bring themselves to ask the many questions they do have. My classroom and this book are meant to be safe places where all ideas are entertained and where no one is ever ridiculed for making a mistake.
Computers should be used as an integral tool of instruction and concepts should be taught through the lens of using a computer.

0.2 AI Use in Elementary Undergraduate Statistics

The world of statistics education is changing fast as AI tools that can generate computer code have become incredibly tempting to the modern student. These powerful assistants churn out code scripts at frightening speed, making many wonder if learning to code manually is even necessary anymore. But for elementary undergraduate students, relying on AI for coding can actually block them from building crucial foundational skills.

We should apply the same pedagogical philosophy to coding in elementary undergraduate statistics. When students are first grappling with data manipulation, statistical analysis, and visualization using programming languages such as R, writing code by hand is absolutely essential. This hands-on path of discovery compels them to:

Grasp Core Logic: When they manually tackle datasets, interact with the command line, and improve their syntax, students are forced to think critically about every step. They have to learn how the computer processes instructions and (hopefully) move beyond just searching for “the answer.”
Learn From Mistakes: In coding, errors are simply part of the process. When students write their own code, they have to debug it, scouring over their work to pinpoint and fix problems. This back-and-forth process isn’t just about coding. It is invaluable for building problem-solving abilities and confidence that reach far beyond the screen.
Build Foundational Understanding: When students truly understand why a piece of code functions (or fails), it deepens their appreciation for statistical concepts. Manually walking through the code for a confidence interval solidifies the idea of parameter estimation in a way that simply asking an AI for the code never could.
Avoid the “Black Box” Mentality: If AI constantly churns out code, students might start seeing the programming environment as a “black box” – a mysterious device that just spits out answers without the user needing to understand how. This can stunt genuine learning and the ability to critically evaluate results.

Relying on AI code generation too early can effectively short-circuit this crucial learning process. Students might get proficient at prompting AI to get the output they want, but then struggle to interpret, tweak, or debug the generated code when new problems or surprises pop up. They may find it hard to adapt their skills to different situations or fresh coding challenges because they simply don’t have the core skills. Plus, they might not even realize if the “correct” answer from the AI truly matches what they were asking, or if there’s a subtle misunderstanding between their thoughts and the AI’s output.

AI is certainly a powerful tool in advanced computational statistics and data science. For experienced users, AI can accelerate development, automate repetitive tasks, and even suggest novel approaches. However, just as a seasoned engineer uses advanced simulation software only after learning the principles of physics, students should leverage AI only after they’ve built a solid foundation in basic programming logic and statistical computation through hands-on practice.

To wrap things up, the way children learn arithmetic offers a cautionary tale for elementary undergraduate statistics students diving into coding. Many people struggle with math concepts in algebra and beyond because of their lack of foundational understanding of arithmetic. When elementary statistics students truly grapple with the syntax, logic, and debugging hurdles of basic programming, they build a deep, resilient understanding that goes way beyond just generating output. This foundational experience is absolutely critical for sharpening their critical thinking, developing genuine problem-solving skills, and ultimately, for becoming both critical consumers and thoughtful producers in our increasingly data-driven world.

AI unarguably offers an incredible and ever-improving set of tools, but it is most beneficial as a tool used by free thinkers and shouldn’t be allowed to stunt someone’s learning of statistics.

0.3 Acknowledgements

The process of writing a book is a very humbling experience, especially someone’s first book. My own personal experience was a winding walk through periods of absolute confidence and determination as well as darker periods of self-doubt and lack of belief to accomplish this goal. There is absolutely no way that this book would exist without the help of so many people.

I am forever indebted to Darrin Speegle and Bryan Clair whose book Probability, Statistics, and Data was a very early inspiration to develop my own materials which eventually turned into this book. The first part of this book, Section 6.3, is a modified version of Section 2.2 of Darrin and Bryan’s book. In addition, Chapter 1 of Statypus is a modified version of their Chapter 1. I have learned so many techniques and been pulled away from so many bad habits by learning from both of them. From simply allowing me to see how they coded certain parts of their book to countless discussions about how certain topics should be taught, this book would not exist without them. Darrin is definitely responsible for pushing me to rethink introductory statistics within a mindset of using the technology correctly. I am sure Darrin sometimes dreaded seeing me walk towards his office door with another question that was painfully simple for him, but I could not have done this without his help. Bryan was the person who literally told me to turn my early PDF documents into a bookdown. Without that nudge, this project would never have gotten to this point. He has also been there for numerous impromptu conversations about statistical issues and concepts as well as many technical explanations to allow me to create this book. I would have no idea what a “cascading style sheet” was without him, but Bryan offered an example of his own and helped me understand how to use it.

I would also like to thank Anneke Bart. She was the chair of the Mathematics and Statistics Department during the time this book was written and has always been a true friend. She offered countless hours within her busy schedule to allow me to discuss this project and also used some of my early PDF documents in her own course.

While I am absolutely appreciative of the efforts and time from each of these three amazing professors, I wish to thank them more for offering me respect and treating me as a friend and colleague regardless of how lost in the weeds I would often become.

I am also extremely thankful to my family. My focus was definitely pulled away from them for the past year to a degree which I have often regretted. Even when I was able to put the laptop down and be a husband or father, I know I did not always offer them the focus that they rightfully deserve. My children each make an appearance in this book and I hope that they can accept that as my apologies for not always actively listening to them like I should have.

My wife, Alie, has been the absolute bedrock from which I have drawn strength throughout the process of writing this book. Alie is one of the kindest individuals I have ever met and absolutely brilliant (other than in her choice of spouse). There has never been a single moment where she has not shown absolute belief in me and I can say unequivocally that I could not have done this without her. Alie, I thank you for putting up with the subpar husband I have been as I have worked on this book. I don’t think any man is deserving of you, but I am so grateful to be the person you have chosen to spend your life with.

0.4 Parts of this Book

As you navigate through the numerous webpages that make up Statypus, you will encounter many different colored boxes which set aside a portion of the screen for a specific task. Knowing the purpose of these different boxes can assist you in understanding what is being talked about and expedite your ability to find something you are looking for within Statypus. The following gives examples of the different types of boxes you will encounter here.

0.4.1 Alert

As mentioned earlier, this book accepts mistakes as an important and unavoidable part of the learning process. That being said, the purpose of making mistakes is to be able to avoid making similar mistakes in the future, and where possible, a good teacher tries to offer a cautionary tale about common mistakes students make. Red Alert boxes, like the one shown below from Section 9.3.2, do just this. They offer a cautionary warning to be aware of easy mistakes that can be made while covering certain topics.

It is important to not confuse the different uses of \(p\) here. We have a population proportion which we denoted \(p\) and now a measure of how strong the evidence from a sample is, which we call the \(p\)-value. We also use \(P\) for probability calculations. We will always write out \(p\)-value to avoid confusion as much as possible.

0.4.2 Big Idea

There are a lot of students who can memorize nearly any equation given to them and many can use them effortlessly and without error. However, a lot of students struggle to answer questions of the form, “What does that mean?”, when asked to put things into context. A good instructor should be the person who can facilitate a student’s deeper understanding of what something means and not simply recite notes from a piece of paper which a student then transcribes to their own paper (or iPad?) which they then compare to their textbook and find very little difference. This book attempts to do just that with the green Big Idea boxes like the one below from Section 9.3.2. Big Idea boxes try to offer concepts in a way that is meant as a sort of heuristic view of a complex concept.

Loosely, we can take the \(p\)-value to represent how much you can still believe the assumption made in \(H_0\) after examining the evidence against it. If we believe in the statement(s) made in \(H_0\) prior to running the test, then we can view the \(p\)-value as how much belief we still have in \(H_0\) after examining the evidence provided by our sample. We will soon discuss how little belief is acceptable before we are forced to reject \(H_0\). However, it is important to also remember that this is just a loose way to make sense of it and that the technical meaning is given in Definition 9.10.

0.4.3 Code Template

Coding is hard and computers don’t care about what you “meant.” They only care what you explicitly tell them to do. For example, the code view( mtcars ) will cause an error in R because it should be View( mtcars ), with an upper case V. Human beings (even lowly textbook authors) are not perfect and typos are just a matter of “when” and not “if.” To minimize the number of simple typographical errors students encounter, it is helpful if they can “borrow” code that they know will work and be able to adapt it to their own needs rather than asking them to write new code on their own. Green Code Template boxes such as the one below from Section 3.3.3 do just this. The user should be able to automatically copy the contents of the lighter colored boxes by placing their cursor over the upper right hand portion of the box. This allows students to easily migrate code from Statypus directly into their own work with minimal concerns about typing errors.

To make a stem and leaf plot of a vector x, you use the code:

#Will only run if x is defined
stem( x )

or if we have a column of a dataframe, we would use:

#Will only run if df is appropriately defined
stem( df$Col )
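
As a minimal sketch of how the template above might be adapted, the code below first defines a small made-up vector x (hypothetical values that do not come from Statypus) and then uses the mpg column of the built-in mtcars data frame in place of df$Col.

```r
# A made-up vector of exam scores (hypothetical values for illustration)
x <- c( 62, 67, 71, 74, 75, 78, 80, 83, 85, 88, 91, 94 )

# Stem and leaf plot of a vector
stem( x )

# Stem and leaf plot of a column of a data frame,
# here the mpg column of the built-in mtcars data
stem( mtcars$mpg )
```

Because both x and mtcars exist once this code has been run, neither call should produce the "object not found" error that the comments in the template warn about.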

0.4.4 Data Download

Statistics can be simply thought of as the science of working with data to understand our world. For most people, there is no need to discuss the concepts in a statistics course unless it relates to some sort of data. Getting that data in a way that is easily used can sometimes be tricky and with a myriad of formats out there, modern computing has made this even trickier in some ways. However, we try to minimize this with purple Data Download boxes such as the one below from Example 4.1. These offer code that you can copy and paste which will automatically download the data from the Statypus servers and move it into your RStudio environment.

Use the following code to download BabyData1.

BabyData1 <- read.csv( "https://statypus.org/files/BabyData1.csv" )
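
After any download like this, it can be worth confirming that the data arrived as expected. The sketch below uses a tiny made-up data frame as a stand-in so that it runs without an internet connection. The name BabyDemo and its values are assumptions for illustration only; the column name weight is borrowed from the Example box in the next subsection.

```r
# Stand-in for a downloaded file: read.csv() can also read from a text string.
# BabyDemo and its five values are hypothetical, not the real BabyData1.
BabyDemo <- read.csv( text = "weight\n3120\n2890\n3475\n907\n4825" )

# Look at the first few rows
head( BabyDemo )

# See the type of each column
str( BabyDemo )

# Number of rows and columns
dim( BabyDemo )
```

The same three checks (head, str, and dim) could be applied to BabyData1 itself once it has been downloaded.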

0.4.5 Example

If mistakes are an important part of learning, then examples are even more important. Blue Example boxes like the one below, Example 4.4, offer us a way to keep track of when we leave abstract concepts and begin to work on a specific application.

Example 0.1 If we wanted to find the range of birth masses of babies in our sample, we can do this with the following code.

range( BabyData1$weight )

## [1]  907 4825

This shows us that the smallest baby in our sample had a mass 907 grams while the largest had a mass of 4825 grams.

The range length is thus \(4825 - 907 = 3918\) grams.
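
The subtraction can also be left to R. The sketch below uses a short made-up vector, masses, whose values are hypothetical apart from sharing the smallest and largest masses shown in the output above.

```r
# Made-up birth masses in grams (hypothetical, chosen to have the same
# smallest and largest values as the output above)
masses <- c( 907, 2890, 3120, 3475, 4825 )

# range() returns the minimum and maximum as a vector of length two
range( masses )

# diff() subtracts the first entry from the second, giving the range length
diff( range( masses ) )
```

The last line returns 3918, which agrees with the subtraction done by hand.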

0.4.6 Let’s Explore

Most good mathematicians can “see” math happen in their heads. For example, envision two vertical poles situated a certain distance apart. Further imagine that a wire is connected from the top of each pole to the base of the other. The two wires would obviously cross and a simple (at least simple to ask) question would be: “How high is the intersection of the two wires in terms of the distances and lengths of the poles and wires?” If you had a Ph.D. in geometry, you may be able to see this entire image in your head, but most people would need to make a sketch of the figure to understand what is going on. However, this problem requires us to consider the figure where we do not know any of the distances or lengths. Light blue Let’s Explore boxes attempt to offer just such a tool. The exploration below gives an interactive visualization of exactly the problem we just set up here. The answer, however, is left for you to figure out!⁵

0.4.7 New Function

Using computer software such as R requires us to use functions that are built into its system. Pink New Function boxes like the one below from Section 5.1 give a place to begin your understanding of how these software functions work. They are meant to offer the important information about the function and how to use it before we begin to actually enter values or data into them.

The syntax of plot is

plot( x, y, type )

where the arguments are:

type: Sets what type of plot should be drawn. See ?plot for a full list of options

The first vector entered, x, is graphed on the horizontal axis and the second vector entered, y, is graphed on the vertical axis.
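
As a rough illustration of the syntax described above, the following sketch plots two made-up vectors (hypothetical values) using three common choices for type. It is only meant to show how the arguments fit together and is not taken from Section 5.1.

```r
# Two made-up vectors of equal length (hypothetical values)
x <- c( 1, 2, 3, 4, 5 )
y <- c( 2.1, 3.9, 6.2, 7.8, 10.1 )

# type = "p" draws points only (this is the default)
plot( x, y, type = "p" )

# type = "l" draws connecting lines only
plot( x, y, type = "l" )

# type = "b" draws both points and lines
plot( x, y, type = "b" )
```

In each plot the values of x run along the horizontal axis and the values of y along the vertical axis.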

0.4.8 New Functions

Each chapter begins with a list of the new functions it will introduce with a pink New Functions box like the one found at the beginning of Chapter 3. This offers students (and instructors) a quick place to reference where different functions were introduced. Functions in these boxes should be in the order that they appear within the text.

table(): Uses cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
proportions(): Returns conditional proportions given entries of x divided by the appropriate sum(s).
barplot(): Creates a bar plot with vertical or horizontal bars.
hist(): Computes a histogram of the given data values.
stem(): Produces a stem-and-leaf plot of the values in x.
plot(): Draw a scatter plot with decorations such as axes and titles in the active graphics window.
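
To give a feel for how the first two functions in a list like this work together, here is a small sketch using the cyl column (number of cylinders) of the built-in mtcars data. The name cyl_counts is chosen only for this illustration.

```r
# Count how many cars in mtcars have 4, 6, or 8 cylinders
cyl_counts <- table( mtcars$cyl )
cyl_counts

# Convert the counts into proportions of the whole.
# proportions() is the newer name for prop.table(); very old
# installations of R may only have the latter.
proportions( cyl_counts )
```

The proportions always add up to 1, which makes for a quick sanity check on the output.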

0.4.9 Now It’s Your Turn!

If you read an entire book on the theory of how to properly shoot a basketball, would that improve your ability to make a free throw? The answer is probably not unless you actually took time to also practice the concepts you are learning. Now there are exercises at the end of every chapter (everyone loves homework), but the yellow Now It’s Your Turn! boxes, like the one below from Section 3.2.2, offer a low-stakes way for students to check if they are grasping the material as they go.

Make a comparative bar chart of the number of gears a car had based on the shape of the engine. The two variables are gear which gives the number of forward gears a car had and vs which tells whether an engine was V-shaped (value of 0) or straight (value of 1). Try changing the order of the variables and playing with the beside argument. Can you see a relationship between the variables?

0.4.10 Platypus Oddity

The platypus is weird… there’s no way to get around that. However, so are most mathematicians. We celebrate the uniqueness of the platypus with a gray Platypus Oddity box at the beginning of each chapter immediately after sharing an image of a Statypus (a statistics loving platypus) entirely to bring humor and happiness to the reader! The following fact does not appear in any chapter, but is tucked away here for the most invested reader.

Platypuses are five times as sexy as humans. Well, at least “chromosomally.” A platypus has ten sex chromosomes while a human has only two.

Male platypuses have the pattern

\[\text{XYXYXYXYXY}\]

while females have the pattern

\[\text{XXXXXXXXXX}.\]

0.4.11 Definition

It is “turtles all the way down” as the saying goes. To make any headway in mathematics or statistics, we must begin with defining certain things and the green Definition boxes, like the one below for Definition 4.1, do just that. While not as fun as a fun fact about an egg laying mammal, definitions cannot be left out. Some definitions here may not match ones you may have learned in the past and that is fine. It is up to the author of each book to define what terms mean within the pages (webpages, I guess) of their book.

Definition 0.1 Given a vector \({\bf x} = ( x_1, x_2, \ldots, x_n )\) having \(n\) values, we can define the arithmetic mean or simply mean of \({\bf x}\), which we denote as \(\bar{x}\) if \({\bf x}\) is a sample or \(\mu\) if \({\bf x}\) is the entire population, as follows. \[\bar{x} \text{ or }\mu= \frac{1}{n} \sum_{i = 1}^n x_i = \frac{1}{n} \left( x_1 + x_2 + \cdots + x_n \right) = \frac{ x_1 + x_2 + \cdots + x_n }{n}.\]
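
The formula in this definition can be checked directly in R. The sketch below uses a made-up vector (hypothetical values) and compares the result of adding the values and dividing by \(n\) with the built-in mean function.

```r
# A made-up vector of n = 6 values (hypothetical)
x <- c( 4, 8, 15, 16, 23, 42 )

# The number of values, n
n <- length( x )

# The mean computed straight from the definition: add them up, divide by n
sum( x ) / n

# The same value using the built-in function
mean( x )
```

Both lines return 18, since the six values add up to 108.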

0.4.12 Remark

Sometimes a point needs to be made and stand out, but it’s not a potential mistake nor does it fit into any of the other categories of boxes we have here at Statypus. Orange Remark boxes like the one below from Section 9.3.2 fill this gap. They will contain a wide array of ideas and concepts that students should pay attention to.

The following definition is an interpretation of the “informal” definition of a \(p\)-value as given by the American Statistical Association⁶. Unfortunately, a rigorous definition is not easily given, nor is its interpretation fully agreed upon.

0.4.13 Theorem

“Mathematicians turn coffee into theorems” is an old adage and isn’t necessarily untrue, although the author of this book prefers tea! The theorems of this book will appear in bright green Theorem boxes like the one below for Theorem 5.1. If it’s a theorem, it’s probably important.

Theorem 0.1 The correlation, \(r\), between two vectors, \({\bf x}\), and \({\bf y}\), satisfies the following:

\(-1 \leq r \leq 1\)
\(|r| =1\) means that the ordered pairs \(\big\{ (x_i, y_i)\big\}\) are collinear.
Correlation is a symmetric operation. That is, the correlation of \({\bf x}\) and \({\bf y}\) is the same as the correlation of \({\bf y}\) and \({\bf x}\), i.e. the order of the vectors does not matter.
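
The three parts of this theorem can be seen in action with the base R function cor. The vectors below are made up for illustration (hypothetical values), with z built to lie exactly on a line with x.

```r
# Two made-up vectors (hypothetical values)
x <- c( 1, 2, 3, 4, 5 )
y <- c( 2, 1, 4, 3, 6 )

# The correlation lands between -1 and 1
cor( x, y )

# Swapping the order of the vectors gives the same value
cor( y, x )

# z is an exact linear function of x, so the ordered pairs are collinear
# and the correlation is -1 (up to rounding), since the line slopes downward
z <- 3 - 2 * x
cor( x, z )
```

Note that a correlation of exactly 1 or -1 only shows up when one vector is a perfect linear function of the other, as with z here.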

0.5 A Note From the Author

First off, most people will interact with this as a website, but for simplicity’s sake, we may refer to this collection of webpages as a “book.” We may refer to it as “this book,” “Statypus,” or something similar, but any reference to something outside of this document will be given as explicitly as is possible.

This book is the culmination of a project to overhaul the STAT 1300 course at Saint Louis University (SLU). In 2011, the math department elected to require the course be taught using the statistical software known as R. Prior to this, they had used SPSS going back to Version 1 in the 1970s. Members of the math department, in consultation with other departments at the university, decided to make the shift as R began to attract more use among researchers as well as academics. The open source nature of R (and RStudio) removes any economic barrier that commercial software may have and this was also seen as a huge benefit for all parties.

However, after a few years the course was being taught nearly entirely by adjuncts and a question as to the quality and consistency of the course was raised, especially with how R was integrated into the course. After an internal review, it was decided that an overhaul of the course was needed and materials to allow a high quality and consistent level of instruction using R needed to be developed.

Initially, the hope was to find a traditional commercial product to accomplish this goal. There was the expectation that some sort of scaffolded materials to incorporate R may be needed with a traditional textbook, but there was no plan to “start from scratch.” Dozens of introductory statistics books were reviewed from every publisher who could be thought of. However, there was no introductory (non-Calculus based) statistics book that really aligned itself to using R.

There are many outstanding commercial introductory statistics books out there, but most commercial products are designed to reach as many people or universities as they can. A lot of books will walk through the theory and calculations software agnostic and then have examples of how to handle those concepts within different types of technology. After trying to find the right angle to structure this course through trying different ideas and methods, I began to realize that I needed to provide written instruction about how to use R to my students.

Using R had become absolutely interwoven with how I was teaching the course, and I was spending countless classroom hours showing students code in R. I had begun to share the R scripts I created during lectures with students and this worked fairly well, but something more seemed necessary. In the Spring semester of 2024, I adapted a portion of the book Probability, Statistics, and Data by Darrin Speegle and Bryan Clair into a document I called “SimulationProject.” With this PDF, I provided my students with examples of R code which they could copy and paste and use to work problems on their own. Seeing R code in a “published” format seemed to be revolutionary to my students.

That led me to the conclusion that I should be providing professional looking examples of R code to my students if I was going to expect them to be able to use it at the appropriate level. That appropriate level may be different for different people, but my vision is that students in this course should be expected to use and slightly modify code they are given. It is not expected that students will be able to create novel code, so code templates and worked examples are necessary for students to know how to really use R. The next document to develop was called “SRSwSample.” A lot of introductory books discuss the antiquated practice of using random digit tables to find a Simple Random Sample. However, with the course being taught through R, it seemed silly to not leverage the software to do the sampling for us. This was the first document that I created completely from scratch and students liked the ability to review a finished PDF rather than having to simply rely on their ability to take notes on examples of how I used R functions in class.

From here, it was “Game On” and in the Fall semester of 2024 I began to create stand-alone documents to supplement how to use R to do the concepts in each chapter of the textbook we were using. In addition to seeing me work problems in class, students had typed out examples of the use of R for the concepts we were discussing in class. Approximately 10 documents were developed over the course of that semester and the total page count was over 200. What started as a simple “add-on” was turning into a full workbook or possibly even a textbook.

With the guidance of other faculty members, I decided to combine all of these documents into a single location and it was decided that the easiest way to disseminate that collection to our students was via a website. The goal was to create a workbook to exist in tandem with an existing commercial textbook that would allow the commercial textbook to handle the heavy lifting of setting up all of the content while allowing my website to offer students a place to learn how to integrate those concepts in R. Even during the opening week of the Spring 2025 semester I told the publisher of the commercial book we were using that I had no intentions of working without a traditional textbook for this course.

However, dancing with two partners is not easy. It became clear that it was not advantageous to work with students in one book and then pivot to another book to learn “how to do that concept” and then have to go back to the original book for the next concept. In addition, no commercial book has terminology and notation which is consistent with that found in R, so some “translation” became necessary. Translation became more and more teaching and most of the materials developed later in the Fall semester of 2024 were nearly a stand-alone textbook for the material they were teaching.

While it is laborious to write your own stand-alone textbook, it does allow you to tailor it to the exact specifications that you wish. There was no existing commercial book that fit the course that SLU envisioned, so it started to become clear what this project needed to do: write a whole new textbook with the exact vision that the course had. I want my students to interact with statistics and data in a modern way. That is, they should always expect to be able to have computational power nearby, be it a laptop or just their smartphone. The idea of doing anything statistical without technology seems almost laughable in the modern world.

This book attempts to teach statistics using the computer as a tool rather than a stumbling block. We can leverage the power of computational tools to free our shoulders of the burden of memorizing intricate formulas which often add very little to the ideas they are trying to convey. I find it much more important that a student can find a \(p\)-value using something such as t.test and interpret it correctly than to know the formulas below:

\[t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\]

\[df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2} {\frac{\left(\frac{s_1^2}{n_1}\right)^2}{n_1 -1} + \frac{\left(\frac{s_2^2}{n_2}\right)^2}{n_2-1}} \]

From here, simple black and white pages began to see the addition of color, graphics, interactive explorations, and even some platypus flavoring here and there.
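
Returning briefly to the two formulas for \(t\) and \(df\), the following is a minimal sketch of the point being made. It uses two small made-up samples (hypothetical data) and takes the hypothesized difference \(\mu_1 - \mu_2\) to be 0, which is the default for t.test. The names v1, v2, t_by_hand, and df_by_hand are chosen only for this illustration.

```r
# Two small made-up samples (hypothetical data, for illustration only)
x1 <- c( 5.1, 4.9, 6.2, 5.8, 6.0, 5.5 )
x2 <- c( 4.2, 4.8, 5.0, 4.4, 4.9 )

# Let R do the work: by default t.test() does not assume equal variances
result <- t.test( x1, x2 )
result$statistic   # the t statistic
result$parameter   # the degrees of freedom
result$p.value     # the p-value we would go on to interpret

# The same quantities worked out "by hand" from the formulas,
# with the hypothesized difference mu_1 - mu_2 taken to be 0
v1 <- var( x1 ) / length( x1 )   # s_1^2 / n_1
v2 <- var( x2 ) / length( x2 )   # s_2^2 / n_2
t_by_hand <- ( mean( x1 ) - mean( x2 ) ) / sqrt( v1 + v2 )
df_by_hand <- ( v1 + v2 )^2 /
  ( v1^2 / ( length( x1 ) - 1 ) + v2^2 / ( length( x2 ) - 1 ) )

t_by_hand
df_by_hand
```

The hand calculations agree with the values stored in result, but the single line calling t.test is far easier to remember and leaves more attention for interpreting the \(p\)-value.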

For those wondering, “Why a platypus?” I will tell you a story. My daughter, Amelia, was born in April of 2023 and has been an amazing addition to the world since her birth. When she was little and would need a new diaper, Amelia would fuss unless you spoke to her. It didn’t matter what you said, she was only a few weeks old after all, but she wanted to hear you talk. One day, while changing her diaper, I didn’t know what to say, but I knew that if I talked, it would soothe her. For some reason, I asked Amelia: “Did you know that a platypus is furry and has a tail like a beaver, but that it also has a bill like a duck?” Amelia instantly calmed down and locked eyes with me. I continued to rattle off a few other platypus facts I happened to know and she stared at me transfixed. That began a minor obsession with the unique mammal from Australia and Amelia even began being called “Platypus Baby” or PB for short. In fact, to this day, Amelia responds to and refers to herself as PB.

Figure 0.3: Amelia, the Platypus Baby

I hope that this labor of love aids you in either your efforts to learn or teach statistics. Please reach out to me if you have any questions or suggestions as you venture down your own personal path of discovery.

Dr. Phil Huling

28 February, 2025

⁴ https://platypus.asn.au/platypus-body-temperature/
⁵ If you really must know the solution to this problem, you can look here.
⁶ See this article for more information and further discussion about the difficulty in using if not just defining a \(p\)-value.