Chapter 3 Using Graphs to Understand Data

Figure 3.1: A Baby Statypus Learns to Write their Name
The platypus is a mammal that lays eggs. Yes, that is a thing. The platypus and the echidna are the only mammals that lay eggs.24
Raw data is often difficult to glean any useful information from. The concept of variability is central to sound statistical thinking and it is our goal to be able to understand how values in a population vary. One of the simplest tools available to someone working with data is to draw a picture. In this chapter, we will learn how to use RStudio to generate different graphical displays of data such as bar plots, histograms, and scatter plots. Graphs are a literal picture of the variability of data and are a fundamental way that most people interact with data. Advanced users can quickly glean information about a distribution by looking at an appropriate graph and see patterns that may be difficult to see numerically. This is our first reason why it is important to be able to leverage the power of the R software.
New R Functions Used
All functions listed below have a help file built into RStudio. Most of these functions have options which are not fully covered or not mentioned in this chapter. To access more information about other functionality and uses for these functions, use the ? command. E.g., to see the help file for barplot, you can run ?barplot or help( barplot ) in either an R Script or in your Console.
- table(): Uses cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
- proportions(): Returns conditional proportions given entries of x divided by the appropriate sum(s).
- barplot(): Creates a bar plot with vertical or horizontal bars.
- hist(): Computes a histogram of the given data values.
- stem(): Produces a stem-and-leaf plot of the values in x.
- plot(): Draws a scatter plot with decorations such as axes and titles in the active graphics window.
3.1 Graphing Qualitative Data
We recall that Qualitative Data differs from Quantitative Data most fundamentally because it is impossible, or illogical, to perform arithmetic operations on qualitative data. Thus, all we can really analyze is how frequently different values of the variable occur, and we can visualize this with graphs.
3.1.1 Making Tables
The first tool we will use is not technically a graph, but will be needed for some of our basic graphs. To get started, we create a simple example to analyze. The following code chunk creates a random vector of colors using the sample function we saw in Chapter 2.
colorsList <- c( "Blue", "Orange", "Green", "Yellow", "Red", "Brown" )
colors <- sample( x = colorsList, size = 30, replace = TRUE )
colors
## [1] "Yellow" "Orange" "Brown" "Red" "Yellow" "Blue" "Red" "Brown" "Yellow" "Orange"
## [11] "Brown" "Orange" "Brown" "Brown" "Yellow" "Brown" "Brown" "Brown" "Yellow" "Yellow"
## [21] "Red" "Yellow" "Green" "Yellow" "Red" "Orange" "Red" "Orange" "Brown" "Green"
The above is the “raw” data and can be used in many ways. The simplest is to look at how many times each color was selected, or the frequency of each color. This will create the frequency table of our dataset. To do this, we introduce the table() function.
The syntax of table is
table( x, exclude, useNA )
where the parameters are:
- exclude: Can be used to exclude certain values of a vector from the table.
- useNA: Used to determine whether to include NA values in the table. The choices are "no", "ifany", and "always".
We can now tabulate the vector colors that we defined above. We will omit the exclude and useNA parameters for now.
table( colors )
## colors
## Blue Brown Green Orange Red Yellow
## 1 9 2 5 5 8
This table gives each outcome with its frequency. It says that we selected Blue only once, Brown a total of 9 times, and so on.
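Because the sampled colors change each time the code is run, a fixed vector makes it easier to check what table returns. The following is a minimal sketch using a small made-up vector, fruit (not part of the chapter's data), showing that individual counts can be pulled out by name and that the counts add up to the length of the vector.

```r
# A small made-up vector used only for illustration
fruit <- c( "Apple", "Pear", "Apple", "Fig", "Apple", "Pear" )

# Tabulate the values; the categories are listed in alphabetical order
fruitTable <- table( fruit )
fruitTable

# Individual counts can be pulled out by name
fruitTable[ "Apple" ]

# The counts add up to the number of (non-missing) values tabulated
sum( fruitTable ) == length( fruit )
```

Storing the table in a variable, as done here with fruitTable, is optional but can be convenient if the same counts are needed more than once.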
Example 3.1 We will not use the parameters above very often, but the parameter useNA can be used to control whether or not to include a column letting you know how many values were missing in the dataset. In most cases, useNA is set to "no" by default. To see how "ifany" and "always" differ, we first define a simple character vector, \(x\), as follows
x <- c("a", "a", "a", "b", "b", NA)
then we can get the table below.
table( x )
## x
## a b
## 3 2
This table omits the last NA
value. To see it in the table, we include either useNA = "ifany"
or useNA = "always"
.
table( x, useNA = "ifany" )
## x
## a b <NA>
## 3 2 1
To see the difference between "ifany"
and "always"
, we note that if we used "ifany"
for a colors
table we would not change the table we obtained above.
table( colors, useNA = "ifany")
## colors
## Blue Brown Green Orange Red Yellow
## 1 9 2 5 5 8
However, if we used "always" for the colors table, we would get the following.
table( colors, useNA = "always")
## colors
## Blue Brown Green Orange Red Yellow <NA>
## 1 9 2 5 5 8 0
The parameter exclude does exactly what its name suggests: it excludes certain values. For example, we can exclude just the values of Yellow from colors using
table( colors, exclude = "Yellow" )
## colors
## Blue Brown Green Orange Red
## 1 9 2 5 5
or exclude all of the primary colors with
table( colors, exclude = c( "Red", "Blue", "Yellow" ) )
## colors
## Brown Green Orange
## 9 2 5
Raw data often contains missing values that may not be labeled as NA
or even the same way across a dataset. It can be necessary to know how to implement certain parameters of functions to handle such cases. Here, we see that exclude
can be used to remove different types of “missing” values which may appear as NA
, NULL
, none
, or something else altogether.
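As a hedged illustration of this idea, the sketch below uses a made-up vector, answers, in which missing values were recorded in three different ways. Passing all three labels to exclude removes them from the table in one step. The vector and its labels are assumptions made for this example only.

```r
# Made-up survey answers where missing values were recorded inconsistently
answers <- c( "Yes", "No", "none", "Yes", NA, "NULL", "No", "Yes" )

# Treat "none", "NULL", and NA all as missing by excluding them together
cleanTable <- table( answers, exclude = c( "none", "NULL", NA ) )
cleanTable
```

Note that "NULL" here is an ordinary character string that happened to be typed into the data, which is different from R's own NULL object.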
Load the dataframe called Asylum1849.
Asylum1849 <- read.csv("https://statypus.org/files/Asylum1849.csv")
We saw this data frame in Chapter 2 where we mentioned that it contains intake information for the Meerenberg Insane Asylum in Brederodelaan, Santpoort, The Netherlands.
Look at str( Asylum1849 )
and look at the variables contained in the data. Which variables are qualitative? Which ones are sensible to be represented by a table? Make a table for the appropriate variables.
3.1.2 Bar Plots
Our main tool for graphing qualitative data is a bar plot (sometimes called a bar graph or bar chart). We can create a bar plot using the function barplot()
.
The syntax of barplot
is
barplot( height )
Remark. Graphical functions accept many options/parameters that can be used to make better looking graphs. The view of this document is that it is fairly easy to make R produce graphs, but making publication-quality graphs may take considerable effort and/or skill.
To see the way barplot will work, we will create a vector called rolls and attempt to make a barplot of it. The following is a random collection of rolls of a six sided die.
rolls <- c( 1, 4, 6, 1, 6, 6, 6, 6, 1, 6, 5, 3, 5, 2, 4, 2, 6, 6, 1, 6, 3, 5, 2, 5, 1 )
If we try to use barplot
directly with the vector rolls
, we get a very weird chart as we see below.
barplot( rolls )

Figure 3.2: A Bad Bar Plot
This is not a proper bar plot because each bar shows the value of an individual roll rather than a frequency. Looking back at rolls, the first five values were \(1,4,6,1, \text{and } 6\) which correspond exactly with the first five bars in the plot above. The barplot function must take in a value known as height, which is a vector giving the height of each bar. For a frequency bar plot, these heights should be the counts of each value, and this is exactly what the table function produces. So, to use barplot with raw data, we simply run table on our raw data and pass that output into the barplot function. We can now use barplot to get a visualization of rolls.
barplot( table( rolls ) )

Figure 3.3: A Better Bar Plot
This gives exactly the same information as table
does, as we show below, but does it in a graphical way.
table( rolls )
## rolls
## 1 2 3 4 5 6
## 5 3 2 2 4 9
Unless you are working with already tabulated data, that is, data already in the form of a table, you will need to pair barplot with table.
Remark. You may have noticed that rolls is actually quantitative data, which is why barplot( rolls ) produced a plot at all. If we had attempted to use barplot on raw qualitative data, it would throw an error. Feel free to try and run barplot on colors without pairing it with table.
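For readers who would like to see this behavior without halting a script, the following sketch wraps the call in tryCatch so that the error message is captured instead. The vector shirts is made up for this illustration, and the exact wording of the error message may differ between versions of R.

```r
# Made-up qualitative data used only for illustration
shirts <- c( "Red", "Blue", "Red", "Green" )

# tryCatch() captures the error message instead of stopping the script
result <- tryCatch(
  {
    barplot( shirts )
    "no error"
  },
  error = function( e ) conditionMessage( e )
)
result

# Pairing with table() gives barplot the counts it needs
barplot( table( shirts ) )
```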
If we wanted to get the proportion, or relative frequency, of each roll and not the frequency, we can inject the R function proportions()
as below:
barplot( proportions( table( rolls ) ) )

Figure 3.4: Relative Frequency Bar Plot
Remark.
- We still need to include table to get the relative frequency bar plot.
- The graphical output of a frequency and relative frequency bar plot are identical except for the labels on the vertical axis.
- The proportions function can be used with table with or without the barplot function. The composition proportions( table( x ) ) will produce a relative frequency table.
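To make the last point concrete, here is a short sketch using the rolls vector from earlier in this section (redefined so the code runs on its own). It shows the relative frequency table, confirms that the proportions add up to 1, and compares the result with dividing the counts by the number of rolls by hand.

```r
# The dice rolls used earlier in this section
rolls <- c( 1, 4, 6, 1, 6, 6, 6, 6, 1, 6, 5, 3, 5, 2, 4, 2, 6, 6, 1, 6, 3, 5, 2, 5, 1 )

# Relative frequency table
relFreq <- proportions( table( rolls ) )
relFreq

# Relative frequencies add up to 1
sum( relFreq )

# The same values computed by hand: counts divided by the number of rolls
table( rolls ) / length( rolls )
```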
We summarize this with the following:
If you have a vector, x, or column of a dataframe, df$Col, of values of a qualitative variable, we can make a bar plot of its frequencies via:
barplot( table( x ) ) #Will only run if x is defined
or
barplot( table( df$Col ) ) #Will only run if df is appropriately defined
If we want relative frequencies instead, we use the following code:
barplot( proportions( table( x ) ) ) #Will only run if x is defined
or
barplot( proportions( table( df$Col ) ) ) #Will only run if df is appropriately defined
Example 3.2 (Number of Carburetors) Let’s look at the built in dataset mtcars
. Use ?mtcars
for more information about this dataset.
head( mtcars )
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
and look at the column mtcars$carb
which gives the number of carburetors of each car in the dataframe. We can create a frequency bar plot for this variable with:
barplot( table(mtcars$carb ) )

Figure 3.5: Frequency Bar Plot of mtcars$carb
or a relative frequency bar plot via:
barplot( proportions( table( mtcars$carb ) ) )

Figure 3.6: Relative Frequency Bar Plot of mtcars$carb
Remark. We could also use the following three lines of code in which the second and third lines are directly copied from above.
x <- mtcars$carb
barplot( table( x ) )
barplot( proportions( table( x ) ) )
This method of defining x and then using the stock code may be preferred by some readers. However, at this point we will be attempting to give single lines of code that work.
If you have not installed the package HistData
, refer to Section 1.8 and do so. Load the package using the following line of code or selecting the check box next to HistData
in the Packages tab of the lower right pane of RStudio.
library(HistData)
##
## Attaching package: 'HistData'
## The following object is masked from 'package:BSDA':
##
## Wheat
Look at dataset DrinkWages
and look at ?DrinkWages
to learn a bit about the data. Make a bar plot showing the frequency of each wage class in the dataframe. Write full sentences explaining what the bar plot shows.
Definition 3.1 A Pareto Chart is a bar plot where the bars are ordered in descending order from left to right. A Pareto Chart can be stated in terms of either observed frequency or relative frequency.
To get a Pareto chart of a qualitative variable, use the following code.
barplot( sort( table( x ), decreasing = TRUE ) )
#Will only run if x is defined
or
barplot( sort( table( df$Col ), decreasing = TRUE ) )
#Will only run if df is appropriately defined
You can also include the function proportions
in between sort
and table
to make the Pareto chart display relative frequencies.
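As one possible sketch of that relative frequency version, the code below uses the built in mtcars$carb column so that it runs without any downloads. The variable name carbShare is introduced here only for illustration.

```r
# Relative frequencies of the number of carburetors, largest first
carbShare <- sort( proportions( table( mtcars$carb ) ), decreasing = TRUE )
carbShare

# Pareto chart in terms of relative frequency
barplot( carbShare )
```

The only change from the frequency version is that proportions sits between sort and table; the order of the bars is the same either way.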
Example 3.3 We can use the Bulls1996 data frame we introduced in Example 2.16. We begin by offering the code to download it if you do not have it on your machine.
Bulls1996 <- read.csv( "https://statypus.org/files/ChicagoBulls1996.csv")
We remind ourselves what data is contained in Bulls1996
by running str
on it.
str( Bulls1996 )
## 'data.frame': 15 obs. of 8 variables:
## $ No. : int 0 30 35 53 54 9 23 25 7 13 ...
## $ Player : chr "Randy Brown" "Jud Buechler" "Jason Caffey" "James Edwards" ...
## $ Pos : chr "PG" "SF" "PF" "C" ...
## $ Ht : int 74 78 80 84 82 78 78 75 82 86 ...
## $ Wt : int 190 220 255 225 240 185 198 175 192 265 ...
## $ Birth.Date: chr "May 22, 1968" "June 19, 1968" "June 12, 1973" "November 22, 1955" ...
## $ Exp : int 4 5 0 18 6 9 10 7 2 4 ...
## $ College : chr "Houston, New Mexico State" "Arizona" "Alabama" "Washington" ...
We can give an example of a Pareto chart by using the Pos variable which gives the primary position that each player played.
barplot( sort( table( Bulls1996$Pos ), decreasing = TRUE ))

Figure 3.7: Pareto Chart of Bulls1996$Pos
3.2 Comparing Qualitative Variables
Let's return to the mtcars dataframe and see how the number of cylinders a car has is related to the type of transmission it has. Using ?mtcars lets us know that a value of 0 in the am column means that the car has an automatic transmission while a value of 1 indicates it is a manual transmission. We can basically redo the work in Section 3.1 using the exact same functions.
3.2.1 Two Way Tables
To get to a Comparative Bar Plot, which we will introduce in Section 3.2.2, we first look at a two way table using table again. Using ?table and looking at the Arguments section shows that the main input, listed as ..., is “one or more objects which can be interpreted as factors.” We can thus pass two different columns of mtcars to table as below:
table( mtcars$am , mtcars$cyl )
##
## 4 6 8
## 0 3 4 12
## 1 8 3 2
This table says that there are 3 cars which have an automatic transmission as well as 4 cylinders, 4 cars which have an automatic transmission as well as 6 cylinders, and so on.
We could also reverse the order of the columns in table
to get:
table( mtcars$cyl , mtcars$am )
##
## 0 1
## 4 3 8
## 6 4 3
## 8 12 2
This shows us that the entries in table behave like the options used with a dataframe via df[ , ]. The first entry controls the rows and the second entry controls the columns. Making a good graph may require the user to test which order produces the most understandable graph.
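The row-then-column behavior can be checked directly. In the sketch below, the names byTransmission and byCylinders are introduced only for this illustration. Individual cells are pulled out with the row label first and the column label second, and swapping the inputs to table is seen to simply transpose the result.

```r
# Rows come from the first input, columns from the second
byTransmission <- table( mtcars$am, mtcars$cyl )   # rows: am, columns: cyl
byCylinders <- table( mtcars$cyl, mtcars$am )      # rows: cyl, columns: am

# Automatic (am = 0) cars with 8 cylinders, found in either table
byTransmission[ "0", "8" ]
byCylinders[ "8", "0" ]

# Swapping the inputs transposes the table
all( t( byTransmission ) == byCylinders )
```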
Refer back to the “Now it’s your turn!” example from Section 3.1.1 and use the code there to load the dataframe called Asylum1849
if necessary. Choose a couple of pairs of qualitative variables and construct two way tables to investigate the relationship between the variables. Can you see any patterns?
Remark. We will not be able to make an explicit decision if two variables are related until we get the machinery introduced in Chapter 11. For now, we can simply comment on patterns that appear to exist based on the tables we are able to create.
3.2.2 Comparative Bar Plot
Now that we are comfortable creating two way tables, we can proceed to visualizations of them. One way to visualize two way tables is to use a comparative bar plot. We can do this with the barplot
function.
The new syntax for barplot is
barplot( height, beside, legend.text )
where the parameters we use here are
- height: Now the output of a two way table.
- beside: If FALSE or F, the columns of height are portrayed as stacked bars, and if TRUE or T the columns are portrayed as juxtaposed bars.
- legend.text: Used to give more information to make our graphs more understandable.
Like before, barplot
needs to be fed tabulated values, so you will need to also use table
if working with raw data.
Returning to the example we saw in Section 3.2.1, we can use barplot
to achieve the following.
barplot( table( mtcars$cyl , mtcars$am ) , beside = TRUE , legend.text = TRUE )

Figure 3.8: Comparative Bar Plot of cyl and am from mtcars Version 1
As mentioned before, we could reverse the roles of cyl and am to get a different way of displaying the same information.
barplot( table( mtcars$am , mtcars$cyl ), beside = TRUE, legend.text = TRUE )

Figure 3.9: Comparative Bar Plot of cyl and am from mtcars Version 2
Here we are using beside = TRUE to make sure the bars are side by side, but we also show what beside = FALSE looks like below.
barplot( table( mtcars$am, mtcars$cyl ), beside = FALSE, legend.text = TRUE )

Figure 3.10: Comparative Bar Plot of cyl and am from mtcars Version 3
Remark.
- It is important to remember that although the graphs may appear different when we change the parameters of barplot, the actual data has not changed and the differences are purely aesthetic. It is the job of the reader to always be a critical consumer and make sure they understand the graph they are viewing.
- As mentioned before, there are plenty of options available to make the above graphs better and avoid things like the awkward legend position above. However, this document will focus on trying to produce simple graphs.
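For readers curious about what one such option looks like, the following is a hedged sketch and not a recommended house style. It relabels the am values with factor so that the legend reads Automatic and Manual rather than 0 and 1, and it uses the args.legend parameter of barplot to request a different legend position. The names transmission and counts are introduced here for illustration, and the position "topleft" may need adjusting depending on the size of the plot window.

```r
# Relabel 0/1 so the legend is easier to read
transmission <- factor(
  mtcars$am,
  levels = c( 0, 1 ),
  labels = c( "Automatic", "Manual" )
)

# Two way table: rows are transmission types, columns are cylinder counts
counts <- table( transmission, mtcars$cyl )
counts

# args.legend passes extra settings, such as the position, to the legend
barplot(
  counts,
  beside = TRUE,
  legend.text = TRUE,
  args.legend = list( x = "topleft" )
)
```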
We summarize the work of this section as well as Section 3.2.1 with the following code templates.
To get a Two Way Table of counts based on the columns Col1
and Col2
of a dataframe, df
we use the code:
table( df$Col1, df$Col2 ) #Will only run if df is appropriately defined
To get a Comparative Bar Plot based on counts based on the columns Col1
and Col2
of a dataframe, df
we use the code:
barplot( table( df$Col1, df$Col2 ), beside = TRUE, legend.text = TRUE )
#Will only run if df is appropriately defined
Make a comparative bar plot of the number of gears a car had based on the shape of the engine. The two variables are gear
which gives the number of forward gears a car had and vs
which tells whether an engine was V-shaped (value of 0) or straight (value of 1). Try changing the order of the variables and playing with the beside
parameter. Can you see a relationship between the variables?
3.3 Graphing Quantitative Data
Quantitative data differs from qualitative data in the sense that it is always at least at the interval level and thus we can separate data into equally sized groups or “bins.” Doing so allows us to plot the frequency of not just individual values, but intervals of values that are close. For example, we may group all cars that have between 20 and 25 miles per gallon fuel efficiency. This type of graph is called a histogram and can be used for both continuous and discrete data types; see Section 2.2.4 for the details on the distinction.
3.3.1 Histograms for Continuous Data
We will investigate using a histogram for continuous data by using a new variable in mtcars. The variable mpg in the mtcars dataframe gives the fuel efficiency of each car in miles per gallon. We can make a histogram using the R function hist().
The syntax of hist is
hist( x, right, breaks, freq )
where the parameters are:
- right: By default, right = TRUE and the histogram cells are intervals of the form \((a, b]\), i.e., they include their right-hand endpoint, but not their left one. If the user specifies that right = FALSE, the intervals are of the form \([a, b)\) which is common in many other introductory statistics textbooks.
- breaks: This controls the number of bins being used in the histogram. The simplest way to control this is by setting the parameter equal to an integer, which R uses as a suggestion for the number of equally spaced breaks.
- freq: (Usually) By default, freq = TRUE and the histogram graphic is a representation of frequencies. If the user specifies freq = FALSE, the vertical axis is rescaled to show densities, so that the total area of the bars is 1; the shape matches that of a relative frequency histogram.
Remark.
- R attempts to give the best histogram it can by default, but the user can control nearly every aspect of it.
- R may not use the exact number of cells you specify in breaks. There are more direct tools that give more control, but this document does not go into those other than the ones in Section 3.3.2.
We can now make a histogram of the fuel efficiency of cars in the mtcars
dataframe by using the following.
hist( mtcars$mpg )

Figure 3.11: Histogram of mtcars$mpg
From here we can see that there were 8 vehicles that had a fuel efficiency between 20 and 25 miles per gallon. To be precise, there were 8 vehicles in mtcars whose value in the column mpg was in the interval \((20,25]\). If a car had a fuel efficiency of exactly 20 mpg, it would be counted in the bin given by \((15,20]\). As mentioned in the remark above, it is possible to use the breaks option to control the values of the bins directly, but a package such as ggplot2 is likely a better choice for anyone wanting to create graphs that deviate too far from the default graphics produced by R.
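The counts read off the graph can also be checked numerically. As a brief sketch, setting plot = FALSE asks hist to return its calculations without drawing anything; the name mpgHist is introduced here only for illustration.

```r
# Compute the histogram without drawing it
mpgHist <- hist( mtcars$mpg, plot = FALSE )

# The endpoints of the bins and the number of cars in each bin
mpgHist$breaks
mpgHist$counts

# Direct count of the cars in the interval (20, 25]
sum( mtcars$mpg > 20 & mtcars$mpg <= 25 )
```

The third entry of mpgHist$counts corresponds to the interval \((20,25]\) and should agree with the direct count on the last line.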
It is important to remember that any differences in plots produced are completely attributable to the selection of the parameters and do not mean there has been any “change” in the data itself. The manipulation of graphs is often done to make a graph tell the story that the creator wants it to. There is the famous quote which Mark Twain attributed to the British Prime Minister Benjamin Disraeli in Chapters from My Autobiography, published in the North American Review in 1907:
“There are three kinds of lies: lies, damned lies, and statistics.”25
While slightly allegorical, it does point out that statistics can be used to deceive, since fully understanding statistical graphs and calculations requires attention to subtle details.
Example 3.4 To get a feel for what the breaks parameter does, we run a few examples of histograms with different values of breaks. To do this, we introduce a new dataset called PlatypusData1.
PlatypusData1 <- read.csv( "https://statypus.org/files/PlatypusData1.csv" )
As this is new data, we begin by looking at it with the functions str
and head
.
str( PlatypusData1 )
## 'data.frame': 218 obs. of 4 variables:
## $ Weight : num 1.12 0.94 NA 1.18 1.09 0.7 1.55 1 0.52 1.4 ...
## $ Sex : chr "M" "F" "F" "M" ...
## $ AgeClass : chr "SA" "A" "A" "A" ...
## $ concentration: num 2.1 3.26 8.38 2.5 2.14 3.96 16.4 5.6 9.6 4.44 ...
This tells us that PlatypusData1
is a dataframe that consists of 218 observations, or individuals, each of which has 4 variables recorded for it. The variable Sex
is a character string and we can see the possible values it has by running table
.
table( PlatypusData1$Sex )
##
## F M
## 1 109 108
This shows that our dataset contains observations of 109 females and 108 males as well as a single observation where the sex is not recorded. The variable AgeClass
is also of the character type and contains the age of the platypus as shown below.
table( PlatypusData1$AgeClass )
##
## A J SA
## 1 167 46 4
Platypuses are not easily bred in captivity, so exact ages are rarely known. As such, this data has classified the platypuses into either Juvenile (J), Adult (A), or Senior Adult (SA). We also see that there is a single observation that does not have a recorded value for this variable. The interested reader is welcome to examine the article that the data originates from26, but we warn you that the article is quite scientifically dense. The variable concentration deals with the concentration of genomic DNA in a blood sample. We will examine and further discuss this variable later.
We will be focusing on the variable Weight
which gives the mass27 of each platypus.
To see what hist
will do by default, we simply run the following code.
hist( PlatypusData1$Weight )

Figure 3.12: Histogram with Default breaks
Here we see that R uses 11 bins when making a histogram for PlatypusData1$Weight
. We can try to force R to use only a single bin by setting breaks = 1
:
hist( PlatypusData1$Weight, breaks = 1 )

Figure 3.13: Histogram with breaks = 1
Instead of a single bin, we get a single “break” and two different bins. To see if this is indeed how breaks works, we proceed to produce histograms with breaks = 2 and breaks = 3.
hist( PlatypusData1$Weight, breaks = 2 )

Figure 3.14: Histogram with breaks = 2
hist( PlatypusData1$Weight, breaks = 3 )

Figure 3.15: Histogram with breaks = 3
The histogram using breaks = 2 worked as we anticipated, but the histogram with breaks = 3 had 4 breaks and 5 bins. These examples show one thing: setting breaks to an integer asks R to use roughly that many bins, but R will choose the number of bins it feels is best that is near the value we set.
Investigating further, we can also increase the number of bins with
hist( PlatypusData1$Weight, breaks = 10 )

Figure 3.16: Histogram with breaks = 10
or even use more bins to get:
hist( PlatypusData1$Weight, breaks = 20 )

Figure 3.17: Histogram with breaks = 20
In all cases, the number of bins is near the value of breaks
but fails to match in many cases. These examples reinforce our earlier remark that if the user wants more refined control, they will need to invest energy into learning a bit more coding or use advanced packages such as ggplot2
.
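One modest step toward that extra control, staying within base R, is to give breaks a vector of bin endpoints instead of a single integer. The sketch below uses mtcars$mpg so that it runs without a download; the name edges and the choice of bins of width 2.5 are assumptions made for illustration. The endpoints must cover every value in the data, or hist will stop with an error.

```r
# Bin endpoints chosen by hand: 10, 12.5, 15, ..., 35
edges <- seq( from = 10, to = 35, by = 2.5 )

# With a vector of endpoints, R uses exactly these bins
fineHist <- hist( mtcars$mpg, breaks = edges )

# Number of bins actually used
length( fineHist$counts )
```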
Example 3.5 To see what freq
does, we recall what the default histogram of mtcars$mpg
looked like:
hist( mtcars$mpg )

Figure 3.18: Frequency Histogram
We now can set freq = FALSE
(or simply freq = F
) to get:
hist( mtcars$mpg, freq = FALSE )

Figure 3.19: Relative Frequency Histogram
This shows that freq = FALSE gives a Relative Frequency Histogram. Strictly speaking, R labels the vertical axis as density, meaning that the areas of the bars, rather than their heights, add up to 1. Setting freq = TRUE or simply omitting the freq parameter (because TRUE is the default value) gives a standard Frequency Histogram. Like bar plots, the shape of a histogram does not change when we move from frequencies to relative frequencies, only the values along the vertical axis do.
To make a basic histogram of a vector x
, you can use the code:
hist( x, freq ) #Will only run if x is defined
or if we have a column of a dataframe, we would use:
hist( df$Col, freq ) #Will only run if df is appropriately defined
Here, we set freq = FALSE if we want relative frequencies rather than just frequencies. We can set freq = TRUE or omit freq if we want regular frequencies.
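When the bins are not of width 1, the heights drawn with freq = FALSE are densities, that is, relative frequencies divided by the bin width. The following sketch checks this relationship for mtcars$mpg and shows one possible workaround, not the only one, for drawing bars whose heights are the relative frequencies themselves. The names mpgHist, relFreq, and binWidth are introduced only for this illustration.

```r
# Compute the histogram without drawing it
mpgHist <- hist( mtcars$mpg, plot = FALSE )

# Relative frequency of each bin and the width of each bin
relFreq <- mpgHist$counts / sum( mpgHist$counts )
binWidth <- diff( mpgHist$breaks )

# Density times bin width recovers the relative frequency
all.equal( mpgHist$density * binWidth, relFreq )

# Workaround: overwrite the stored counts with relative frequencies,
# then draw the stored histogram with a matching axis label
mpgHist$counts <- relFreq
plot(
  mpgHist,
  freq = TRUE,
  ylab = "Relative Frequency",
  main = "Histogram of mtcars$mpg"
)
```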
Example 3.6 (Working With Only Part of a Dataframe) Sometimes, we want to look at only a portion of a dataset. For example, if we want to make a histogram of the masses of just the male platypuses, we need to find just the portion of PlatypusData1, introduced and imported in Example 3.4, that contains this data. Recall that we can find a subset of PlatypusData1 by using PlatypusData1[ R , C ] where R is a rule used to choose the rows we want and C is a rule for the columns we want to use. For R we will use the command PlatypusData1$Sex == "M" which will effectively select only those rows where the platypus is male. If we look at PlatypusData1[ PlatypusData1$Sex == "M" , ] we get the following:
head( PlatypusData1[ PlatypusData1$Sex == "M" , ] )
## Weight Sex AgeClass concentration
## 1 1.12 M SA 2.10
## 4 1.18 M A 2.50
## 5 1.09 M A 2.14
## 7 1.55 M A 16.40
## 9 0.52 M J 9.60
## 10 1.40 M A 4.44
We have successfully selected just the rows of male platypuses. There are a couple of ways of progressing from here.
If we planned on doing more analysis of the male platypuses, it may be helpful to create a new dataframe of just these platypuses. We could do this via
MalePlatys <- PlatypusData1[ PlatypusData1$Sex == "M", ]
If you now view MalePlatys you should see the same output as when we ran PlatypusData1[ PlatypusData1$Sex == "M" , ]. However, we can now simply pull the column Weight out of this dataframe by running the code below.
MalePlatys$Weight
## [1] 1.12 1.18 1.09 1.55 0.52 1.40 1.47 1.37 1.27 1.45 0.47 1.33 1.76 0.66 1.50 1.30 1.33 1.47 1.53
## [20] 0.91 1.25 1.58 1.67 0.64 1.11 1.29 1.42 0.90 1.48 0.76 NA 1.53 1.20 1.52 1.48 1.60 1.36 1.56
## [39] NA 1.59 1.23 NA 1.10 1.21 1.62 1.68 0.90 1.79 1.10 1.43 1.47 1.64 0.88 1.69 NA NA 1.80
## [58] 1.83 1.84 1.74 1.45 1.15 NA 1.77 1.05 1.80 2.01 1.60 1.07 NA 0.84 2.11 1.15 1.73 1.84 1.58
## [77] 1.03 0.82 2.08 1.55 1.66 1.39 1.64 1.86 1.84 0.11 0.85 1.51 0.93 1.79 1.50 1.18 1.46 1.28 1.36
## [96] 1.56 0.76 1.02 1.99 1.58 0.82 1.64 0.90 0.93 1.34 1.36 1.14 1.68
Since we now have a vector, we can now run:
hist( MalePlatys$Weight )

Figure 3.20: Frequency Histogram for Male Platypuses, Method 1
However, if we had no desire to investigate MalePlatys further, it may not be beneficial to create a new dataframe. We can pull just the Weight values by using the following code:
PlatypusData1[ PlatypusData1$Sex == "M", "Weight" ]
## [1] 1.12 1.18 1.09 1.55 0.52 1.40 1.47 1.37 1.27 1.45 0.47 1.33 1.76 0.66 1.50 1.30 1.33 1.47 1.53
## [20] 0.91 1.25 1.58 1.67 0.64 1.11 1.29 1.42 0.90 1.48 0.76 NA 1.53 1.20 1.52 1.48 1.60 1.36 1.56
## [39] NA 1.59 1.23 NA 1.10 1.21 1.62 1.68 0.90 1.79 1.10 1.43 1.47 1.64 0.88 1.69 NA NA 1.80
## [58] 1.83 1.84 1.74 1.45 1.15 NA 1.77 1.05 1.80 2.01 1.60 1.07 NA 0.84 2.11 1.15 1.73 1.84 1.58
## [77] 1.03 0.82 2.08 1.55 1.66 1.39 1.64 1.86 1.84 0.11 0.85 1.51 0.93 1.79 1.50 1.18 1.46 1.28 1.36
## [96] 1.56 0.76 1.02 1.99 1.58 0.82 1.64 0.90 0.93 1.34 1.36 1.14 1.68
We can then give a histogram of this by putting it into hist
via:
hist( PlatypusData1[ PlatypusData1$Sex == "M", "Weight" ] )

Figure 3.21: Frequency Histogram for Male Platypuses, Method 2
Remark. There are many ways to accomplish the above task with R. For example PlatypusData1$Weight[ PlatypusData1$Sex == "M" ]
is equivalent to PlatypusData1[ PlatypusData1$Sex == "M", "Weight" ]
, but we recommend the latter form as it allows for easier generalization if we wanted more than just one column. For example, the following code gives (the head of) a dataframe containing both the mass and age class of male platypuses.
head( PlatypusData1[ PlatypusData1$Sex == "M", c( "Weight", "AgeClass" ) ] )
## Weight AgeClass
## 1 1.12 SA
## 4 1.18 A
## 5 1.09 A
## 7 1.55 A
## 9 0.52 J
## 10 1.40 A
The output is a new dataframe consisting of the portion of PlatypusData1
with only the rows where Sex
is "M"
and only the columns Weight
and AgeClass
.
You can also use the code PlatypusData1[ PlatypusData1$Sex == "M", ]$Weight
and a few other variations. There is often not a single way to accomplish your goal in R and we encourage the reader to try things and see what they do.
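As a quick check that these variations agree, the sketch below applies them to the built in mtcars dataframe, used here as a stand-in so that the code runs without downloading PlatypusData1. It pulls the mpg values of the cars with a manual transmission (am equal to 1) in several ways, including the base R function subset, which is not otherwise covered in this chapter. The names methodA through methodD are introduced only for this illustration.

```r
# Several ways to pull mpg for the manual transmission cars
methodA <- mtcars[ mtcars$am == 1, "mpg" ]
methodB <- mtcars$mpg[ mtcars$am == 1 ]
methodC <- mtcars[ mtcars$am == 1, ]$mpg
methodD <- subset( mtcars, am == 1 )$mpg

# Each comparison returns TRUE when the two results match exactly
identical( methodA, methodB )
identical( methodA, methodC )
identical( methodA, methodD )
```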
3.3.2 Histograms for Discrete Data
When graphing continuous data, R typically does a fine job choosing the bins to split the data into. However, the hist function typically does not do a great job at making graphs of discrete data. For example, consider the following vector of dice rolls:
x <- c( 6, 1, 2, 2, 2, 3, 2, 5, 4, 4, 3, 6, 5, 4, 1 )
We can tabulate these to see what a histogram should look like.
table( x )
## x
## 1 2 3 4 5 6
## 2 4 2 3 2 2
If we use hist
to plot this, we get the following:
hist( x )

Figure 3.22: Bad Histogram of Dice Rolls Version 1
It appears that the counts for 1 and 2 have been combined. We could try the option of right = FALSE
to solve this, but that gives the following:
hist( x, right = FALSE )

Figure 3.23: Bad Histogram of Dice Rolls Version 2
While this did fix the issue we had with 1s and 2s combining, we now get that the 5s and 6s have been combined. A more traditional way to display discrete data with integer values is to have bins offset by one half. That is, we want the bin that is used for the value of 3, for example, to be from 2.5 to 3.5. We will illustrate this via a larger dataset.
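Before moving to the larger dataset, the same idea applied to the dice rolls could look like the sketch below. The bin endpoints 0.5, 1.5, and so on up to 6.5 are written out by hand for this small example, and the name diceHist is introduced only for illustration.

```r
# The dice rolls from above
x <- c( 6, 1, 2, 2, 2, 3, 2, 5, 4, 4, 3, 6, 5, 4, 1 )

# Bins centered on each face of the die: 0.5 to 1.5, 1.5 to 2.5, and so on
diceHist <- hist( x, breaks = seq( from = 0.5, to = 6.5, by = 1 ) )

# The bin counts now match the output of table( x )
diceHist$counts
```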
Example 3.7 We begin by loading in the file called BabyData1. This file contains data about the births of a sample of babies.
BabyData1 <- read.csv( "https://statypus.org/files/BabyData1.csv" )
As mentioned in Chapter 1, we should begin our investigation of BabyData1 by using str and head.
str( BabyData1 )
## 'data.frame': 200 obs. of 12 variables:
## $ mom_age : int 35 22 35 23 23 26 25 32 41 22 ...
## $ dad_age : int 35 21 42 NA 28 31 37 38 39 24 ...
## $ mom_educ : int 6 3 4 1 4 3 5 5 4 3 ...
## $ mom_marital : int 1 1 1 1 1 2 1 1 1 2 ...
## $ numlive : int 2 1 0 2 0 1 0 1 0 0 ...
## $ dobmm : int 2 3 6 8 9 10 7 12 11 2 ...
## $ gestation : int 39 42 39 40 42 39 38 38 36 40 ...
## $ sex : chr "F" "F" "F" "F" ...
## $ weight : int 3175 3884 3030 3629 3481 3374 2693 4338 2834 2948 ...
## $ prenatalstart: int 1 2 2 1 2 4 1 1 2 1 ...
## $ orig_id : int 1047483 1468100 2260016 3583052 795674 3544316 3726920 2606970 2481971 243759 ...
## $ preemie : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
This shows us that BabyData1
is a dataframe and contains variables such as the age of the mother and the sex of the baby born. As the dataframe has 200 rows (or observations as it is listed in str
), it is helpful to look at only a portion of the data using head
now.
head( BabyData1 )
## mom_age dad_age mom_educ mom_marital numlive dobmm gestation sex weight prenatalstart orig_id
## 1 35 35 6 1 2 2 39 F 3175 1 1047483
## 2 22 21 3 1 1 3 42 F 3884 2 1468100
## 3 35 42 4 1 0 6 39 F 3030 2 2260016
## 4 23 NA 1 1 2 8 40 F 3629 1 3583052
## 5 23 28 4 1 0 9 42 F 3481 2 795674
## 6 26 31 3 2 1 10 39 M 3374 4 3544316
## preemie
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
This shows us a portion of BabyData1
but not in a way that would scroll our screen to lose sight of the information contained in the column names and labels. We will look at the data in the variable mom_age
which lists the mother’s age at the time of the baby’s birth. We see that it is of type int
which means we only have integer values in this variable. As such, to avoid the issues we had with the dice rolls, we will need to be much more explicit in how we set up our hist
function.
The following code will give a histogram of BabyData1$mom_age
where we want each bar of our histogram to include only 1 value.
data <- BabyData1$mom_age #Entered by the User
bin_width <- 1 #Entered by the User
#Do not change any values below this comment
hist(
  data, freq = TRUE,
  breaks = seq(
    from = min( data, na.rm = TRUE ) - 0.5,
    to = max( data, na.rm = TRUE ) + bin_width - 0.5,
    by = bin_width
  ) )

Figure 3.24: Histogram of mom_age
Version 1
If we wanted to smooth out the histogram by grouping the ages into bins of width 3 (for example, combining the ages 30, 31, and 32 into a single bin), we can easily accomplish this by setting the value of bin_width
to be the number of values to group together. Below we set bin_width
to be 3 and get a new histogram.
data <- BabyData1$mom_age #Entered by the User
bin_width <- 3 #Entered by the User
#Do not change any values below this comment
hist(
  data, freq = TRUE,
  breaks = seq(
    from = min( data, na.rm = TRUE ) - 0.5,
    to = max( data, na.rm = TRUE ) + bin_width - 0.5,
    by = bin_width
  ) )

Figure 3.25: Histogram of mom_age
Version 2
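To see why the breaks are built this way, it can help to print them for a tiny vector. The sketch below uses a hypothetical helper, discrete_breaks, and made-up ages; neither is part of the chapter's own code. It shows that every break lands halfway between two integers and that the final break sits above the largest value.

```r
# Hypothetical helper wrapping the seq() call used inside hist() above
discrete_breaks <- function( data, bin_width ) {
  seq(
    from = min( data, na.rm = TRUE ) - 0.5,
    to = max( data, na.rm = TRUE ) + bin_width - 0.5,
    by = bin_width
  )
}

ages <- c( 18, 20, 21, 25, NA )  # made-up integer ages with one missing value

breaks_1 <- discrete_breaks( ages, bin_width = 1 )  # 17.5, 18.5, ..., 25.5
breaks_3 <- discrete_breaks( ages, bin_width = 3 )  # 17.5, 20.5, 23.5, 26.5

breaks_1
breaks_3
```

With bin_width = 3 the bins hold the ages 18 to 20, 21 to 23, and 24 to 26. The extra bin_width added to the upper limit is what guarantees that the largest age, 25, still falls inside the last bin.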
To plot a histogram of discrete data, we can use the following code chunk. The user must supply the discrete data they wish to graph in the place of x
and also the width of the bins they would like to use, as an integer, in the place of w
.
data <- x #Data must be supplied
bin_width <- w #Value must be given
#Do not change any values below this comment
hist(
  data, freq = TRUE,
  breaks = seq(
    from = min( data, na.rm = TRUE ) - 0.5,
    to = max( data, na.rm = TRUE ) + bin_width - 0.5,
    by = bin_width
  ) )
Using the x
vector from the beginning of Section 3.3.2, create a histogram that correctly displays the data.
3.3.3 Stem and Leaf Plots
To make a Stem and Leaf Plot we simply use the R function stem()
. Stem and leaf plots are a fairly unique way to graph data, so we will introduce them via example.
The syntax of stem
is
stem( x, scale, width )
where the parameters are:
scale: This controls the plot length.
width: The desired width of the plot.
Example 3.8 We begin by looking at the variable qsec
in mtcars
which measures the quarter mile time of the car in seconds. We first look at the data and then a histogram of it.
mtcars$qsec
## [1] 16.46 17.02 18.61 19.44 17.02 20.22 15.84 20.00 22.90 18.30 18.90 17.40 17.60 18.00 17.98 17.82
## [17] 17.42 19.47 18.52 19.90 20.01 16.87 17.30 15.41 17.05 18.90 16.70 16.90 14.50 15.50 14.60 18.60
hist( mtcars$qsec )

Figure 3.26: Histogram of Quarter Mile Times
We now implement the stem()
function.
stem( mtcars$qsec )
##
## The decimal point is at the |
##
## 14 | 56
## 15 | 458
## 16 | 5799
## 17 | 00134468
## 18 | 00356699
## 19 | 459
## 20 | 002
## 21 |
## 22 | 9
If you were to turn the stem and leaf plot ninety degrees counterclockwise, you would see that these two graphs show nearly the same information. The histogram uses vertical rectangles to show the frequencies and the stem and leaf plot uses a leaf to mark each occurrence of a value with a specified stem.
The first value in mtcars$qsec
was 16.46. Looking at the stem and leaf plot we see the number 16 on the left side of the vertical line. To the right of the | we see 5799. This means that our dataset contains values which round to 16.5, 16.7, 16.9, and 16.9. The 16 is known as the stem and the values 5, 7, 9, and 9 are the leaves of that stem. Leaves are always a single digit and are in numerical order of the value they represent. The stem of 21 is included even though it has no leaves, just as the histogram does not omit the portion of the number line containing 21.
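The link between the data and the leaves can be checked directly. As a small sketch, the code below pulls out the quarter mile times between 16 and 17 seconds, sorts them, and rounds them to one decimal place. The final digits of the rounded values are the leaves on the stem of 16.

```r
# Quarter mile times between 16 and 17 seconds, in increasing order
stem_16 <- sort( mtcars$qsec[ mtcars$qsec >= 16 & mtcars$qsec < 17 ] )

stem_16              # 16.46 16.70 16.87 16.90
round( stem_16, 1 )  # 16.5 16.7 16.9 16.9
```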
Remark. The astute reader may notice a discrepancy between the stem and leaf plot for the stems of 17 and 18 and the associated bins in the histogram of mtcars$qsec
. This is explored in Exercise 3.8.
Example 3.9 (Split Stems) R often will automatically make split stems when necessary as shown with mtcars$wt
which gives the vehicle’s weight in thousands of pounds.
stem( mtcars$wt )
##
## The decimal point is at the |
##
## 1 | 5689
## 2 | 123
## 2 | 56889
## 3 | 22224444
## 3 | 55667888
## 4 | 1
## 4 |
## 5 | 334
Here there are two stems with a value of 2. One of them has the leaves 0 through 4 and the other has the leaves 5 through 9.
We can also use the scale
parameter to manually split stems for some datasets. Setting scale = 2
will roughly double the length (number of stems) as compared to the default settings. We can apply this to mtcars$qsec
which had standard stems in Example 3.8.
stem( mtcars$qsec, scale = 2 )
##
## The decimal point is at the |
##
## 14 | 56
## 15 | 4
## 15 | 58
## 16 |
## 16 | 5799
## 17 | 001344
## 17 | 68
## 18 | 003
## 18 | 56699
## 19 | 4
## 19 | 59
## 20 | 002
## 20 |
## 21 |
## 21 |
## 22 |
## 22 | 9
Example 3.10 (Double Stems) R can also double stems as seen by looking at the variable we looked at before, mtcars$mpg
.
stem( mtcars$mpg )
##
## The decimal point is at the |
##
## 10 | 44
## 12 | 3
## 14 | 3702258
## 16 | 438
## 18 | 17227
## 20 | 00445
## 22 | 88
## 24 | 4
## 26 | 03
## 28 |
## 30 | 44
## 32 | 49
Looking at the stem of 18 we see values which are (when rounded) 18.1, 18.7, 19.2, 19.2, and 19.7. We know this because the values are always in order, so 18.2 cannot follow 18.7. Unfortunately, this can lead to some ambiguity as is the case with the 26 stem. The leaves of 0 and 3 could be either 26.0 and 26.3, 26.0 and 27.3, or 27.0 and 27.3. (27.0 and 26.3 would not be possible because the leaves must be in ascending order of their data values.)
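When a doubled stem leaves the values ambiguous, the raw data settles the question. The check below is a minimal sketch that assumes the stem of 26 covers values from 26 up to, but not including, 28.

```r
# Values of mpg that fall under the doubled stem of 26
stem_26 <- sort( mtcars$mpg[ mtcars$mpg >= 26 & mtcars$mpg < 28 ] )

stem_26  # 26.0 27.3
```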
We can also use the scale
parameter to manually double stems for some datasets. Setting scale = 0.5
will roughly halve the length (number of stems) as compared to the default settings. We can apply this to mtcars$qsec
which had standard stems in Example 3.8.
stem( mtcars$qsec, scale = 0.5 )
##
## The decimal point is at the |
##
## 14 | 56458
## 16 | 579900134468
## 18 | 00356699459
## 20 | 002
## 22 | 9
Remark. Unfortunately, the scale
parameter is a bit like breaks
for histograms, where setting breaks = 10
, for example, gave a histogram that had only roughly 10 bins: R treats the value as a suggestion rather than an exact instruction. You should only attempt to modify the scale
parameter after looking at the default plot and deciding if a modification is warranted.
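One way to see how scale changes a plot is to count the stems it produces. The sketch below uses a hypothetical helper, count_stems, which is not a built-in R function. It captures the printed output of stem and counts the rows that begin with a stem.

```r
# Hypothetical helper: count the stem rows printed by stem()
count_stems <- function( x, ... ) {
  out <- capture.output( stem( x, ... ) )
  sum( grepl( "^ *[0-9]+ \\|", out ) )  # rows of the form "  14 | 56"
}

stems_default <- count_stems( mtcars$qsec )
stems_split   <- count_stems( mtcars$qsec, scale = 2 )
stems_doubled <- count_stems( mtcars$qsec, scale = 0.5 )

c( stems_default, stems_split, stems_doubled )
```

The three plots of mtcars$qsec printed earlier in this section have 9, 17, and 5 stems, so changing scale roughly, but not exactly, doubled or halved the length of the plot.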
Definition 3.2 A good stem and leaf plot should include a legend or key which helps us interpret the plot. R does this as we see that it states The decimal point is at the |
so that 22|9
can be quickly interpreted as 22.9.
Example 3.11 (Using a Stem and Leaf Legend) To see how to interpret this in general, we also look at mtcars$disp
.
mtcars$disp
## [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8 275.8 275.8 472.0 460.0
## [17] 440.0 78.7 75.7 71.1 120.1 318.0 304.0 350.0 400.0 79.0 120.3 95.1 351.0 145.0 301.0 121.0
Knowing what values we are looking at, we can construct a stem and leaf plot.
stem( mtcars$disp )
##
## The decimal point is 2 digit(s) to the right of the |
##
## 0 | 7888
## 1 | 012224
## 1 | 556677
## 2 | 3
## 2 | 6888
## 3 | 002
## 3 | 5566
## 4 | 04
## 4 | 67
Knowing that the first value in the dataset is 160.0 we look for this value in the plot. The legend says The decimal point is 2 digit(s) to the right of the |
. This means that the stem of 1 relates to values between 100 and 199 and the leaves indicate the value in the tens place once rounded to the nearest ten, if needed. That is, we expect 160.0 to be shown as 1|6
and we indeed see a leaf of 6 after the (second) stem of 1. The max value in mtcars$disp
is 472.0 which would round to 470 which therefore appears as the pair 4|7
.
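The legend can be turned into a small calculation. The sketch below only illustrates the rounding described above; it is not the internal code of stem, and the helper name stem_leaf_tens is hypothetical. It rounds a value to the nearest ten and then splits the result into a hundreds digit (the stem) and a tens digit (the leaf).

```r
# Hypothetical helper for a legend of "2 digit(s) to the right of the |"
stem_leaf_tens <- function( value ) {
  tens <- floor( value / 10 + 0.5 )       # the value rounded to the nearest ten, counted in tens
  paste0( tens %/% 10, "|", tens %% 10 )  # hundreds digit | tens digit
}

stem_leaf_tens( 160.0 )  # "1|6"
stem_leaf_tens( 472.0 )  # "4|7"
```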
To make a stem and leaf plot of a vector x
, you use the code:
stem( x ) #Will only run if x is defined
or if we have a column of a dataframe, we would use:
stem( df$Col ) #Will only run if df is appropriately defined
Example 3.12 (The Width Parameter) The width
parameter defaults to 80 which is usually suitable for most plots. However, the user can set the width
parameter to change the number of leaves we see on each row. For example, we can use the built-in dataset ChickWeight
. We begin by looking at the structure of ChickWeight
.
str( ChickWeight )
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 578 obs. of 4 variables:
## $ weight: num 42 51 59 64 76 93 106 125 149 171 ...
## $ Time : num 0 2 4 6 8 10 12 14 16 18 ...
## $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
## $ Diet : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "formula")=Class 'formula' language weight ~ Time | Chick
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "outer")=Class 'formula' language ~Diet
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "labels")=List of 2
## ..$ x: chr "Time"
## ..$ y: chr "Body weight"
## - attr(*, "units")=List of 2
## ..$ x: chr "(days)"
## ..$ y: chr "(gm)"
We see that ChickWeight
has a very complex structure! We will focus our attention on the weight
variable and look at a stem and leaf plot of it, setting width = 120
.
stem( ChickWeight$weight, width = 120 )
##
## The decimal point is 1 digit(s) to the right of the |
##
## 2 | 599999999
## 4 | 000001111111111111111111122222222222222233334566788888888999999999999900000011111111222233333444555556666777
## 6 | 0011111112222222233333444445555566667777788888890011111122222233333444444446667778889999
## 8 | 00112223344444455555566777788999990001223333566666788888889
## 10 | 0000111122233333334566667778889901122223445555667789
## 12 | 00002223333344445555667788890113444555566788889
## 14 | 11123444455556666677788890011234444555666777777789
## 16 | 00002233334444466788990000134445555789
## 18 | 12244444555677782225677778889999
## 20 | 0123444555557900245578
## 22 | 0012357701123344556788
## 24 | 08001699
## 26 | 12344569259
## 28 | 01780145
## 30 | 355798
## 32 | 12712
## 34 | 1
## 36 | 13
This produces a plot that (likely) scrolls your screen to display the whole thing. We can compare this to the default settings.
stem( ChickWeight$weight )
##
## The decimal point is 1 digit(s) to the right of the |
##
## 2 | 599999999
## 4 | 00000111111111111111111112222222222222223333456678888888899999999999+38
## 6 | 00111111122222222333334444455555666677777888888900111111222222333334+8
## 8 | 00112223344444455555566777788999990001223333566666788888889
## 10 | 0000111122233333334566667778889901122223445555667789
## 12 | 00002223333344445555667788890113444555566788889
## 14 | 11123444455556666677788890011234444555666777777789
## 16 | 00002233334444466788990000134445555789
## 18 | 12244444555677782225677778889999
## 20 | 0123444555557900245578
## 22 | 0012357701123344556788
## 24 | 08001699
## 26 | 12344569259
## 28 | 01780145
## 30 | 355798
## 32 | 12712
## 34 | 1
## 36 | 13
We see on the stem of 4 that we have so many leaves that the final 38 values are simply recorded as +38
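at the end of the line. Another way to keep every leaf visible is to set scale = 2, which splits each stem in two so that no row is truncated.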
stem( ChickWeight$weight, scale = 2 )
##
## The decimal point is 1 digit(s) to the right of the |
##
## 3 | 599999999
## 4 | 00000111111111111111111112222222222222223333456678888888899999999999
## 5 | 000000111111112222333334445555566667778888899999
## 6 | 001111111222222223333344444555556666777778888889
## 7 | 0011111122222233333444444446667778889999
## 8 | 0011222334444445555556677778899999
## 9 | 0001223333566666788888889
## 10 | 00001111222333333345666677788899
## 11 | 01122223445555667789
## 12 | 0000222333334444555566778889
## 13 | 0113444555566788889
## 14 | 1112344445555666667778889
## 15 | 0011234444555666777777789
## 16 | 0000223333444446678899
## 17 | 0000134445555789
## 18 | 1224444455567778
## 19 | 2225677778889999
## 20 | 01234445555579
## 21 | 00245578
## 22 | 00123577
## 23 | 01123344556788
## 24 | 08
## 25 | 001699
## 26 | 12344569
## 27 | 259
## 28 | 0178
## 29 | 0145
## 30 | 35579
## 31 | 8
## 32 | 127
## 33 | 12
## 34 | 1
## 35 |
## 36 | 1
## 37 | 3
Remark. While the user is certainly at liberty to change the values of scale
and width
, here at Statypus we encourage you to do so only if the default stem and leaf plot has some shortcoming that you are looking to overcome.
3.4 Skew
In this section we will discuss the Skew of a distribution. However, before we begin, it is important to point out the distinction between the histograms we have been looking at and the underlying shape of the distribution. This concept is examined in more detail in Section 6.5.1, but we will boil that section down to the figure below dealing with the weight (in ounces) of Snickers bars.

Figure 3.27: Histograms and Underlying Distribution
When we discuss the shape of a distribution, we are really discussing the shape of the underlying “density” curve associated to it. (See Section 6.5.1 if you want to jump down that rabbit hole early.) That is to say, the entire population of Snickers bars (past, present, and future) is shown by the blue curve. While this population will never be truly infinite, it is so large that we do expect to see more “curve” to the data than flat boxes. The gray boxes are (as we just saw in Section 3.3.1) the histogram of a finite sample of Snickers bars and it is fairly clear that no finite number of Snickers bars or boxes in a histogram could produce a truly “smooth curve” like we see in blue above. However, this is the model that we use to describe the distribution.
However, not all data is as symmetric as the distribution of Snickers bars we see above. For example, the following plot shows data similar to the distribution of U.S. household incomes.

Figure 3.28: A Non-symmetric Distribution
In 2024, the middle (this is the median which we will define in Section 4.1.2 if you want to take a look) U.S. household income was approximately $80,000 and that is represented by the vertical red line in the above plot. At least in a simplistic view, it is not possible to have a negative income, so our values are “boxed” into the positive (technically non-negative) region. This means there is only a region of $80,000 to the left of the red line, but it is clearly possible to make more than $80,000 above the red line or even multiple times that. For example, a family that has an annual income of $400,000 would sit at the far right of the part of the graph shown. Any family making more than $400,000 a year would literally be “off the chart” and is not shown.
Both of these distributions have a single “peak” to them which is sometimes called the “mode” of the distribution. Here at Statypus, we have chosen to avoid diving into the technicalities of the mode or modes of a distribution and will simply discuss modes as being the “peaks” of distributions like the blue curves shown above.
Therefore we will say that a Unimodal Distribution is a distribution where there is only a single peak on the graph. An example of a distribution that is not unimodal is shown below. Since it has two peaks, it is common to call this a Bimodal Distribution, but we will not spend any time getting into those weeds here.

Figure 3.29: A Bimodal Distribution
Definition 3.3 While a completely technical definition will not be given here, we will say that a unimodal distribution is Skewed Right if the distribution takes on values further to the right of the peak of the graph than it does to the left.
Similarly, a unimodal distribution is said to be Skewed Left if it takes on values further to the left of the peak of the graph than it does to the right.
Remark. We will also use the notation of Right Skewed and Left Skewed when it seems appropriate.
The distribution shown in Figure 3.28 is an example of a right skewed distribution.
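A rough numerical check of Definition 3.3 is to compare how far the data reach on each side of the peak. The sketch below uses simulated data, which is an assumption made purely for illustration and not a dataset from this chapter. It also uses a hypothetical helper, reach, which takes the midpoint of the tallest histogram bin as the peak.

```r
set.seed( 1 )  # makes the simulated values reproducible

right_skewed <- rexp( 1000 )       # simulated non-negative values with a long right tail
left_skewed  <- 10 - right_skewed  # mirror image, so the long tail is on the left

# Hypothetical helper: distance from the tallest histogram bin to each end of the data
reach <- function( x ) {
  h <- hist( x, plot = FALSE )
  peak <- h$mids[ which.max( h$counts ) ]
  c( left = peak - min( x ), right = max( x ) - peak )
}

reach( right_skewed )  # the right reach is much larger than the left reach
reach( left_skewed )   # the left reach is much larger than the right reach
```

Running hist( right_skewed ) and hist( left_skewed ) shows the same thing as a picture: a tall peak at one end and a long tail trailing off toward the other.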
Example 3.13 There are a lot of distributions that are right skewed. A lot of variables, such as the household income discussed earlier, cannot take on negative values and thus can stretch further to the right than they can to the left.
Non-contrived left skewed data is actually rarer. However, we offer a unique example here. We begin by downloading another dataset about births, but this one is much larger than BabyData1
. We will call this new dataset BabyData3
28
BabyData3 <- read.csv( "https://statypus.org/files/BigBabyData.csv" )
We begin by looking at the structure29 of the data.
str( BabyData3[ , 1:9 ] )
## 'data.frame': 101400 obs. of 9 variables:
## $ SEX : int 2 2 2 1 1 1 1 2 2 1 ...
## $ MARITAL: int 1 2 1 1 2 1 2 2 1 1 ...
## $ FAGE : int 33 19 33 25 21 21 29 23 27 30 ...
## $ MAGE : int 34 18 31 28 20 21 32 21 26 22 ...
## $ GAINED : int 26 40 16 40 60 30 20 41 0 30 ...
## $ VISITS : int 10 10 14 15 13 15 11 15 12 10 ...
## $ FEDUC : int 12 11 16 12 12 12 6 13 10 12 ...
## $ MEDUC : int 4 12 16 12 14 13 6 13 13 14 ...
## $ WEEKS : int 35 41 39 38 40 42 39 41 38 39 ...
This shows that BabyData3
contains data on over 100,000 different births. We will further investigate BabyData3
in Chapter 14, but will focus on only two variables here: FAGE
and MAGE. They represent the age of the baby’s father and mother, respectively.
Consider the following histogram.
data <- BabyData3$MAGE[ BabyData3$FAGE == 40 ]
bin_width <- 1
#Do not change any values below this comment
hist(
  data, freq = TRUE,
  breaks = seq(
    from = min( data, na.rm = TRUE ) - 0.5,
    to = max( data, na.rm = TRUE ) + bin_width - 0.5,
    by = bin_width
  )
)
abline( v = median( data, na.rm = TRUE ), col = "red" ) #Draws the red line at the median, ignoring any missing ages

Figure 3.30: Left Skewed Distribution
What we see here is a histogram of the ages of mothers of babies whose father was 40 at the time of their birth. The red line indicates the middle value (again the median) of ages of the mothers.
This seems to say something about 40 year old men…
We will see how the concept of skew affects the statistics we get from samples in Section 4.5.
We can visualize skew with an exploration.
Exercises
Remark. For Exercise 3.2, load the HistData
package. See Example 1.8 in Section 1.8 for instructions for installing HistData
if you do not have it on your machine. If you already have it installed, you should only have to run library( HistData )
.
For Exercise 3.3, you will need to load the dataset Asylum1849
. The code for this can be found in the Now it’s your turn! which appears immediately following Example 3.1.
Exercise 3.1 Would a bar plot or histogram be the appropriate graphical display for the values in mtcars$wt
? Explain.
Exercise 3.2 Consider the dataset DrinksWages
in the HistData
package. Run ?DrinksWages
and look at the meaning of the class
column. Make a table and relative frequency bar plot of this variable. Which class is the least represented and what proportion of all observations does it represent?
Exercise 3.3 Create a Pareto chart for the married.single
variable in the Asylum1849
data frame.
Exercise 3.4 The data set ToothGrowth
deals with tooth growth in guinea pigs. Make a histogram of the values in the len
column for only those guinea pigs who received orange juice.
Exercise 3.5 Again using the ToothGrowth
data set, make a histogram of the values in the len
column for only those guinea pigs who received ascorbic acid.
Exercise 3.6 Based on the results of Exercises 3.4 and 3.5, can you see any difference in the value of len
based on the delivery method used?
Exercise 3.7 Using ToothGrowth
, give a histogram of the values in len
for each of the different dose levels; there should be three histograms in total. Does it appear the dose level has an effect on the len
variable?
Exercise 3.8 In Section 3.3.3 there was a discrepancy between some of the stems and the associated bins in the histogram of the data mtcars$qsec
. List out the values in the stems and compare them to the values in mtcars$qsec
. What is causing the discrepancy between the histogram and stem and leaf plot?
Fauna of Australia. Vol. 1b. Australian Biological Resources Study (ABRS)
To be precise, PlatypusData1 is a part of the dataset titled ID_Pop_Platypus.csv which can be found on GitHub.
We will dive into this much more later, but the curious reader can see the note following Example 4.16 in Section 4.4.1 or the footnote in Example 5.8 in Section 5.4.
The curious reader may wonder what happened to BabyData2 or if even such a dataset exists. It does, but we won’t see that dataset until Chapter 14 where we investigate all three of the baby datasets.
The observant reader may be curious as to the reason for the option 1:9 in the following line of code. The entire dataframe actually contains 37 variables for each observation and we included the option 1:9 to only have R output the first 9 variables.