Chapter 3 Using Graphs to Understand Data

Figure 3.1: A Baby Statypus Learns to Write their Name

The platypus is a mammal that lays eggs. Yes, that is a thing. The platypus and the echidna are the only mammals that lay eggs.³¹

Raw data rarely reveals useful information on its own. The concept of variability is central to sound statistical thinking, and our goal is to understand how values in a population vary. One of the simplest tools available to someone working with data is to draw a picture. In this chapter, we will learn how to use RStudio to generate different graphical displays of data such as bar plots, histograms, and scatter plots. Graphs are a literal picture of the variability of data and are a fundamental way that most people interact with data. Experienced users can quickly glean information about a distribution by looking at an appropriate graph and can see patterns that may be difficult to see numerically. This is our first reason why it is important to be able to leverage the power of the R software.

New R Functions Used

All functions listed below have a help file built into RStudio. Most of these functions have options which are not fully covered or not mentioned in this chapter. To access more information about other functionality and uses for these functions, use the ? command. For example, to see the help file for barplot, you can run ?barplot or ?barplot() in either an R Script or in your Console.

We will see the following functions in Chapter 3.

table(): Uses cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
proportions(): Returns conditional proportions given entries of x divided by the appropriate sum(s).
barplot(): Creates a bar plot with vertical or horizontal bars.
hist(): Computes a histogram of the given data values.
stem(): Produces a stem-and-leaf plot of the values in x.

3.1 Graphing Qualitative Data

We recall that Qualitative Data differs from Quantitative Data most fundamentally because it is impossible, or illogical, to perform arithmetic operations on qualitative data. Thus, all we can really analyze is how frequently different values of the variable occur, and we can visualize this with graphs.

3.1.1 Making Tables

The first tool we will use is not technically a graph, but will be needed for some of our basic graphs. To get started, we create a simple example to analyze. The following code chunk creates a random vector of colors using the sample function we saw in Chapter 2.

colorsList <- c( "Blue", "Orange", "Green", "Yellow", "Red", "Brown" )
colors <- sample( x = colorsList, size = 30, replace = TRUE )
colors

##  [1] "Yellow" "Orange" "Brown"  "Red"    "Yellow" "Blue"   "Red"    "Brown" 
##  [9] "Yellow" "Orange" "Brown"  "Orange" "Brown"  "Brown"  "Yellow" "Brown" 
## [17] "Brown"  "Brown"  "Yellow" "Yellow" "Red"    "Yellow" "Green"  "Yellow"
## [25] "Red"    "Orange" "Red"    "Orange" "Brown"  "Green"

The above is the “raw” data and can be used in many ways. The simplest is to look at how many times each color was selected, that is, the frequency of each color. This will create the frequency table of our dataset. To do this, we introduce the table() function.

The syntax of table is

table( x, exclude, useNA )

where the arguments are:

exclude: Can be used to exclude certain values of a vector from the table.
useNA: Used to determine whether to include NA values in the table. The choices are "no", "ifany", and "always".

We can now tabulate the vector colors that we defined above. We will omit the exclude and useNA arguments for now.

table( colors )

## colors
##   Blue  Brown  Green Orange    Red Yellow 
##      1      9      2      5      5      8

This table gives each outcome with its frequency. It says that we selected Blue only once, Brown a total of 9 times, and so on.
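Because a table stores its counts under the names of the outcomes, individual frequencies can be pulled out by name. The sketch below uses a small hand-typed vector (the name demoColors and its values are assumptions made for illustration, not the random colors vector above) so that the results are the same every time it is run.

```r
# A small fixed vector, used here only for illustration
demoColors <- c( "Red", "Blue", "Red", "Green", "Red", "Blue" )

# Build the frequency table
demoTable <- table( demoColors )
demoTable

# Extract the count for a single outcome by its name
demoTable[ "Red" ]

# The counts always add up to the number of observations
sum( demoTable ) == length( demoColors )
```

The last line should return TRUE for any vector that has no missing values, which makes it a quick sanity check on a table.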

Example 3.1 We will not use the arguments above very often, but the argument useNA can be used to control whether or not to include a column letting you know how many values were missing in the dataset. In most cases, useNA is set to "no" by default. To see how "ifany" and "always" differ, we first define a simple character vector, $x$, as follows

x <- c("a", "a", "a", "b", "b", NA)

then we can get the table below.

table( x )

## x
## a b 
## 3 2

This table omits the last NA value. To see it in the table, we include either useNA = "ifany" or useNA = "always".

table( x, useNA = "ifany" )

## x
##    a    b <NA> 
##    3    2    1

To see the difference between "ifany" and "always", we note that if we used "ifany" for a colors table we would not change the table we obtained above.

table( colors, useNA = "ifany")

## colors
##   Blue  Brown  Green Orange    Red Yellow 
##      1      9      2      5      5      8

However, if we used "always" for a colors table, we would get the following.

table( colors, useNA = "always")

## colors
##   Blue  Brown  Green Orange    Red Yellow   <NA> 
##      1      9      2      5      5      8      0

The argument exclude does exactly what it says it does: it excludes certain values. For example, we can exclude just the values of Yellow from colors using

table( colors, exclude = "Yellow" )

## colors
##   Blue  Brown  Green Orange    Red 
##      1      9      2      5      5

or exclude all of the primary colors with

table( colors, exclude = c( "Red", "Blue", "Yellow" ) )

## colors
##  Brown  Green Orange 
##      9      2      5

Raw data often contains missing values that may not be labeled as NA or even the same way across a dataset. It can be necessary to know how to implement certain arguments of functions to handle such cases. Here, we see that exclude can be used to remove different types of “missing” values which may appear as NA, NULL, none, or something else altogether.
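As a minimal sketch of this idea, the code below builds a hypothetical vector of survey responses (the name responses and its values are invented for illustration) in which missing answers appear as NA, as the word "none", and as an empty string. Listing all three markers in exclude removes them from the table in one step.

```r
# Hypothetical survey responses with three different "missing" markers
responses <- c( "yes", "no", "none", NA, "yes", "", "no", "yes" )

# By default only NA is dropped; "none" and "" are counted as outcomes
table( responses )

# List every marker that should be treated as missing
cleanTable <- table( responses, exclude = c( NA, "none", "" ) )
cleanTable
```

NA is included in the exclude vector on purpose, so that true NA values are dropped along with the other markers.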

Load the dataframe called Asylum1849.

Asylum1849 <- read.csv("https://statypus.org/files/Asylum1849.csv")

We saw this data frame in Chapter 2 where we mentioned that it contains intake information for the Meerenberg Insane Asylum in Brederodelaan, Santpoort, The Netherlands.

Look at str( Asylum1849 ) and look at the variables contained in the data. Which variables are qualitative? Which ones are sensible to be represented by a table? Make a table for the appropriate variables.

3.1.2 Bar Plots

Our main tool for graphing qualitative data is a bar plot (sometimes called a bar graph or bar chart). We can create a bar plot using the function barplot().

The syntax of barplot is

barplot( height )

where the argument we use here is

height: The output of a table.

There are many options/arguments that can be used with graphical functions to make better looking graphs. The view taken by Statypus is that it is fairly easy to make R produce graphs, but making publishable level graphs may take considerable effort and/or skill.

Further, most R professionals would use an advanced graphical package such as ggplot2 to produce better graphical summaries of data.

To see the way barplot will work, we will create a vector called rolls and attempt to make a bar plot of it. The following is a random collection of rolls of a six sided die.

rolls <- c( 1, 4, 6, 1, 6, 6, 6, 6, 1, 6, 5, 3, 5, 2, 4, 2, 6, 6, 1, 6, 3, 5, 2, 5, 1 )

If we try to use barplot directly with the vector rolls, we get a very weird chart as we see below.

barplot( rolls )

Figure 3.2: Bad Bar Plot of rolls

This is not a proper bar plot because each bar shows the value of an individual roll rather than how often a value occurred. Looking back at rolls, the first five values were $1,4,6,1, \text{and } 6$, which correspond exactly with the heights of the first five bars in the plot above. The barplot function must take in an argument known as height, a vector giving the height of each bar. For a frequency bar plot these heights are the counts of each outcome, which is exactly what the table function produces. So, to use barplot with raw data, we simply run table on our raw data and pass that output into the barplot function. We can now use barplot to get a visualization of rolls.

barplot( table( rolls ) )

Figure 3.3: Better Bar Plot of rolls

This gives exactly the same information as table does, as we show below, but does it in a graphical way.

table( rolls )

## rolls
## 1 2 3 4 5 6 
## 5 3 2 2 4 9

Unless you are working with already tabulated data, that is data already in the form of a table, you will need to pair barplot with table.

You may have noticed that rolls is actually quantitative data, which is why barplot was able to draw it directly. If we had attempted to use barplot on raw qualitative data, it would throw an error. Feel free to try and run barplot on colors without pairing it with table.

If we wanted to get the proportion, or relative frequency, of each roll and not the frequency, we can inject the R function proportions() as below:

barplot( proportions( table( rolls ) ) )

Figure 3.4: Relative Frequency Bar Plot of rolls

We still need to include table to get the relative frequency bar plot.
Frequency and relative frequency bar plots of the same data are identical except for the labels on the vertical axis.
The proportions function can be used with table with or without the barplot function. The composition proportions( table( x ) ) will produce a relative frequency table.
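To make the relative frequency table concrete, the sketch below applies proportions( table( ) ) to the rolls vector from above and checks the result against a calculation by hand. It assumes R version 4.0.1 or later, where proportions() is available; in older versions the same job is done by prop.table().

```r
rolls <- c( 1, 4, 6, 1, 6, 6, 6, 6, 1, 6, 5, 3, 5, 2, 4, 2, 6, 6, 1, 6, 3, 5, 2, 5, 1 )

# Relative frequency table: each count divided by the total number of rolls
relFreq <- proportions( table( rolls ) )
relFreq

# The same values computed by hand
table( rolls ) / length( rolls )

# Relative frequencies always sum to 1
sum( relFreq )
```

For instance, the value 6 was rolled 9 times out of 25, so its relative frequency is 9/25 = 0.36.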

We summarize this with the following:

If you have a vector, x, or column of a dataframe, df$Col, of values of a qualitative variable, we can make a bar plot of its frequencies via:

#Will only run if x is defined
barplot( table( x ) )

barplot( table( df$Col ) )          #Will only run if df is appropriately defined

If we want relative frequencies instead, we use the following code:

barplot( proportions( table( x ) ) )    #Will only run if x is defined

barplot( proportions( table( df$Col ) ) )     #Will only run if df is appropriately defined

Example 3.2 (Type of Transmission) Let’s look at the built in dataset mtcars. Use ?mtcars for more information about this dataset.

head( mtcars )

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

and look at the column mtcars$am which gives the type of transmission of each car in the dataframe where 0 = automatic and 1 = manual. This variable was also looked at in Exercise 1.11 where we saw how to convert it into a variable type that looked more qualitative. We can create a frequency bar plot for this variable with:

barplot( table( mtcars$am ) )

Figure 3.5: Frequency Bar Plot of Transmission Type

or a relative frequency bar plot via:

barplot( proportions( table( mtcars$am ) ) )

Figure 3.6: Relative Frequency Bar Plot of Transmission Type

We could also use the following three lines of code in which the second and third lines are directly copied from above.

x <- mtcars$am
barplot( table( x ) )
barplot( proportions( table( x ) ) )

This method of defining x and then using the stock code may be preferred by some readers. However, at this point we will be attempting to give single lines of code that work.

If you have not installed the package HistData, refer to Section 1.8 and do so. Load the package using the following line of code or selecting the check box next to HistData in the Packages tab of the lower right pane of RStudio.

library(HistData)

Look at dataset DrinkWages and look at ?DrinkWages to learn a bit about the data. Make a bar plot showing the frequency of each wage class in the dataframe. Write full sentences explaining what the bar plot shows.

Definition 3.1 A Pareto Chart is a bar plot where the bars are ordered in descending order from left to right. A Pareto Chart can be stated in terms of either observed frequency or relative frequency.

To get a Pareto chart of a qualitative variable, use the following code.

#Will only run if x is defined
barplot( sort( table( x ), decreasing = TRUE ) )

barplot( sort( table( df$Col ), decreasing = TRUE ) )
#Will only run if df is appropriately defined

You can also include the function proportions in between sort and table to make the Pareto chart display relative frequencies.
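As a sketch of that variation, the code below nests proportions between sort and table for the cyl column of the built in mtcars dataset. The object name cylPareto is only an illustrative choice.

```r
# Relative frequencies of the number of cylinders, largest first
cylPareto <- sort( proportions( table( mtcars$cyl ) ), decreasing = TRUE )
cylPareto

# Pareto chart on the relative frequency scale
barplot( cylPareto )
```

Sorting after taking proportions gives the same bar order as sorting the raw counts, since dividing every count by the same total does not change which outcome is largest.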

Example 3.3 We can use the Bulls1996 data frame we introduced in Example 2.16. We begin by offering the code to download it if you do not have it on your machine.

Bulls1996 <- read.csv( "https://statypus.org/files/ChicagoBulls1996.csv")

We remind ourselves what data is contained in Bulls1996 by running str on it.

str( Bulls1996 )

## 'data.frame':    15 obs. of  8 variables:
##  $ No.       : int  0 30 35 53 54 9 23 25 7 13 ...
##  $ Player    : chr  "Randy Brown" "Jud Buechler" "Jason Caffey" "James Edwards" ...
##  $ Pos       : chr  "PG" "SF" "PF" "C" ...
##  $ Ht        : int  74 78 80 84 82 78 78 75 82 86 ...
##  $ Wt        : int  190 220 255 225 240 185 198 175 192 265 ...
##  $ Birth.Date: chr  "May 22, 1968" "June 19, 1968" "June 12, 1973" "November 22, 1955" ...
##  $ Exp       : int  4 5 0 18 6 9 10 7 2 4 ...
##  $ College   : chr  "Houston, New Mexico State" "Arizona" "Alabama" "Washington" ...

We can give an example of a Pareto chart by using the Pos variable, which gives the primary position that each player played.

barplot( sort( table( Bulls1996$Pos ), decreasing = TRUE ) )

Figure 3.7: Pareto Chart of 1996 Bulls Positions

3.2 Comparing Qualitative Variables

Let’s return to the mtcars dataframe and see how the number of cylinders a car has is related to the type of transmission it has. Using ?mtcars lets us know that a value of 0 in the am column means that the car has an automatic transmission while a value of 1 indicates it is a manual transmission. We can basically redo the work in Section 3.1 using the exact same functions.

3.2.1 Two Way Tables

To get to a Comparative Bar Plot, which we will introduce in Section 3.2.2, we first look at a two way table using table again. Using ?table and looking at the Arguments section shows that the main input, listed as ..., is “one or more objects which can be interpreted as factors.” We can thus pass two different columns of mtcars to table as below:

table( mtcars$am , mtcars$cyl )

##    
##      4  6  8
##   0  3  4 12
##   1  8  3  2

This table says that there are 3 cars which have an automatic transmission as well as 4 cylinders, 4 cars which have an automatic transmission as well as 6 cylinders, and so on.

We could also reverse the order of the columns in table to get:

table( mtcars$cyl , mtcars$am )

##    
##      0  1
##   4  3  8
##   6  4  3
##   8 12  2

This shows us that the entries in table behave like the options used with a dataframe via df[ , ]. The first entry controls the rows and the second entry controls the columns. Making a good graph may require the user to test which order produces the most understandable graph.
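The bare 0/1 and 4/6/8 labels in a two way table can be hard to read. As an optional sketch, the dnn argument of table attaches a name to each dimension, addmargins appends row and column totals, and the margin argument of proportions gives proportions within each row. The object name cylByAm is an assumption made for this example.

```r
# Label the two dimensions so the table documents itself
cylByAm <- table( mtcars$am, mtcars$cyl, dnn = c( "am", "cyl" ) )
cylByAm

# Append row and column totals
addmargins( cylByAm )

# Proportions within each row, that is, within each transmission type
proportions( cylByAm, margin = 1 )
```

With margin = 1 each row sums to 1, so the first row describes how the automatic cars are split across 4, 6, and 8 cylinders.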

Refer back to the “Now it’s your turn!” example from Section 3.1.1 and use the code there to load the dataframe called Asylum1849 if necessary. Choose a couple of pairs of qualitative variables and construct two way tables to investigate the relationship between the variables. Can you see any patterns?

We will not be able to make an explicit decision if two variables are related until we get the machinery introduced in Chapter 11. For now, we can simply comment on patterns that appear to exist based on the tables we are able to create.

3.2.2 Comparative Bar Plot

Now that we are comfortable creating two way tables, we can proceed to visualizations of them. One way to visualize two way tables is to use a comparative bar plot. We can do this with the barplot function.

The new syntax for barplot is

barplot( height, beside, legend.text )

where the arguments we use here are

height: Now the output of a two way table.
beside: If FALSE or F, the columns of height are portrayed as stacked bars, and if TRUE or T the columns are portrayed as juxtaposed bars.
legend.text: Used to give more information to make our graphs more understandable.

Like before, barplot needs to be fed tabulated values, so you will need to also use table if working with raw data.

Returning to the example we saw in Section 3.2.1, we can use barplot to achieve the following.

barplot( table( mtcars$cyl , mtcars$am ) , beside = TRUE , legend.text = TRUE )

Figure 3.8: Comparative Bar Plot of cyl and am from mtcars Version 1

As mentioned before, we could reverse the roles of cyl and am to get a different way of displaying the same information.

barplot( table( mtcars$am , mtcars$cyl ), beside = TRUE, legend.text = TRUE )

Figure 3.9: Comparative Bar Plot of cyl and am from mtcars Version 2

Here we are using beside = TRUE to make sure the bars are side by side, but we also show what beside = FALSE looks like below.

barplot( table( mtcars$am, mtcars$cyl ), beside = FALSE, legend.text = TRUE )

Figure 3.10: Comparative Bar Plot of cyl and am from mtcars Version 3

It is important to remember that although the graphs may appear different when we change the arguments of barplot, the actual data has not changed and the differences are purely aesthetic. It is the job of the reader to always be a critical consumer and make sure they understand the graph they are viewing.
As mentioned before, there are plenty of options available to make the above graphs better and avoid things like the awkward legend position above. However, this document will focus on trying to produce simple graphs.
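For readers who want a taste of those options, here is one possible sketch. It recodes am with descriptive labels so the legend is readable and uses the args.legend argument of barplot, which passes extra settings on to the legend, to move the legend to the top left. The names amLabel and amByCyl and the axis titles are choices made for this illustration, and the legend position may still need adjusting for other data.

```r
# Recode 0/1 as descriptive labels (amLabel is a name chosen for this sketch)
amLabel <- factor( mtcars$am, levels = c( 0, 1 ), labels = c( "Automatic", "Manual" ) )
amByCyl <- table( amLabel, mtcars$cyl )

# args.legend passes extra arguments to legend(); here it moves the legend
barplot( amByCyl, beside = TRUE, legend.text = TRUE,
         args.legend = list( x = "topleft" ),
         xlab = "Number of cylinders", ylab = "Frequency" )
```

Because legend.text = TRUE takes its labels from the row names of the table, recoding the variable before tabulating is enough to fix the legend text.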

We summarize the work of this section as well as Section 3.2.1 with the following code templates.

To get a Two Way Table of counts based on the columns Col1 and Col2 of a dataframe, df we use the code:

#Will only run if df is appropriately defined
table( df$Col1, df$Col2 )

To get a Comparative Bar Plot of counts based on the columns Col1 and Col2 of a dataframe, df we use the code:

#Will only run if df is appropriately defined
barplot( table( df$Col1, df$Col2 ), beside = TRUE, legend.text = TRUE )

Make a comparative bar plot of the number of gears a car had based on the shape of the engine. The two variables are gear which gives the number of forward gears a car had and vs which tells whether an engine was V-shaped (value of 0) or straight (value of 1). Try changing the order of the variables and playing with the beside argument. Can you see a relationship between the variables?

3.3 Graphing Quantitative Data

Quantitative data differs from qualitative data in the sense that it is always at least at the interval level, and thus we can separate data into equally sized groups or “bins.” Doing so allows us to plot the frequency of not just individual values, but intervals of values that are close. For example, we may group all cars that have between 20 and 25 miles per gallon fuel efficiency. This type of graph is called a histogram and can be used for both continuous and discrete data types (see Section 2.2.4 for the details on the distinction).

3.3.1 Histograms for Continuous Data

We will investigate using a histogram for continuous data by using a new variable in mtcars. The variable mpg in the mtcars dataframe gives the fuel efficiency of each car in miles per gallon. We can make a histogram using the R function hist().

The syntax of hist is

hist( x, right, breaks, freq )

where the arguments are:

right: By default, right = TRUE and the histogram cells are intervals of the form $(a, b]$, i.e., they include their right-hand endpoint, but not their left one. If the user specifies right = FALSE, the intervals are of the form $[a, b)$, which is the convention in many other introductory statistics textbooks.
breaks: This controls the number of bins being used in the histogram. The simplest way to control this is by setting the argument equal to an integer which R will use to give the number of equal spaced breaks used.
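freq: (Usually) By default, freq = TRUE and the histogram graphic is a representation of frequencies. If the user specifies freq = FALSE, the vertical axis is rescaled to represent relative frequencies. Strictly speaking, R then plots densities, so it is the area of each bar (its height times its bin width) that equals the relative frequency of the bin.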

R attempts to give the best histogram it can by default, but the user can control nearly every aspect of it.
R may not use the exact number of cells you specify in breaks and there are more direct tools to have more control, but this document does not go into those other than the ones in Section 3.3.2.

We can now make a histogram of the fuel efficiency of cars in the mtcars dataframe by using the following.

hist( mtcars$mpg )

Figure 3.11: Histogram of Fuel Efficiency

From here we can see that there were 8 vehicles that had a fuel efficiency between 20 and 25 miles per gallon. To be precise, there were 8 vehicles in mtcars whose value in the column mpg was in the interval $(20,25]$. If a car had a fuel efficiency of exactly 20 mpg, it would fall in the bin given by $(15,20]$. As mentioned in the remark above, it is possible to use the breaks option to control the values of the bins directly, but a package such as ggplot2 is likely the better choice for anyone wanting to create graphs that deviate too far from the default graphics produced by R.
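The counts read off the histogram can be checked numerically. In the sketch below, breaks is given as a vector of bin boundaries rather than a single number, and plot = FALSE asks hist to return its calculations without drawing anything. The boundaries 10, 15, ..., 35 are chosen here to cover the range of mpg, and the name mpgHist is only illustrative.

```r
# plot = FALSE returns the histogram's numbers without drawing it
mpgHist <- hist( mtcars$mpg, breaks = seq( 10, 35, by = 5 ), plot = FALSE )

mpgHist$breaks   # bin boundaries: 10 15 20 25 30 35
mpgHist$counts   # bin counts:      6 12  8  2  4

# Direct count of the cars with mpg in (20, 25]
sum( mtcars$mpg > 20 & mtcars$mpg <= 25 )   # 8
```

The third count, 8, is the number of cars in the interval $(20,25]$ discussed above.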

It is important to remember that any differences in plots produced are completely attributable to the selection of the arguments and do not mean there has been any “change” in the data itself. The manipulation of graphs is often done to make a graph tell the story that the creator wants it to. There is the famous quote which Mark Twain attributed to the British Prime Minister Benjamin Disraeli in Chapters from My Autobiography, published in the North American Review in 1907:

“There are three kinds of lies: lies, damned lies, and statistics.”³²

While slightly allegorical, it does point out that statistics can be used to deceive people, because fully understanding statistical graphs and calculations requires some subtlety.

Example 3.4 To get a feel for what the breaks argument does, we run a few examples of histograms with different values of breaks. To do this, we introduce a new dataset called PlatypusData1.

PlatypusData1 <- read.csv( "https://statypus.org/files/PlatypusData1.csv" )

As this is new data, we begin by looking at it with the functions str and head.

str( PlatypusData1 )

## 'data.frame':    218 obs. of  4 variables:
##  $ Weight       : num  1.12 0.94 NA 1.18 1.09 0.7 1.55 1 0.52 1.4 ...
##  $ Sex          : chr  "M" "F" "F" "M" ...
##  $ AgeClass     : chr  "SA" "A" "A" "A" ...
##  $ concentration: num  2.1 3.26 8.38 2.5 2.14 3.96 16.4 5.6 9.6 4.44 ...

This tells us that PlatypusData1 is a dataframe that consists of 218 observations, or individuals, each of which has 4 variables recorded for it. The variable Sex is a character string and we can see the possible values it has by running table.

table( PlatypusData1$Sex )

## 
##       F   M 
##   1 109 108

This shows that our dataset contains observations of 109 females and 108 males as well as a single observation where the sex is not recorded. The variable AgeClass is also of the character type and contains the age of the platypus as shown below.

table( PlatypusData1$AgeClass )

## 
##       A   J  SA 
##   1 167  46   4

Platypuses are not easily bred in captivity, and as such, exact ages are rarely known. Instead, this data classifies the platypuses as Juvenile (J), Adult (A), or Senior Adult (SA). We also see that there is a single observation that does not have a recorded value for this variable. The interested reader is welcome to examine the article that the data originates from³³, but we warn you that the article is quite scientifically dense. The variable concentration deals with the concentration of genomic DNA in a blood sample. We will examine and further discuss this variable later.

We will be focusing on the variable Weight which gives the mass³⁴ of each platypus.

To see what hist will do by default, we simply run the following code.

hist( PlatypusData1$Weight )

Figure 3.12: Histogram of Platypus Masses with Default breaks

Here we see that R uses 11 bins when making a histogram for PlatypusData1$Weight. We can try to force R to use only a single bin by setting breaks = 1:

hist( PlatypusData1$Weight, breaks = 1 )

Figure 3.13: Histogram of Platypus Masses with breaks = 1

Instead of a single bin, we get a single “break” and two different bins. To see if this is indeed how breaks work, we proceed to produce histograms with breaks = 2 and breaks = 3.

hist( PlatypusData1$Weight, breaks = 2 )

Figure 3.14: Histogram of Platypus Masses with breaks = 2

hist( PlatypusData1$Weight, breaks = 3 )

Figure 3.15: Histogram of Platypus Masses with breaks = 3

The histogram using breaks = 2 worked as we anticipated, but the histogram with breaks = 3 had 4 breaks and 5 bins. These examples show one thing: setting breaks to an integer only suggests a number of bins, and R will choose the number of bins it feels is best that is near the value we supply.

Investigating further, we can also increase the number of bins with

hist( PlatypusData1$Weight, breaks = 10 )

Figure 3.16: Histogram of Platypus Masses with breaks = 10

or even use more bins to get:

hist( PlatypusData1$Weight, breaks = 20 )

Figure 3.17: Histogram of Platypus Masses with breaks = 20

In all cases, the number of bins is near the value of breaks but fails to match in many cases. These examples reinforce our earlier remark that if the user wants more refined control, they will need to invest energy into learning a bit more coding or use advanced packages such as ggplot2.
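The help file for hist describes a single number given to breaks as a suggestion only, because the breakpoints are set to "pretty" values. One way to watch this happen without looking at each plot is sketched below. It uses mtcars$mpg so that nothing needs to be downloaded, and the names requested and actual are chosen just for this illustration.

```r
# Requested values of breaks to try
requested <- c( 1, 2, 3, 10, 20 )

# For each request, count the bins R actually used (plot = FALSE skips the drawing)
actual <- sapply( requested, function( k ) {
  length( hist( mtcars$mpg, breaks = k, plot = FALSE )$counts )
} )

# Compare the request with the result
data.frame( requested, actual )
```

The same comparison can be run on PlatypusData1$Weight by swapping it in for mtcars$mpg once that dataframe has been loaded.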

Example 3.5 To see what freq does, we recall what the default histogram of mtcars$mpg looked like:

hist( mtcars$mpg )

Figure 3.18: Frequency Histogram of Fuel Efficiency

We now can set freq = FALSE (or simply freq = F) to get:

hist( mtcars$mpg, freq = FALSE )

Figure 3.19: Relative Frequency Histogram of Fuel Efficiency

This shows that freq = FALSE gives a Relative Frequency Histogram. Setting freq = TRUE or simply omitting the freq argument (because TRUE is the default value) gives a standard Frequency Histogram. Like bar plots, the shape of a histogram does not change when we move from frequencies to relative frequencies; only the values along the vertical axis do. Note that R labels the vertical axis Density: the height of each bar is its relative frequency divided by the bin width, so the areas of the bars sum to 1.
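The link between the heights R draws when freq = FALSE and relative frequencies can be verified directly. The sketch below stores the histogram calculations (using plot = FALSE so nothing is drawn) and compares the stored density values with relative frequencies computed by hand. The names mpgHist, relFreq, and binWidth are illustrative.

```r
mpgHist <- hist( mtcars$mpg, plot = FALSE )

relFreq  <- mpgHist$counts / sum( mpgHist$counts )   # relative frequency of each bin
binWidth <- diff( mpgHist$breaks )                   # width of each bin

# The heights drawn when freq = FALSE are stored in $density
all.equal( mpgHist$density, relFreq / binWidth )

# Bar areas (height times width) recover the relative frequencies and sum to 1
sum( mpgHist$density * binWidth )
```

When every bin has width 1, the density and the relative frequency are the same number, which is why the distinction is easy to overlook.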

To make a basic histogram of a vector x, you can use the code:

#Will only run if x is defined
hist( x, freq = TRUE )

or if we have a column of a dataframe, we would use:

#Will only run if df is appropriately defined
hist( df$Col, freq = TRUE )

Here we set freq = FALSE if we want relative frequencies rather than just frequencies. We can leave freq = TRUE or omit freq altogether if we want regular frequencies.

Example 3.6 (Working With Only Part of a Dataframe) Sometimes, we want to look at only a portion of a dataset. For example, if we want to make a histogram of the masses of just the male platypuses, we need to find just the portion of PlatypusData1, introduced and imported in Example 3.4, that contains this data. Recall that we can find a subset of PlatypusData1 by using PlatypusData1[ R , C ] where R is a rule used to choose the rows we want and C is a rule for the columns we want to use. For R we will use the command PlatypusData1$Sex == "M" which will effectively select only those rows where the platypus is male. If we look at PlatypusData1[ PlatypusData1$Sex == "M" , ] we get the following:

head( PlatypusData1[ PlatypusData1$Sex == "M" , ] )

##    Weight Sex AgeClass concentration
## 1    1.12   M       SA          2.10
## 4    1.18   M        A          2.50
## 5    1.09   M        A          2.14
## 7    1.55   M        A         16.40
## 9    0.52   M        J          9.60
## 10   1.40   M        A          4.44

We have successfully selected just the rows of male platypuses. There are a couple of ways of progressing from here.

If we planned on doing more analysis of the male platypuses, it may be helpful to create a new dataframe of just these platypuses. We could do this via

MalePlatys <- PlatypusData1[ PlatypusData1$Sex == "M", ]

If you now view MalePlatys you should see the same output as when we ran PlatypusData1[ PlatypusData1$Sex == "M" , ]. However, we can now simply pull the column Weight out of this dataframe by running the code below.

MalePlatys$Weight

##   [1] 1.12 1.18 1.09 1.55 0.52 1.40 1.47 1.37 1.27 1.45 0.47 1.33 1.76 0.66 1.50
##  [16] 1.30 1.33 1.47 1.53 0.91 1.25 1.58 1.67 0.64 1.11 1.29 1.42 0.90 1.48 0.76
##  [31]   NA 1.53 1.20 1.52 1.48 1.60 1.36 1.56   NA 1.59 1.23   NA 1.10 1.21 1.62
##  [46] 1.68 0.90 1.79 1.10 1.43 1.47 1.64 0.88 1.69   NA   NA 1.80 1.83 1.84 1.74
##  [61] 1.45 1.15   NA 1.77 1.05 1.80 2.01 1.60 1.07   NA 0.84 2.11 1.15 1.73 1.84
##  [76] 1.58 1.03 0.82 2.08 1.55 1.66 1.39 1.64 1.86 1.84 0.11 0.85 1.51 0.93 1.79
##  [91] 1.50 1.18 1.46 1.28 1.36 1.56 0.76 1.02 1.99 1.58 0.82 1.64 0.90 0.93 1.34
## [106] 1.36 1.14 1.68

Since we now have a vector, we can now run:

hist( MalePlatys$Weight )

Figure 3.20: Frequency Histogram for Male Platypuses, Method 1

However, if we had no desire to investigate MalePlatys further, it may not be beneficial to create a new dataframe. We can pull just the Weight values by using the following code:

PlatypusData1[ PlatypusData1$Sex == "M", "Weight" ]

##   [1] 1.12 1.18 1.09 1.55 0.52 1.40 1.47 1.37 1.27 1.45 0.47 1.33 1.76 0.66 1.50
##  [16] 1.30 1.33 1.47 1.53 0.91 1.25 1.58 1.67 0.64 1.11 1.29 1.42 0.90 1.48 0.76
##  [31]   NA 1.53 1.20 1.52 1.48 1.60 1.36 1.56   NA 1.59 1.23   NA 1.10 1.21 1.62
##  [46] 1.68 0.90 1.79 1.10 1.43 1.47 1.64 0.88 1.69   NA   NA 1.80 1.83 1.84 1.74
##  [61] 1.45 1.15   NA 1.77 1.05 1.80 2.01 1.60 1.07   NA 0.84 2.11 1.15 1.73 1.84
##  [76] 1.58 1.03 0.82 2.08 1.55 1.66 1.39 1.64 1.86 1.84 0.11 0.85 1.51 0.93 1.79
##  [91] 1.50 1.18 1.46 1.28 1.36 1.56 0.76 1.02 1.99 1.58 0.82 1.64 0.90 0.93 1.34
## [106] 1.36 1.14 1.68

We can then give a histogram of this by putting it into hist via:

hist( PlatypusData1[ PlatypusData1$Sex == "M", "Weight" ] )

Figure 3.21: Frequency Histogram for Male Platypuses, Method 2

There are many ways to accomplish the above task with R. For example PlatypusData1$Weight[ PlatypusData1$Sex == "M" ] is equivalent to PlatypusData1[ PlatypusData1$Sex == "M", "Weight" ], but we recommend the latter form as it allows for easier generalization if we wanted more than just one column. For example, the following code gives (the head of) a dataframe containing both the mass and age class of male platypuses.

head( PlatypusData1[ PlatypusData1$Sex == "M", c( "Weight", "AgeClass" ) ] )

##    Weight AgeClass
## 1    1.12       SA
## 4    1.18        A
## 5    1.09        A
## 7    1.55        A
## 9    0.52        J
## 10   1.40        A

The output is a new dataframe consisting of the portion of PlatypusData1 with only the rows where Sex is "M" and only the columns Weight and AgeClass.

You can also use the code PlatypusData1[ PlatypusData1$Sex == "M", ]$Weight and a few other variations. There is often not a single way to accomplish your goal in R and we encourage the reader to try things and see what they do.
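To see that these variations really do agree, the sketch below builds a tiny dataframe (the name toy and all of its values are made up for illustration) and extracts the weights of the males in four ways, including the base R function subset, which is not used elsewhere in this chapter.

```r
# A toy dataframe (all values are made up for illustration)
toy <- data.frame( Weight = c( 1.1, 0.9, 1.5, 0.8 ),
                   Sex    = c( "M", "F", "M", "F" ) )

w1 <- toy[ toy$Sex == "M", "Weight" ]     # row rule, then column name
w2 <- toy$Weight[ toy$Sex == "M" ]        # column first, then row rule
w3 <- toy[ toy$Sex == "M", ]$Weight       # subset the rows, then pull the column
w4 <- subset( toy, Sex == "M" )$Weight    # subset() from base R

# All four give the same vector
identical( w1, w2 ) && identical( w2, w3 ) && identical( w3, w4 )
```

Any of the four can be passed straight into hist, so the choice is mostly a matter of which reads most clearly to you.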

3.3.2 Histograms for Discrete Data

When graphing continuous data, R typically does a fine job choosing the bins to split the data into. However, the hist function typically does not do a great job at making graphs of discrete data. For example, consider the following vector of dice rolls:

x <- c( 6, 1, 2, 2, 2, 3, 2, 5, 4, 4, 3, 6, 5, 4, 1 )

We can tabulate these to see what a histogram should look like.

table( x )

## x
## 1 2 3 4 5 6 
## 2 4 2 3 2 2

If we use hist to plot this, we get the following:

hist( x )

Figure 3.22: Bad Histogram of Dice Rolls Version 1

It appears that the counts for 1 and 2 have been combined. We could try the option of right = FALSE to solve this, but that gives the following:

hist( x, right = FALSE )

Figure 3.23: Bad Histogram of Dice Rolls Version 2

While this did fix the issue we had with 1s and 2s combining, we now get that the 5s and 6s have been combined. A more traditional way to display discrete data with integer values is to have bins offset by one half. That is, we want the bin that is used for the value of 3, for example, to be from 2.5 to 3.5. We will illustrate this via a larger dataset.
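Before moving to the larger dataset, here is a minimal sketch of the half-offset idea applied to the dice rolls above. It passes a vector of bin boundaries to breaks, a use of breaks beyond the single-number form described in Section 3.3.1, and the name diceHist is only illustrative.

```r
x <- c( 6, 1, 2, 2, 2, 3, 2, 5, 4, 4, 3, 6, 5, 4, 1 )

# Bin edges at 0.5, 1.5, ..., 6.5 so each die face sits in the middle of its own bin
diceHist <- hist( x, breaks = seq( 0.5, 6.5, by = 1 ) )

# The bin counts now match the frequency table
diceHist$counts
table( x )
```

Since no roll can land exactly on a boundary such as 2.5, the choice between right = TRUE and right = FALSE no longer affects the result.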

Example 3.7 We begin by loading in the file called BabyData1. This file contains data about the births of a sample of babies.

BabyData1 <- read.csv( "https://statypus.org/files/BabyData1.csv" )

As mentioned in Chapter 1, we should begin our investigation of BabyData1 by using str and head.

str( BabyData1 )

## 'data.frame':    200 obs. of  12 variables:
##  $ mom_age      : int  35 22 35 23 23 26 25 32 41 22 ...
##  $ dad_age      : int  35 21 42 NA 28 31 37 38 39 24 ...
##  $ mom_educ     : int  6 3 4 1 4 3 5 5 4 3 ...
##  $ mom_marital  : int  1 1 1 1 1 2 1 1 1 2 ...
##  $ numlive      : int  2 1 0 2 0 1 0 1 0 0 ...
##  $ dobmm        : int  2 3 6 8 9 10 7 12 11 2 ...
##  $ gestation    : int  39 42 39 40 42 39 38 38 36 40 ...
##  $ sex          : chr  "F" "F" "F" "F" ...
##  $ weight       : int  3175 3884 3030 3629 3481 3374 2693 4338 2834 2948 ...
##  $ prenatalstart: int  1 2 2 1 2 4 1 1 2 1 ...
##  $ orig_id      : int  1047483 1468100 2260016 3583052 795674 3544316 3726920 2606970 2481971 243759 ...
##  $ preemie      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

This shows us that BabyData1 is a dataframe and contains variables such as the age of the mother and the sex of the baby born. As the dataframe has 200 rows (or observations as it is listed in str), it is helpful to look at only a portion of the data using head now.

head( BabyData1 )

##   mom_age dad_age mom_educ mom_marital numlive dobmm gestation sex weight
## 1      35      35        6           1       2     2        39   F   3175
## 2      22      21        3           1       1     3        42   F   3884
## 3      35      42        4           1       0     6        39   F   3030
## 4      23      NA        1           1       2     8        40   F   3629
## 5      23      28        4           1       0     9        42   F   3481
## 6      26      31        3           2       1    10        39   M   3374
##   prenatalstart orig_id preemie
## 1             1 1047483   FALSE
## 2             2 1468100   FALSE
## 3             2 2260016   FALSE
## 4             1 3583052   FALSE
## 5             2  795674   FALSE
## 6             4 3544316   FALSE

This shows us a portion of BabyData1 without scrolling the column names and labels off our screen. We will look at the data in the variable mom_age, which lists the mother’s age at the time of the baby’s birth. We see that it is of type int, which means we only have integer values in this variable. As such, to avoid the issues we had with the dice rolls, we will need to be much more explicit in how we set up our hist function.

The following code will give a histogram of BabyData1$mom_age where we want each bar of our histogram to include only 1 value.

data <- BabyData1$mom_age      #Entered by the User
bin_width <- 1                 #Entered by the User
#Do not change any values below this comment
hist( 
      data, 
      freq = TRUE, 
      breaks = seq( 
                      from = min( data, na.rm = TRUE ) - 0.5, 
                      to = max( data, na.rm = TRUE ) + bin_width - 0.5, 
                      by = bin_width
                  )
    )

Figure 3.24: Histogram of Mothers’ Ages Version 1

If we wanted to smooth out the histogram by grouping the ages into bins of width 3 (for example, combining the ages 30, 31, and 32 into a single bin), we can easily accomplish this by setting the value of bin_width to be the number of values to group together. Below we set bin_width to be 3 and get a new histogram.

data <- BabyData1$mom_age      #Entered by the User
bin_width <- 3                 #Entered by the User
#Do not change any values below this comment
hist( 
      data, 
      freq = TRUE, 
      breaks = seq( 
                      from = min( data, na.rm = TRUE ) - 0.5, 
                      to = max( data, na.rm = TRUE ) + bin_width - 0.5, 
                      by = bin_width
                  )
    )

Figure 3.25: Histogram of Mothers’ Ages Version 2
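
To see what the breaks argument is doing in the code above, it can help to print the bin edges on their own. The sketch below uses a small made-up vector called ages (an assumption for illustration, not part of BabyData1) with a bin width of 3.

```r
ages <- c( 18, 21, 21, 25, 30 )   # made-up values for illustration only
bin_width <- 3

bin_edges <- seq( 
                  from = min( ages, na.rm = TRUE ) - 0.5, 
                  to = max( ages, na.rm = TRUE ) + bin_width - 0.5, 
                  by = bin_width
                )
bin_edges

# The last edge must sit above the largest value so every value lands in a bin
max( bin_edges ) > max( ages )
```

Starting at the minimum minus one half puts every edge halfway between two integers, so no data value ever sits on an edge. Extending the to value by bin_width - 0.5 gives seq enough room to place one final edge above the maximum.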

The following code chunk should only be used for discrete data. See the Code Template in Section 3.3.1 if your data is not discrete. As a reminder, some data may be stored as integer values, but be from a continuous variable. See Section 2.2.4 for a reminder about discretized data.

To plot a histogram of discrete data, we can use the following code chunk. The user must supply the discrete data they wish to graph in the place of x and the width of the bins they would like to use, as an integer, in the place of w.

#Data must be supplied
data <- x         
#Value must be given  
bin_width <- w            
#Do not change any values below this comment
hist( 
      data, 
      freq = TRUE, 
      breaks = seq( 
                      from = min( data, na.rm = TRUE ) - 0.5, 
                      to = max( data, na.rm = TRUE ) + bin_width - 0.5, 
                      by = bin_width
                  )
    )
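
If you find yourself reusing this template often, one option is to wrap it in a function. The sketch below is only one possible way to do this; the name discrete_hist and the made-up siblings vector are assumptions for illustration and are not built into R.

```r
# discrete_hist is a hypothetical helper name, not a built-in R function
discrete_hist <- function( data, bin_width = 1 ) {
  hist( 
        data, 
        freq = TRUE, 
        breaks = seq( 
                        from = min( data, na.rm = TRUE ) - 0.5, 
                        to = max( data, na.rm = TRUE ) + bin_width - 0.5, 
                        by = bin_width
                    )
      )
}

siblings <- c( 0, 1, 1, 2, 2, 2, 3, 5 )   # made-up counts for illustration
h <- discrete_hist( siblings, bin_width = 1 )
h$counts   # one count for each integer from 0 to 5
```

Since hist returns the bin information it used, saving the result in h lets us check the counts against what table( siblings ) would report.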

Using the x vector from the beginning of Section 3.3.2, create a histogram that correctly displays the data.

3.3.3 Stem and Leaf Plots

To make a Stem and Leaf Plot we simply use the R function stem(). Stem and leaf plots are a fairly unique way to graph data, so we will introduce them via example.

The syntax of stem is

stem( x, scale, width )

where the arguments are:

x: The numeric vector of values to plot.
scale: This controls the plot length.
width: The desired width of plot.

Example 3.8 We begin by looking at the variable qsec in mtcars which measures the quarter mile time of the car in seconds. We first look at the data and then a histogram of it.

mtcars$qsec

##  [1] 16.46 17.02 18.61 19.44 17.02 20.22 15.84 20.00 22.90 18.30 18.90 17.40
## [13] 17.60 18.00 17.98 17.82 17.42 19.47 18.52 19.90 20.01 16.87 17.30 15.41
## [25] 17.05 18.90 16.70 16.90 14.50 15.50 14.60 18.60

hist( mtcars$qsec )

Figure 3.26: Histogram of Quarter Mile Times

We now implement the stem() function.

stem( mtcars$qsec )

## 
##   The decimal point is at the |
## 
##   14 | 56
##   15 | 458
##   16 | 5799
##   17 | 00134468
##   18 | 00356699
##   19 | 459
##   20 | 002
##   21 | 
##   22 | 9

If we were to turn the stem and leaf plot ninety degrees counterclockwise, we would see that these two graphs show nearly the same information. The histogram uses vertical rectangles to show the frequencies and the stem and leaf plot uses a leaf to mark each occurrence of a value with a specified stem.

The first value in mtcars$qsec was 16.46. Looking at the stem and leaf plot we see the number 16 on the left side of the vertical line. To the right of the | we see 5799. This means that our dataset contains values which round to 16.5, 16.7, 16.9, and 16.9. The 16 is known as the stem and the values 5, 7, 9, and 9 are the leaves of that stem. Leaves are always a single digit and are in numerical order of the value they represent. The stem of 21 is included regardless of the fact that it has no leaves, just as the histogram does not omit the portion of the number line containing 21.
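
The leaves on the stem of 16 can be rebuilt by hand. The sketch below only illustrates the bookkeeping described above; it is not a description of how stem() works internally, and the variable names are our own.

```r
# Values of qsec that fall on the stem of 16
stem16 <- mtcars$qsec[ mtcars$qsec >= 16 & mtcars$qsec < 17 ]

rounded <- sort( round( stem16, 1 ) )        # round to one decimal place, then order
leaves  <- round( ( rounded - 16 ) * 10 )    # the digit after the decimal point

rounded
leaves
```

The four leaves produced this way can be compared with the 5799 shown to the right of the stem of 16.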

The astute reader may notice a discrepancy between stem and leaf for the stems of 17 and 18 and the associated bins in the histogram of mtcars$qsec. This is explored in Exercise 3.8.

Example 3.9 (Split Stems) R often will automatically make split stems when necessary, as shown with mtcars$wt which gives the vehicle’s weight in thousands of pounds.

stem( mtcars$wt )

## 
##   The decimal point is at the |
## 
##   1 | 5689
##   2 | 123
##   2 | 56889
##   3 | 22224444
##   3 | 55667888
##   4 | 1
##   4 | 
##   5 | 334

Here there are two stems with a value of 2. One of them has the leaves 0 through 4 and the other has the leaves 5 through 9.

We can also use the scale argument to manually split stems for some datasets. Setting scale = 2 will roughly double the length (number of stems) as compared to the default settings. We can apply this to mtcars$qsec which had standard stems in Example 3.8.

stem( mtcars$qsec, scale = 2 )

## 
##   The decimal point is at the |
## 
##   14 | 56
##   15 | 4
##   15 | 58
##   16 | 
##   16 | 5799
##   17 | 001344
##   17 | 68
##   18 | 003
##   18 | 56699
##   19 | 4
##   19 | 59
##   20 | 002
##   20 | 
##   21 | 
##   21 | 
##   22 | 
##   22 | 9

Example 3.10 (Double Stems) R can also double stems as seen by looking at the variable we looked at before, mtcars$mpg.

stem( mtcars$mpg )

## 
##   The decimal point is at the |
## 
##   10 | 44
##   12 | 3
##   14 | 3702258
##   16 | 438
##   18 | 17227
##   20 | 00445
##   22 | 88
##   24 | 4
##   26 | 03
##   28 | 
##   30 | 44
##   32 | 49

Looking at the stem of 18 we see values which are (when rounded) 18.1, 18.7, 19.2, 19.2, and 19.7. We know this because the leaves are always in order of the values they represent, so 18.2 cannot follow 18.7. Unfortunately, this can lead to some ambiguity, as is the case with the 26 stem, which covers values from 26 up to (but not including) 28. The leaves of 0 and 3 could be either 26.0 and 26.3, 26.0 and 27.3, or 27.0 and 27.3. (27.0 and 26.3 would not be possible due to the leaves having to be in ascending order of their data values.)
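
When a doubled stem leaves us unsure which values are being displayed, we can always go back to the raw data. A minimal sketch for the stem labeled 26, which covers the values from 26 up to 28:

```r
# Raw mpg values that fall on the stem labeled 26
stem26 <- sort( mtcars$mpg[ mtcars$mpg >= 26 & mtcars$mpg < 28 ] )
stem26
```

This settles which pair of values the leaves 0 and 3 stand for.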

We can also use the scale argument to manually double stems for some datasets. Setting scale = 0.5 will roughly halve the length (number of stems) as compared to the default settings. We can apply this to mtcars$qsec which had standard stems in Example 3.8.

stem( mtcars$qsec, scale = 0.5 )

## 
##   The decimal point is at the |
## 
##   14 | 56458
##   16 | 579900134468
##   18 | 00356699459
##   20 | 002
##   22 | 9

Unfortunately, the scale argument for stem is only a suggestion, much like breaks for histograms, where setting breaks = 10, for example, gave a histogram that had roughly (but not necessarily exactly) 10 bins. You should only attempt to modify the scale argument after looking at the default plot and deciding if a modification is warranted.
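
One way to see how loosely scale controls the plot is to count the stems that R actually prints. The helper count_stems below is hypothetical (it is not part of R); it captures the printed plot as text and counts the lines that look like a stem followed by |.

```r
# count_stems is a hypothetical helper: it captures the printed plot and
# counts the lines of the form "stem | leaves"
count_stems <- function( x, scale = 1 ) {
  printed <- capture.output( stem( x, scale = scale ) )
  sum( grepl( "^ *[0-9]+ \\|", printed ) )
}

count_stems( mtcars$qsec )                # default settings
count_stems( mtcars$qsec, scale = 2 )     # roughly twice as many stems
count_stems( mtcars$qsec, scale = 0.5 )   # roughly half as many stems
```

These counts can be checked against the three plots of mtcars$qsec shown earlier in this section, which have 9, 17, and 5 stems respectively.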

Definition 3.2 A good stem and leaf plot should include a legend or key which helps us interpret the plot. R does this as we see that it states The decimal point is at the | so that 22|9 can be quickly interpreted as 22.9.

Example 3.11 (Using a Stem and Leaf Legend) To see how to interpret this in general, we also look at mtcars$disp.

mtcars$disp

##  [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
## [13] 275.8 275.8 472.0 460.0 440.0  78.7  75.7  71.1 120.1 318.0 304.0 350.0
## [25] 400.0  79.0 120.3  95.1 351.0 145.0 301.0 121.0

Knowing what values we are looking at, we can construct a stem and leaf plot.

stem( mtcars$disp )

## 
##   The decimal point is 2 digit(s) to the right of the |
## 
##   0 | 7888
##   1 | 012224
##   1 | 556677
##   2 | 3
##   2 | 6888
##   3 | 002
##   3 | 5566
##   4 | 04
##   4 | 67

Knowing that the first value in the dataset is 160.0 we look for this value in the plot. The legend says The decimal point is 2 digit(s) to the right of the |. This means that the stem of 1 relates to values between 100 and 199 and the leaves indicate the value in the tens place once rounded to the nearest ten, if needed. That is, we expect 160.0 to be shown as 1|6 and we indeed see a leaf of 6 after the (second) stem of 1. The max value in mtcars$disp is 472.0 which would round to 470 which therefore appears as the pair 4|7.
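
The arithmetic behind reading this legend can be sketched in a few lines for the two values discussed above. This is only an illustration of the idea and not how stem() is implemented; in particular, values sitting exactly halfway between two tens may be rounded differently by stem() than by round().

```r
values <- c( 160.0, 472.0 )         # the two values discussed above

tens   <- round( values / 10 )      # round to the nearest ten, counted in tens
stems  <- tens %/% 10               # the hundreds digit becomes the stem
leaves <- tens %% 10                # the tens digit becomes the leaf

paste( stems, leaves, sep = "|" )   # "1|6" "4|7"
```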

To make a stem and leaf plot of a vector x, you use the code:

#Will only run if x is defined
stem( x )

or if we have a column of a dataframe, we would use:

#Will only run if df is appropriately defined
stem( df$Col )

Example 3.12 (The Width argument) The width argument defaults to 80, which is suitable for most plots. However, the width argument can be changed to alter the number of leaves we see on each row. For example, we can use the built-in dataset ChickWeight. We begin by looking at the structure of ChickWeight.

str( ChickWeight )

## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   578 obs. of  4 variables:
##  $ weight: num  42 51 59 64 76 93 106 125 149 171 ...
##  $ Time  : num  0 2 4 6 8 10 12 14 16 18 ...
##  $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
##  $ Diet  : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "formula")=Class 'formula'  language weight ~ Time | Chick
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "outer")=Class 'formula'  language ~Diet
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "labels")=List of 2
##   ..$ x: chr "Time"
##   ..$ y: chr "Body weight"
##  - attr(*, "units")=List of 2
##   ..$ x: chr "(days)"
##   ..$ y: chr "(gm)"

We see that ChickWeight has a very complex structure! We will focus our attention on the weight variable and look at a stem and leaf plot of it with width = 120.

stem( ChickWeight$weight, width = 120 )

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##    2 | 599999999
##    4 | 000001111111111111111111122222222222222233334566788888888999999999999900000011111111222233333444555556666777
##    6 | 0011111112222222233333444445555566667777788888890011111122222233333444444446667778889999
##    8 | 00112223344444455555566777788999990001223333566666788888889
##   10 | 0000111122233333334566667778889901122223445555667789
##   12 | 00002223333344445555667788890113444555566788889
##   14 | 11123444455556666677788890011234444555666777777789
##   16 | 00002233334444466788990000134445555789
##   18 | 12244444555677782225677778889999
##   20 | 0123444555557900245578
##   22 | 0012357701123344556788
##   24 | 08001699
##   26 | 12344569259
##   28 | 01780145
##   30 | 355798
##   32 | 12712
##   34 | 1
##   36 | 13

This produces a plot that (likely) scrolls your screen to display the whole thing. We can compare this to the default settings.

stem( ChickWeight$weight )

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##    2 | 599999999
##    4 | 00000111111111111111111112222222222222223333456678888888899999999999+38
##    6 | 00111111122222222333334444455555666677777888888900111111222222333334+8
##    8 | 00112223344444455555566777788999990001223333566666788888889
##   10 | 0000111122233333334566667778889901122223445555667789
##   12 | 00002223333344445555667788890113444555566788889
##   14 | 11123444455556666677788890011234444555666777777789
##   16 | 00002233334444466788990000134445555789
##   18 | 12244444555677782225677778889999
##   20 | 0123444555557900245578
##   22 | 0012357701123344556788
##   24 | 08001699
##   26 | 12344569259
##   28 | 01780145
##   30 | 355798
##   32 | 12712
##   34 | 1
##   36 | 13

We see on the stem of 4 that we have so many leaves that the final 38 values are simply recorded as +38 at the end of the line.
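
To compare the two width settings without reading the plots line by line, we can capture the printed output as text and measure it. This is a rough sketch; the names default_lines and wide_lines are our own.

```r
default_lines <- capture.output( stem( ChickWeight$weight ) )
wide_lines    <- capture.output( stem( ChickWeight$weight, width = 120 ) )

max( nchar( default_lines ) )   # longest printed line with the default width
max( nchar( wide_lines ) )      # longest printed line with width = 120

# Lines that were cut short end in "+" followed by a count of hidden leaves
grep( "\\+[0-9]+ *$", default_lines, value = TRUE )
```

The last command lists only the rows where leaves were hidden, which is a quick way to decide whether changing width is worthwhile.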

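Alternatively, we can keep the default width and set scale = 2 so that each stem covers a narrower range of values.

stem( ChickWeight$weight, scale = 2 )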

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##    3 | 599999999
##    4 | 00000111111111111111111112222222222222223333456678888888899999999999
##    5 | 000000111111112222333334445555566667778888899999
##    6 | 001111111222222223333344444555556666777778888889
##    7 | 0011111122222233333444444446667778889999
##    8 | 0011222334444445555556677778899999
##    9 | 0001223333566666788888889
##   10 | 00001111222333333345666677788899
##   11 | 01122223445555667789
##   12 | 0000222333334444555566778889
##   13 | 0113444555566788889
##   14 | 1112344445555666667778889
##   15 | 0011234444555666777777789
##   16 | 0000223333444446678899
##   17 | 0000134445555789
##   18 | 1224444455567778
##   19 | 2225677778889999
##   20 | 01234445555579
##   21 | 00245578
##   22 | 00123577
##   23 | 01123344556788
##   24 | 08
##   25 | 001699
##   26 | 12344569
##   27 | 259
##   28 | 0178
##   29 | 0145
##   30 | 35579
##   31 | 8
##   32 | 127
##   33 | 12
##   34 | 1
##   35 | 
##   36 | 1
##   37 | 3

While the user is certainly at liberty to change the values of scale and width for the function stem, here at Statypus we encourage someone to do so only if the default stem and leaf plot has some shortcoming that they are looking to overcome.

3.4 Skew

In this section we will discuss the Skew of a distribution. However, before we begin, it is important to point out the distinction between the histograms we have been looking at and the underlying shape of the distribution. This concept is examined in more detail in Section 6.5.1, but we will boil that section down to the figure below dealing with the weight (in ounces) of Snickers bars.

Figure 3.27: A Histogram and its Underlying Distribution

When we discuss the shape of a distribution, we are really discussing the shape of the underlying “density” curve associated to it. (See Section 6.5.1 if you want to jump down that rabbit hole early.) That is to say, the entire population of Snickers bars (past, present, and future) is shown by the blue curve. While this population will never be truly infinite, it is so large that we do expect to see more “curve” to the data than flat boxes. The gray boxes are (as we just saw in Section 3.3.1) the histogram of a finite sample of Snickers bars and it is fairly clear that no finite number of Snickers bars or boxes in a histogram could produce a truly “smooth curve” like we see in blue above. However, this is the model that we use to describe the distribution.

However, not all data is as symmetric as the distribution of Snickers bars we see above. For example, the following plot shows data similar to the distribution of U.S. household incomes.

Figure 3.28: Non-symmetric Distribution Showing Right Skew

In 2024, the middle (this is the median which we will define in Section 4.1.2 if you want to take a look) U.S. household income was approximately $80,000 and that is represented by the vertical red line in the above plot. At least at a simplistic view, it is not possible to have a negative income, so our values are “boxed” into the positive (technically non-negative) region. This means there is only a region of $80,000 to the left of the red line, but it is clearly possible for an income to sit more than $80,000 to the right of the red line, or even multiple times that far. For example, a family that has an annual income of $400,000 would sit at the far right of the part of the graph shown. Any family making more than $400,000 a year would literally be “off the chart” and is not shown.

Both of these distributions have a single “peak” to them which is sometimes called the “mode” of the distribution. Here at Statypus, we have chosen to avoid diving into the technicalities of the mode or modes of a distribution and will simply discuss modes as being the “peaks” of distributions like the blue curves shown above.

Therefore we will say that a Unimodal Distribution is a distribution where there is only a single peak on the graph. An example of a distribution that is not unimodal is shown below. Since it has two peaks, it is common to call this a Bimodal Distribution, but we will not spend any time getting into those weeds here.

Figure 3.29: Bimodal Distribution

Definition 3.3 While a completely technical definition will not be given here, we will say that a unimodal distribution is Skewed Right if the distribution takes on values further to the right of the peak of the graph than it does to the left.

Similarly, a unimodal distribution is said to be Skewed Left if it takes on values further to the left of the peak of the graph than it does to the right.

We will also use the notation of Right Skewed and Left Skewed when it seems appropriate.

The distribution shown in Figure 3.28 is an example of a right skewed distribution.

Example 3.13 There are a lot of distributions that are right skewed. A lot of variables, such as the household income discussed earlier, cannot take on negative values and thus can stretch further to the right than they can to the left.
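
Definition 3.3 can be checked roughly on a sample by comparing how far the data reach on either side of the tallest bin. The sketch below uses simulated, non-negative values generated by rexp; these are made up for illustration and are not real household incomes.

```r
set.seed( 1 )                             # makes the simulated values reproducible
incomes <- rexp( 1000, rate = 1 / 50 )    # made-up, non-negative values

h <- hist( incomes, plot = FALSE )        # bin the data without drawing
peak <- h$mids[ which.max( h$counts ) ]   # center of the tallest bin

reach_left  <- peak - min( incomes )      # how far the data extend left of the peak
reach_right <- max( incomes ) - peak      # how far the data extend right of the peak

reach_right > reach_left                  # TRUE suggests a right skew
```

This is only a crude check since the location of the tallest bin depends on the bins chosen, but it mirrors the informal definition given above.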

Non-contrived left skewed data is actually rarer. However, we offer a unique example here. We begin by downloading another dataset about births, but this one is much larger than BabyData1. We will call this new dataset BabyData3.³⁵

BabyData3 <- read.csv( "https://statypus.org/files/BigBabyData.csv" )

We begin by looking at the structure³⁶ of the data.

str( BabyData3[ , 1:9 ] )

## 'data.frame':    101400 obs. of  9 variables:
##  $ SEX    : int  2 2 2 1 1 1 1 2 2 1 ...
##  $ MARITAL: int  1 2 1 1 2 1 2 2 1 1 ...
##  $ FAGE   : int  33 19 33 25 21 21 29 23 27 30 ...
##  $ MAGE   : int  34 18 31 28 20 21 32 21 26 22 ...
##  $ GAINED : int  26 40 16 40 60 30 20 41 0 30 ...
##  $ VISITS : int  10 10 14 15 13 15 11 15 12 10 ...
##  $ FEDUC  : int  12 11 16 12 12 12 6 13 10 12 ...
##  $ MEDUC  : int  4 12 16 12 14 13 6 13 13 14 ...
##  $ WEEKS  : int  35 41 39 38 40 42 39 41 38 39 ...

This shows that BabyData3 contains data on over 100,000 different births. We will further investigate BabyData3 in Chapter 14, but will focus on only two variables here: FAGE and MAGE. They represent the age of the baby’s father and mother, respectively.

Consider the following histogram.

data <-  BabyData3$MAGE[ BabyData3$FAGE == 40 ]   
bin_width <- 1              
#Do not change any values below this comment
hist( 
      data, 
      freq = TRUE, 
      breaks = seq( 
                      from = min( data, na.rm = TRUE ) - 0.5, 
                      to = max( data, na.rm = TRUE ) + bin_width - 0.5, 
                      by = bin_width
                  )
    )

abline( v = median( data, na.rm = TRUE ), col = "red" )  #Draws the red line

Figure 3.30: Left Skewed Distribution

What we see here is a histogram of the ages of mothers of babies whose father was 40 at the time of their birth. The red line indicates the middle value (again the median) of ages of the mothers.

This seems to say something about 40 year old men…

We will see how the concept of skew affects the statistics we get from samples in Section 4.5.

We can visualize skew with an exploration.

Review

Definitions

Section 3.1

Definition 3.1

A Pareto Chart is a bar plot where the bars are ordered in descending order from left to right. A Pareto Chart can be stated in terms of either observed frequency or relative frequency.

Section 3.3

Definition 3.2

A good stem and leaf plot should include a legend or key which helps us interpret the plot. R does this as we see that it states The decimal point is at the | so that 22|9 can be quickly interpreted as 22.9.

Section 3.4

Definition 3.3

While a completely technical definition will not be given here, we will say that a unimodal distribution is Skewed Right if the distribution takes on values further to the right of the peak of the graph than it does to the left.

Similarly, a unimodal distribution is said to be Skewed Left if it takes on values further to the left of the peak of the graph than it does to the right.

Important Alerts

Section 3.1

Unless you are working with already tabulated data, that is data already in the form of a table, you will need to pair barplot with table.

Section 3.3

“There are three kinds of lies: lies, damned lies, and statistics.”³⁷

While slightly allegorical, this quote does point out that statistics can be used to deceive people because of the subtlety needed to fully understand statistical graphs and calculations.

Important Remarks

Section 3.1

Further, most R professionals would use an advanced graphical package such as ggplot2 to produce better graphical summaries of data.

We still need to include table to get the relative frequency bar plot.
The graphical output of a frequency and relative frequency bar plot are identical except for the labels on the vertical axis.
The proportions function can be used with table with or without the barplot function. The composition of functions proportions( table( x ) ) will produce a relative frequency table.

It is important to remember that although the graphs may appear different when we change the arguments of barplot, the actual data has not changed and the differences are purely aesthetic. It is the job of the reader to always be a critical consumer and make sure they understand the graph they are viewing.
As mentioned before, there are plenty of options available to make the above graphs better and avoid things like the awkward legend position above. However, this document will focus on trying to produce simple graphs.

Section 3.3

Section 3.4

We will also use the notation of Right Skewed and Left Skewed when it seems appropriate.

The distribution shown in Figure 3.28 is an example of a right skewed distribution.

Code Templates

Section 3.1

If you have a vector, x, or column of a dataframe, df$Col, of values of a qualitative variable, we can make a bar plot of its frequencies via:

#Will only run if x is defined
barplot( table( x ) )

#Will only run if df is appropriately defined
barplot( table( df$Col ) )

If we want relative frequencies instead, we use the following code:

#Will only run if x is defined
barplot( proportions( table( x ) ) )

#Will only run if df is appropriately defined
barplot( proportions( table( df$Col ) ) )

To get a Pareto chart of a qualitative variable, use the following code.

#Will only run if x is defined
barplot( sort( table( x ), decreasing = TRUE ) )

#Will only run if df is appropriately defined
barplot( sort( table( df$Col ), decreasing = TRUE ) )

You can also include the function proportions in between sort and table to make the Pareto chart display relative frequencies.
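
As a sketch of that last remark, the code below builds a relative frequency Pareto chart from a small made-up vector named fav_color (an assumption for illustration only).

```r
fav_color <- c( "red", "blue", "blue", "green", "blue", "red" )   # made-up data

# proportions sits between sort and table
rel_freq <- sort( proportions( table( fav_color ) ), decreasing = TRUE )
rel_freq
barplot( rel_freq )
```

The bars are in the same order as they would be for the frequency version; only the scale on the vertical axis changes.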

Section 3.2

To get a Two Way Table of counts based on the columns Col1 and Col2 of a dataframe df, we use the code:

#Will only run if df is appropriately defined
table( df$Col1, df$Col2 )

To get a Comparative Bar Plot of the counts based on the columns Col1 and Col2 of a dataframe df, we use the code:

#Will only run if df is appropriately defined
barplot( table( df$Col1, df$Col2 ), beside = TRUE, legend.text = TRUE )
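
To try these two templates without loading any files, the sketch below uses the built-in mtcars dataset as a stand-in, with cyl playing the role of Col1 and am playing the role of Col2. This choice of dataset and columns is ours and is only for illustration.

```r
# mtcars is a stand-in dataset; cyl and am play the roles of Col1 and Col2
two_way <- table( mtcars$cyl, mtcars$am )
two_way

barplot( two_way, beside = TRUE, legend.text = TRUE )
```

The rows of the table come from the first column given and the columns come from the second. In the bar plot, each group of bars corresponds to one value of the second variable and the legend identifies the values of the first.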

Section 3.3

To make a basic histogram of a vector x, you can use the code:

#Will only run if x is defined
hist( x, freq = TRUE )

or if we have a column of a dataframe, we would use:

#Will only run if df is appropriately defined
hist( df$Col, freq = TRUE )

Where we set freq = FALSE if we want relative frequencies rather than just frequencies. We can set freq = TRUE or omit freq if we want regular frequencies.

#Data must be supplied
data <- x                 
#Value must be given 
bin_width <- w             
#Do not change any values below this comment
hist( 
      data, 
      freq = TRUE, 
      breaks = seq( 
                      from = min( data, na.rm = TRUE ) - 0.5, 
                      to = max( data, na.rm = TRUE ) + bin_width - 0.5, 
                      by = bin_width
                  )
    )

To make a stem and leaf plot of a vector x, you use the code:

#Will only run if x is defined
stem( x )

or if we have a column of a dataframe, we would use:

#Will only run if df is appropriately defined
stem( df$Col )

Exercises

For Exercise 3.2, load the HistData package. See Example 1.14 in Section 1.8 for instructions for installing HistData if you do not have it on your machine. If you already have it installed, you should only have to run library( HistData ).

For Exercise 3.3, you will need to load the dataset Asylum1849. The code for this can be found in the Now it’s your turn! which appears immediately following Example 3.1.

Exercise 3.1 Would a bar plot or histogram be the appropriate graphical display for the values in mtcars$wt? Explain.

Exercise 3.2 Consider the dataset DrinksWages in the HistData package. Run ?DrinksWages and look at the meaning of the class column. Make a table and relative frequency bar plot of this variable. Which class is the least represented and what proportion of all observations does it represent?

Exercise 3.3 Create a Pareto chart for the married.single variable in the Asylum1849 data frame.

Exercise 3.4 The data set ToothGrowth deals with tooth growth in guinea pigs. Make a histogram of the values in the len column for only those guinea pigs who received orange juice.

Exercise 3.5 Again using the ToothGrowth data set, make a histogram of the values in the len column for only those guinea pigs who received ascorbic acid.

Exercise 3.6 Based on the results of Exercises 3.4 and 3.5, can you see any difference in the value of len based on the delivery method used?

Exercise 3.7 Using ToothGrowth, give a histogram of the values in len for each of the different dose levels (there should be three histograms in total). Does it appear that the dose level has an effect on the len variable?

Exercise 3.8 In Section 3.3.3 there was a discrepancy between some of the stems and the associated bins in the histogram of the data mtcars$qsec. List out the values in the stems and compare them to the values in mtcars$qsec. What is causing the discrepancy between the histogram and the stem and leaf plot?

Fauna of Australia. Vol. 1b. Australian Biological Resources Study (ABRS)↩︎
See Wikipedia for more information.↩︎
To be precise, PlatypusData1 is a part of the dataset titled ID_Pop_Platypus.csv which can be found on GitHub.↩︎
We will dive into this much more later, but the curious reader can see the note following Example 4.16 in Section 4.4.1 or the footnote in Example 5.8 in Section 5.4.↩︎
The curious reader may wonder what happened to BabyData2 or if even such a dataset exists. It does, but we won’t see that dataset until Chapter 14 where we investigate all three of the baby datasets.↩︎
The observant reader may be curious as to the reason for the option 1:9 in the following line of code. The entire dataframe actually contains 37 variables for each observation and we included the option 1:9 to only have R output the first 9 variables.↩︎
See Wikipedia for more information.↩︎