September 17, 2018

Agenda

  • Introductions
  • Course Overview
    • Syllabus
    • Assignments
      • Homework
      • Labs
      • Data Project
      • Final exam
  • Introduction to R and RStudio
    • The DATA606 R Package
    • Using R Markdown
  • Intro to Data

Introduction

A little about me:

  • Currently Executive Director at Excelsior College
    • Principal Investigator for a Department of Education Grant (part of their FIPSE First in the World program) to develop a Diagnostic Assessment and Achievement of College Skills (www.DAACS.net)
  • Authored over a dozen R packages including:
  • Specialize in propensity score methods. Three new methods/R packages developed include:

Also a Father…

And photographer.

Syllabus

Class Schedule

Date     Chapter   Topic
Aug-27             NO CLASS
Sep-3              NO CLASS: Labor Day
Sep-17             NO CLASS: Rosh Hashanah
Sep-24   1         Introduction to Stats
Oct-1    1         Introduction to Data
Oct-8    2         Probability
Oct-15   3         Distributions
Oct-22   4         Foundations for Inference
Oct-29   5         Inference for Numerical Data
Nov-5    6         Inference for Categorical Data
Nov-12   7         Linear Regression
Nov-19   7         Linear Regression
Nov-26   8         Multiple Regression
Dec-3    8         Logistic Regression
Dec-10             Final Exam

Assignments

  • Homework: 32%
  • Labs: 20%
  • Data Project: 30%
  • Final exam: 18%

R and RStudio

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues…

R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity. (R-project.org)

For a brief history of R, see "R generation - The story of a statistical programming language that became a subcultural phenomenon" (Thieme, 2018).

The DATA606 R Package

The package can be installed from GitHub using the devtools package.

devtools::install_github('jbryer/DATA606')

Important Functions

  • library('DATA606') - Load the package
  • vignette(package='DATA606') - Lists vignettes in the DATA606 package
  • vignette('os3') - Loads a PDF of the OpenIntro Statistics book
  • data(package='DATA606') - Lists data available in the package
  • getLabs() - Returns a list of the available labs
  • viewLab('Lab0') - Opens Lab0 in the default web browser
  • startLab('Lab0') - Starts Lab0 (copies to getwd()), opens the Rmd file
  • shiny_demo() - Lists available Shiny apps

Using R Markdown

R Markdown files are provided for all the labs. You can start a lab using the DATA606::startLab function.

To create a new R Markdown file from scratch in RStudio, click File > New File > R Markdown.

Survey Results

library(readxl)
library(likert)

mass <- as.data.frame(read_excel('../data/MASS.xlsx'))
str(mass)
## 'data.frame':    25 obs. of  15 variables:
##  $ Q3   : chr  "Female" "Male" "Female" "Female" ...
##  $ Q2_1 : chr  "Neither Agree nor Disagree" "Strongly Disagree" "Neither Agree nor Disagree" "Neither Agree nor Disagree" ...
##  $ Q2_2 : chr  "Agree" "Neither Agree nor Disagree" "Agree" "Agree" ...
##  $ Q2_3 : chr  "Disagree" "Neither Agree nor Disagree" "Agree" "Strongly Agree" ...
##  $ Q2_4 : chr  "Disagree" "Neither Agree nor Disagree" "Agree" "Strongly Agree" ...
##  $ Q2_5 : chr  "Neither Agree nor Disagree" "Neither Agree nor Disagree" "Neither Agree nor Disagree" "Strongly Agree" ...
##  $ Q2_6 : chr  "Neither Agree nor Disagree" "Disagree" "Agree" "Strongly Agree" ...
##  $ Q2_7 : chr  "Agree" "Disagree" "Agree" "Strongly Agree" ...
##  $ Q2_8 : chr  "Agree" "Neither Agree nor Disagree" "Neither Agree nor Disagree" "Strongly Agree" ...
##  $ Q2_9 : chr  "Agree" "Disagree" "Agree" "Strongly Agree" ...
##  $ Q2_10: chr  "Neither Agree nor Disagree" "Strongly Disagree" "Disagree" "Agree" ...
##  $ Q2_11: chr  "Neither Agree nor Disagree" "Disagree" "Disagree" "Strongly Agree" ...
##  $ Q2_12: chr  "Disagree" "Strongly Disagree" "Disagree" "Strongly Disagree" ...
##  $ Q2_13: chr  "Neither Agree nor Disagree" "Neither Agree nor Disagree" "Neither Agree nor Disagree" "Disagree" ...
##  $ Q2_14: chr  "Neither Agree nor Disagree" "Neither Agree nor Disagree" "Agree" "Strongly Agree" ...

Survey Results (cont.)

items <- c('I find math interesting.',
           'I get uptight during math tests.',
           'I think that I will use math in the future.',
           'Mind goes blank and I am unable to think clearly when doing my math test.',
           'Math relates to my life.',
           'I worry about my ability to solve math problems.',
           'I get a sinking feeling when I try to do math problems.',
           'I find math challenging.',
           'Mathematics makes me feel nervous.',
           'I would like to take more math classes.',
           'Mathematics makes me feel uneasy.',
           'Math is one of my favorite subjects.',
           'I enjoy learning with mathematics.',
           'Mathematics makes me feel confused.')
names(mass)[1] <- 'Gender'
names(mass)[2:15] <- items
# Recode the responses to be a factor
for(i in 2:15) {
    mass[,i] <- factor(mass[,i], levels=c('Strongly Disagree', 'Disagree',
                            'Neither Agree nor Disagree', 'Agree', 'Strongly Agree'),
                       ordered=TRUE)
}

Survey Results (cont.)

likert.out <- likert(mass[,-1])
plot(likert.out)

Types of Variables

  • Numerical (quantitative)
    • Continuous
    • Discrete
  • Categorical (qualitative)
    • Regular categorical
    • Ordinal

Data Types in R
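
As a minimal sketch of how the variable types above can map onto R classes: the vectors below are made-up illustrations, not course data, and the mapping shown (numeric, integer, factor, ordered factor) is one common convention rather than the only option.

```r
# Hypothetical example values, invented for illustration only
height   <- c(170.2, 165.5, 180.1)            # continuous  -> numeric
siblings <- c(0L, 2L, 1L)                     # discrete    -> integer
color    <- factor(c("red", "blue", "red"))   # regular categorical -> unordered factor
rating   <- factor(c("Low", "High", "Medium"),
                   levels = c("Low", "Medium", "High"),
                   ordered = TRUE)            # ordinal -> ordered factor

class(height)       # "numeric"
class(siblings)     # "integer"
class(color)        # "factor"
class(rating)       # "ordered" "factor"

# Ordered factors support comparisons that respect the level order
rating < "High"     # TRUE FALSE TRUE
```

The same factor(..., ordered=TRUE) call is what the survey recoding loop earlier uses for the Likert responses.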

Descriptive Statistics and Visualizations

Qualitative Variables

Descriptive statistics:

  • Contingency Tables
  • Proportional Tables

Plot types:

  • Bar plot
  • Mosaic plot

Quantitative Variables

Descriptive statistics:

  • Mean
  • Median
  • Quartiles
  • Variance: \(s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}\)
  • Standard deviation: \(s=\sqrt{s^2}\)
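
The variance and standard deviation formulas can be checked by hand against R's built-in functions. This is a small sketch on a made-up sample (the vector x is an assumption for illustration only):

```r
# Made-up sample, for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n     <- length(x)
x_bar <- mean(x)                      # 5

# Sample variance: sum of squared deviations divided by n - 1
s2 <- sum((x - x_bar)^2) / (n - 1)    # 32 / 7
s  <- sqrt(s2)

# Both should agree with the built-in functions
all.equal(s2, var(x))   # TRUE
all.equal(s, sd(x))     # TRUE
```

Note the n - 1 denominator: var() and sd() compute the sample versions, not the population versions.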

Plot types:

  • Dot plots
  • Histograms
  • Density plots
  • Box plots
  • Scatterplots

Intro to Data

We will use the lego R package in this class, which contains information about Lego sets manufactured from 1970 onward. The version loaded below contains 6,172 sets and includes sets released in 2015.

devtools::install_github("seankross/lego")
library(lego)
data(legosets)
str(legosets)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6172 obs. of  14 variables:
##  $ Item_Number : chr  "10246" "10247" "10248" "10249" ...
##  $ Name        : chr  "Detective's Office" "Ferris Wheel" "Ferrari F40" "Toy Shop" ...
##  $ Year        : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ Theme       : chr  "Advanced Models" "Advanced Models" "Advanced Models" "Advanced Models" ...
##  $ Subtheme    : chr  "Modular Buildings" "Fairground" "Vehicles" "Winter Village" ...
##  $ Pieces      : int  2262 2464 1158 898 13 39 32 105 13 11 ...
##  $ Minifigures : int  6 10 NA NA 1 2 2 3 2 2 ...
##  $ Image_URL   : chr  "http://images.brickset.com/sets/images/10246-1.jpg" "http://images.brickset.com/sets/images/10247-1.jpg" "http://images.brickset.com/sets/images/10248-1.jpg" "http://images.brickset.com/sets/images/10249-1.jpg" ...
##  $ GBP_MSRP    : num  132.99 149.99 69.99 59.99 9.99 ...
##  $ USD_MSRP    : num  159.99 199.99 99.99 79.99 9.99 ...
##  $ CAD_MSRP    : num  200 230 120 NA 13 ...
##  $ EUR_MSRP    : num  149.99 179.99 89.99 69.99 9.99 ...
##  $ Packaging   : chr  "Box" "Box" "Box" "Box" ...
##  $ Availability: chr  "Retail - limited" "Retail - limited" "LEGO exclusive" "LEGO exclusive" ...

Contingency Tables

table(legosets$Availability, useNA='ifany')
## 
##        LEGO exclusive    LEGOLAND exclusive         Not specified 
##                   695                     2                  1795 
##           Promotional Promotional (Airline)                Retail 
##                   141                    12                  3120 
##      Retail - limited               Unknown 
##                   403                     4
table(legosets$Availability, legosets$Packaging, useNA='ifany')
##                        
##                         Blister pack  Box Box with backing card Bucket
##   LEGO exclusive                  45  147                     0      1
##   LEGOLAND exclusive               0    2                     0      0
##   Not specified                    0   20                     0      0
##   Promotional                      0   44                     0      0
##   Promotional (Airline)            0   11                     0      0
##   Retail                          53 2575                    16     30
##   Retail - limited                 2  302                     1      5
##   Unknown                          0    1                     0      0
##                        
##                         Canister Foil pack Loose Parts Not specified Other
##   LEGO exclusive               0         0          71             7     5
##   LEGOLAND exclusive           0         0           0             0     0
##   Not specified                0         5           0          1739     0
##   Promotional                  0         0           1             0     3
##   Promotional (Airline)        0         0           0             1     0
##   Retail                      78       285           0             0    28
##   Retail - limited             0         1           0             0     0
##   Unknown                      0         0           0             0     0
##                        
##                         Plastic box Polybag Shrink-wrapped  Tag  Tub
##   LEGO exclusive                  1     412              0    6    0
##   LEGOLAND exclusive              0       0              0    0    0
##   Not specified                   6      24              0    0    1
##   Promotional                     2      90              0    0    1
##   Promotional (Airline)           0       0              0    0    0
##   Retail                          0       4             18    0   33
##   Retail - limited                1      86              0    0    5
##   Unknown                         0       3              0    0    0

Proportional Tables

prop.table(table(legosets$Availability))
## 
##        LEGO exclusive    LEGOLAND exclusive         Not specified 
##          0.1126053143          0.0003240441          0.2908295528 
##           Promotional Promotional (Airline)                Retail 
##          0.0228451069          0.0019442644          0.5055087492 
##      Retail - limited               Unknown 
##          0.0652948801          0.0006480881

Bar Plots

barplot(table(legosets$Availability), las=3)

Bar Plots

barplot(prop.table(table(legosets$Availability)), las=3)

Mosaic Plot

library(vcd)
mosaic(HairEyeColor, shade=TRUE, legend=TRUE)

Measures of Center

mean(legosets$Pieces, na.rm=TRUE)
## [1] 215.1686
median(legosets$Pieces, na.rm=TRUE)
## [1] 82

Measures of Spread

var(legosets$Pieces, na.rm=TRUE)
## [1] 126876.8
sqrt(var(legosets$Pieces, na.rm=TRUE))
## [1] 356.1976
sd(legosets$Pieces, na.rm=TRUE)
## [1] 356.1976


fivenum(legosets$Pieces, na.rm=TRUE)
## [1]    0.0   30.0   82.0  256.5 5922.0
IQR(legosets$Pieces, na.rm=TRUE)
## [1] 226.25
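
The upper value from fivenum (256.5) does not quite match the third quartile implied by IQR (30 + 226.25 = 256.25). That is expected: fivenum returns Tukey's hinges, while IQR relies on quantile(), whose default method interpolates differently. A toy sketch (the vector x is made up for illustration):

```r
x <- c(1, 2, 3, 4)   # toy data, for illustration only

# Tukey's five-number summary: the hinges are medians of the lower and upper halves
fivenum(x)                                  # 1.0 1.5 2.5 3.5 4.0

# Default quantile method (type 7) interpolates between order statistics
quantile(x, c(0.25, 0.75), names = FALSE)   # 1.75 3.25

# IQR is the difference of the quantile()-based quartiles, not of the hinges
IQR(x)                                      # 1.5
```

For large samples the two definitions usually differ only slightly, as in the Lego data.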

The summary Function

summary(legosets$Pieces)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    30.0    82.0   215.2   256.2  5922.0     112

The psych Package

library(psych)
describe(legosets$Pieces, skew=FALSE)
##    vars    n   mean    sd min  max range   se
## X1    1 6060 215.17 356.2   0 5922  5922 4.58
describeBy(legosets$Pieces, group = legosets$Availability, skew=FALSE, mat=TRUE)
##     item                group1 vars    n      mean        sd min  max
## X11    1        LEGO exclusive    1  659 172.74203 442.96954   1 3428
## X12    2    LEGOLAND exclusive    1    2 211.00000 154.14928 102  320
## X13    3         Not specified    1 1747 145.87178 309.19929   1 5195
## X14    4           Promotional    1  140  53.97143 108.42721   1 1000
## X15    5 Promotional (Airline)    1   12 126.16667  47.01612  10  203
## X16    6                Retail    1 3094 245.78119 294.78052   0 3803
## X17    7      Retail - limited    1  402 410.94030 652.06435   1 5922
## X18    8               Unknown    1    4  27.50000  15.96872   6   44
##     range         se
## X11  3427  17.255643
## X12   218 109.000000
## X13  5194   7.397620
## X14   999   9.163772
## X15   193  13.572384
## X16  3803   5.299546
## X17  5921  32.522014
## X18    38   7.984360

Robust Statistics

Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,

  • for skewed distributions it is often more helpful to use median and IQR to describe the center and spread
  • for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread
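
To illustrate the robustness claim, here is a small sketch with made-up numbers: adding a single extreme value moves the mean and SD dramatically, while the median and IQR change only a little.

```r
# Made-up values, for illustration only
x     <- c(10, 12, 13, 15, 20)
x_out <- c(x, 1000)            # same data plus one extreme observation

mean(x)         # 14
median(x)       # 13
IQR(x)          # 3

mean(x_out)     # about 178.3, pulled far toward the outlier
median(x_out)   # 14, barely changed
IQR(x_out)      # 6.5

# The SD is inflated by a large factor, the IQR only modestly
sd(x_out) / sd(x)
IQR(x_out) / IQR(x)
```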

Dot Plot

stripchart(legosets$Pieces)

Dot Plot

par.orig <- par(mar=c(1,10,1,1))
stripchart(legosets$Pieces ~ legosets$Availability, las=1)

par(par.orig)

Histograms

hist(legosets$Pieces)

Transformations

With highly skewed distributions, it is often helpful to transform the data. The log transformation is a common approach, especially when dealing with salary or similar data.

hist(log(legosets$Pieces))
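
One caveat when log transforming: Pieces has a minimum of 0, and log(0) is -Inf, which is what triggers the warning in the log-scale box plot later on. The sketch below uses a made-up vector (not the actual data) to show two possible ways to handle this; which one is appropriate depends on the analysis.

```r
# Made-up values echoing the range of Pieces, for illustration only
pieces <- c(0, 13, 82, 256, 5922, NA)

log_raw <- log(pieces)
log_raw                  # first element is -Inf, last is NA
is.finite(log_raw)       # FALSE TRUE TRUE TRUE TRUE FALSE

# Option 1: keep only strictly positive, non-missing values before taking logs
log_pos <- log(pieces[!is.na(pieces) & pieces > 0])

# Option 2: use log(1 + x), which maps 0 to 0 and keeps every observation
log_1p <- log1p(pieces)
log_1p[1]                # 0
```

Option 1 drops the zero-piece sets entirely, while Option 2 keeps them but changes the scale slightly for small values.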

Density Plots

plot(density(legosets$Pieces, na.rm=TRUE), main='Lego Pieces per Set')

Density Plot (log transformed)

plot(density(log(legosets$Pieces), na.rm=TRUE), main='Lego Pieces per Set (log transformed)')

Box Plots

boxplot(legosets$Pieces)

boxplot(log(legosets$Pieces))
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 1 is not drawn

Scatter Plots

plot(legosets$Pieces, legosets$USD_MSRP)

Examining Possible Outliers (expensive sets)

legosets[which(legosets$USD_MSRP >= 400),]
## # A tibble: 4 x 14
##   Item_Number Name   Year Theme Subtheme Pieces Minifigures Image_URL
##   <chr>       <chr> <int> <chr> <chr>     <int>       <int> <chr>    
## 1 2000430     Iden…  2013 Seri… ""           NA           6 http://i…
## 2 2000431     Conn…  2013 Seri… ""         2455          NA http://i…
## 3 2000409     Wind…  2010 Seri… ""         4900          NA http://i…
## 4 10179       Ulti…  2007 Star… Ultimat…   5195           5 http://i…
## # ... with 6 more variables: GBP_MSRP <dbl>, USD_MSRP <dbl>,
## #   CAD_MSRP <dbl>, EUR_MSRP <dbl>, Packaging <chr>, Availability <chr>

Examining Possible Outliers (big sets)

legosets[which(legosets$Pieces >= 4000),]
## # A tibble: 4 x 14
##   Item_Number Name   Year Theme Subtheme Pieces Minifigures Image_URL
##   <chr>       <chr> <int> <chr> <chr>     <int>       <int> <chr>    
## 1 10214       Towe…  2010 Adva… Buildin…   4287          NA http://i…
## 2 2000409     Wind…  2010 Seri… ""         4900          NA http://i…
## 3 10189       Taj …  2008 Adva… Buildin…   5922          NA http://i…
## 4 10179       Ulti…  2007 Star… Ultimat…   5195           5 http://i…
## # ... with 6 more variables: GBP_MSRP <dbl>, USD_MSRP <dbl>,
## #   CAD_MSRP <dbl>, EUR_MSRP <dbl>, Packaging <chr>, Availability <chr>

plot(legosets$Pieces, legosets$USD_MSRP)
bigAndExpensive <- legosets[which(legosets$Pieces >= 4000 | legosets$USD_MSRP >= 400),]
text(bigAndExpensive$Pieces, bigAndExpensive$USD_MSRP, labels=bigAndExpensive$Name)

Pie Charts

There is only one pie chart in OpenIntro Statistics (Diez, Barr, & Çetinkaya-Rundel, 2015, p. 48). Consider the following three pie charts that represent preferences among five different colors. Is there a difference between the three pie charts? This is probably difficult to answer.

Source: https://en.wikipedia.org/wiki/Pie_chart.

Just say NO to pie charts!

"There is no data that can be displayed in a pie chart that cannot better be displayed in some other type of chart"

John Tukey