- Introductions
- Course Overview
- Syllabus
- Assignments
- Homework
- Labs
- Data Project
- Final exam
- Introduction to R and RStudio
- The DATA606 R Package
- Using R Markdown
- Intro to Data
September 17, 2018
A little about me:
Syllabus and course materials are here: http://crj504.bryer.org
Date | Chapter | Topic |
---|---|---|
Aug-27 | NO CLASS | |
Sep-3 | NO CLASS: Labor Day | |
Sept-17 | NO CLASS: Rosh Hashanah | |
Sept-24 | 1 | Introduction to Stats |
Oct-1 | 1 | Introduction to Data |
Oct-8 | 2 | Probability |
Oct-15 | 3 | Distributions |
Oct-22 | 4 | Foundations for Inference |
Oct-29 | 5 | Inference for Numerical Data |
Nov-5 | 6 | Inference for Categorical Data |
Nov-12 | 7 | Linear Regression |
Nov-19 | 7 | Linear Regression |
Nov-26 | 8 | Multiple Regression |
Dec-3 | 8 | Logistic Regression |
Dec-10 | Final Exam | |
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues…
R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity. (R-project.org)
For a brief history of R, see "R generation - The story of a statistical programming language that became a subcultural phenomenon" (Thieme, 2018).
The DATA606 R Package
The package can be installed from GitHub using the devtools package.
devtools::install_github('jbryer/DATA606')
- library('DATA606') - Loads the package
- vignette(package='DATA606') - Lists vignettes in the DATA606 package
- vignette('os3') - Loads a PDF of the OpenIntro Statistics book
- data(package='DATA606') - Lists data available in the package
- getLabs() - Returns a list of the available labs
- viewLab('Lab0') - Opens Lab0 in the default web browser
- startLab('Lab0') - Starts Lab0 (copies to getwd()), opens the Rmd file
- shiny_demo() - Lists available Shiny apps
R Markdown files are provided for all the labs. You can start a lab using the DATA606::startLab function.
New R Markdown files can also be created in RStudio by clicking File > New File > R Markdown.
library(readxl)
library(likert)
mass <- as.data.frame(read_excel('../data/MASS.xlsx'))
str(mass)
## 'data.frame': 25 obs. of 15 variables:
##  $ Q3   : chr "Female" "Male" "Female" "Female" ...
##  $ Q2_1 : chr "Neither Agree nor Disagree" "Strongly Disagree" "Neither Agree nor Disagree" "Neither Agree nor Disagree" ...
##  $ Q2_2 : chr "Agree" "Neither Agree nor Disagree" "Agree" "Agree" ...
##  $ Q2_3 : chr "Disagree" "Neither Agree nor Disagree" "Agree" "Strongly Agree" ...
##  $ Q2_4 : chr "Disagree" "Neither Agree nor Disagree" "Agree" "Strongly Agree" ...
##  $ Q2_5 : chr "Neither Agree nor Disagree" "Neither Agree nor Disagree" "Neither Agree nor Disagree" "Strongly Agree" ...
##  $ Q2_6 : chr "Neither Agree nor Disagree" "Disagree" "Agree" "Strongly Agree" ...
##  $ Q2_7 : chr "Agree" "Disagree" "Agree" "Strongly Agree" ...
##  $ Q2_8 : chr "Agree" "Neither Agree nor Disagree" "Neither Agree nor Disagree" "Strongly Agree" ...
##  $ Q2_9 : chr "Agree" "Disagree" "Agree" "Strongly Agree" ...
##  $ Q2_10: chr "Neither Agree nor Disagree" "Strongly Disagree" "Disagree" "Agree" ...
##  $ Q2_11: chr "Neither Agree nor Disagree" "Disagree" "Disagree" "Strongly Agree" ...
##  $ Q2_12: chr "Disagree" "Strongly Disagree" "Disagree" "Strongly Disagree" ...
##  $ Q2_13: chr "Neither Agree nor Disagree" "Neither Agree nor Disagree" "Neither Agree nor Disagree" "Disagree" ...
##  $ Q2_14: chr "Neither Agree nor Disagree" "Neither Agree nor Disagree" "Agree" "Strongly Agree" ...
items <- c('I find math interesting.',
           'I get uptight during math tests.',
           'I think that I will use math in the future.',
           'Mind goes blank and I am unable to think clearly when doing my math test.',
           'Math relates to my life.',
           'I worry about my ability to solve math problems.',
           'I get a sinking feeling when I try to do math problems.',
           'I find math challenging.',
           'Mathematics makes me feel nervous.',
           'I would like to take more math classes.',
           'Mathematics makes me feel uneasy.',
           'Math is one of my favorite subjects.',
           'I enjoy learning with mathematics.',
           'Mathematics makes me feel confused.')
names(mass)[1] <- 'Gender'
names(mass)[2:15] <- items
# Recode the responses to be a factor
for(i in 2:15) {
  mass[,i] <- factor(mass[,i],
                     levels=c('Strongly Disagree', 'Disagree', 'Neither Agree nor Disagree', 'Agree', 'Strongly Agree'),
                     ordered=TRUE)
}
likert.out <- likert(mass[,-1])
plot(likert.out)
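Qualitative Variables
Descriptive statistics: contingency tables, proportional tables
Plot types: bar plots, mosaic plots
Quantitative Variables
Descriptive statistics: mean, median, variance, standard deviation, five-number summary, IQR
Plot types: strip charts, histograms, density plots, boxplots, scatterplots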
We will use the lego R package in this class, which is described as containing information about every Lego set manufactured from 1970 to 2014, a total of 5710 sets. Note that the version loaded below has 6172 rows and includes sets from 2015.
devtools::install_github("seankross/lego")
library(lego)
data(legosets)
str(legosets)
## Classes 'tbl_df', 'tbl' and 'data.frame': 6172 obs. of 14 variables:
##  $ Item_Number : chr "10246" "10247" "10248" "10249" ...
##  $ Name        : chr "Detective's Office" "Ferris Wheel" "Ferrari F40" "Toy Shop" ...
##  $ Year        : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ Theme       : chr "Advanced Models" "Advanced Models" "Advanced Models" "Advanced Models" ...
##  $ Subtheme    : chr "Modular Buildings" "Fairground" "Vehicles" "Winter Village" ...
##  $ Pieces      : int 2262 2464 1158 898 13 39 32 105 13 11 ...
##  $ Minifigures : int 6 10 NA NA 1 2 2 3 2 2 ...
##  $ Image_URL   : chr "http://images.brickset.com/sets/images/10246-1.jpg" "http://images.brickset.com/sets/images/10247-1.jpg" "http://images.brickset.com/sets/images/10248-1.jpg" "http://images.brickset.com/sets/images/10249-1.jpg" ...
##  $ GBP_MSRP    : num 132.99 149.99 69.99 59.99 9.99 ...
##  $ USD_MSRP    : num 159.99 199.99 99.99 79.99 9.99 ...
##  $ CAD_MSRP    : num 200 230 120 NA 13 ...
##  $ EUR_MSRP    : num 149.99 179.99 89.99 69.99 9.99 ...
##  $ Packaging   : chr "Box" "Box" "Box" "Box" ...
##  $ Availability: chr "Retail - limited" "Retail - limited" "LEGO exclusive" "LEGO exclusive" ...
table(legosets$Availability, useNA='ifany')
##
##        LEGO exclusive    LEGOLAND exclusive         Not specified
##                   695                     2                  1795
##           Promotional Promotional (Airline)                Retail
##                   141                    12                  3120
##      Retail - limited               Unknown
##                   403                     4
table(legosets$Availability, legosets$Packaging, useNA='ifany')
##
##                         Blister pack  Box Box with backing card Bucket
##   LEGO exclusive                  45  147                     0      1
##   LEGOLAND exclusive               0    2                     0      0
##   Not specified                    0   20                     0      0
##   Promotional                      0   44                     0      0
##   Promotional (Airline)            0   11                     0      0
##   Retail                          53 2575                    16     30
##   Retail - limited                 2  302                     1      5
##   Unknown                          0    1                     0      0
##
##                         Canister Foil pack Loose Parts Not specified Other
##   LEGO exclusive               0         0          71             7     5
##   LEGOLAND exclusive           0         0           0             0     0
##   Not specified                0         5           0          1739     0
##   Promotional                  0         0           1             0     3
##   Promotional (Airline)        0         0           0             1     0
##   Retail                      78       285           0             0    28
##   Retail - limited             0         1           0             0     0
##   Unknown                      0         0           0             0     0
##
##                         Plastic box Polybag Shrink-wrapped Tag Tub
##   LEGO exclusive                  1     412              0   6   0
##   LEGOLAND exclusive              0       0              0   0   0
##   Not specified                   6      24              0   0   1
##   Promotional                     2      90              0   0   1
##   Promotional (Airline)           0       0              0   0   0
##   Retail                          0       4             18   0  33
##   Retail - limited                1      86              0   0   5
##   Unknown                         0       3              0   0   0
prop.table(table(legosets$Availability))
##
##        LEGO exclusive    LEGOLAND exclusive         Not specified
##          0.1126053143          0.0003240441          0.2908295528
##           Promotional Promotional (Airline)                Retail
##          0.0228451069          0.0019442644          0.5055087492
##      Retail - limited               Unknown
##          0.0652948801          0.0006480881
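prop.table() also accepts a two-way table. The sketch below uses a small made-up table (the counts are hypothetical, not taken from legosets) to show how the margin argument changes what the proportions are relative to: the grand total, each row, or each column.

```r
# Hypothetical counts, invented for illustration only (not from legosets)
counts <- matrix(c(30, 10,
                   20, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Availability = c("Retail", "Promotional"),
                                 Packaging    = c("Box", "Polybag")))
tab <- as.table(counts)

overall <- prop.table(tab)              # each cell divided by the grand total
by_row  <- prop.table(tab, margin = 1)  # each row sums to 1
by_col  <- prop.table(tab, margin = 2)  # each column sums to 1

round(by_row, 2)
```

Row proportions answer "of the Retail sets, what share come in a box?", while column proportions answer "of the boxed sets, what share are Retail?". The same call pattern should work on table(legosets$Availability, legosets$Packaging).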
barplot(table(legosets$Availability), las=3)
barplot(prop.table(table(legosets$Availability)), las=3)
library(vcd)
mosaic(HairEyeColor, shade=TRUE, legend=TRUE)
mean(legosets$Pieces, na.rm=TRUE)
## [1] 215.1686
median(legosets$Pieces, na.rm=TRUE)
## [1] 82
var(legosets$Pieces, na.rm=TRUE)
## [1] 126876.8
sqrt(var(legosets$Pieces, na.rm=TRUE))
## [1] 356.1976
sd(legosets$Pieces, na.rm=TRUE)
## [1] 356.1976
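sqrt(var(...)) and sd(...) agree above because the standard deviation is defined as the square root of the variance. As a small check on a made-up sample (the values are hypothetical), the sketch below computes the sample variance by hand, using the n - 1 denominator that var() uses, and compares it with the built-in functions.

```r
# Small made-up sample, for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
var_by_hand <- sum((x - mean(x))^2) / (n - 1)
sd_by_hand  <- sqrt(var_by_hand)

all.equal(var_by_hand, var(x))  # TRUE
all.equal(sd_by_hand, sd(x))    # TRUE
```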
fivenum(legosets$Pieces, na.rm=TRUE)
## [1] 0.0 30.0 82.0 256.5 5922.0
IQR(legosets$Pieces, na.rm=TRUE)
## [1] 226.25
summary Function
summary(legosets$Pieces)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     0.0    30.0    82.0   215.2   256.2  5922.0     112
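fivenum() reports 256.5 for the upper quartile, while summary() shows 256.2 and the IQR of 226.25 implies 256.25. The difference comes from the method: fivenum() returns Tukey's hinges, whereas summary() and IQR() rely on quantile(), whose default is the type 7 interpolation. A minimal sketch on a made-up vector shows the two approaches giving different answers:

```r
# Made-up vector, for illustration only
x <- c(1, 2, 3, 4, 5, 6)

hinges    <- fivenum(x)                   # min, lower hinge, median, upper hinge, max
quartiles <- quantile(x, c(0.25, 0.75))   # default type = 7 interpolation

hinges[c(2, 4)]          # hinges: 2 and 5
unname(quartiles)        # quartiles: 2.25 and 4.75
hinges[4] - hinges[2]    # spread between the hinges: 3
IQR(x)                   # based on quantile(): 2.5
```

With large samples the two usually differ only slightly, as they do for legosets$Pieces.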
psych Package
library(psych)
describe(legosets$Pieces, skew=FALSE)
##    vars    n   mean    sd min  max range   se
## X1    1 6060 215.17 356.2   0 5922  5922 4.58
describeBy(legosets$Pieces, group = legosets$Availability, skew=FALSE, mat=TRUE)
##     item                group1 vars    n      mean        sd min  max
## X11    1        LEGO exclusive    1  659 172.74203 442.96954   1 3428
## X12    2    LEGOLAND exclusive    1    2 211.00000 154.14928 102  320
## X13    3         Not specified    1 1747 145.87178 309.19929   1 5195
## X14    4           Promotional    1  140  53.97143 108.42721   1 1000
## X15    5 Promotional (Airline)    1   12 126.16667  47.01612  10  203
## X16    6                Retail    1 3094 245.78119 294.78052   0 3803
## X17    7      Retail - limited    1  402 410.94030 652.06435   1 5922
## X18    8               Unknown    1    4  27.50000  15.96872   6   44
##     range         se
## X11  3427  17.255643
## X12   218 109.000000
## X13  5194   7.397620
## X14   999   9.163772
## X15   193  13.572384
## X16  3803   5.299546
## X17  5921  32.522014
## X18    38   7.984360
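Median and IQR are more robust to skewness and outliers than mean and SD. Therefore, for skewed distributions it is often more helpful to describe center and spread with the median and IQR, while for symmetric distributions the mean and SD are often more helpful.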
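To make the robustness claim concrete, the sketch below adds a single extreme value to a small made-up sample (all numbers are hypothetical) and compares how far each statistic moves.

```r
# Made-up sample, for illustration only
x     <- c(10, 12, 13, 15, 16, 18, 20)
x_out <- c(x, 500)   # same sample plus one extreme value

# Collect the four statistics for a vector
stats <- function(v) c(mean = mean(v), sd = sd(v), median = median(v), IQR = IQR(v))

comparison <- rbind(original = stats(x), with_outlier = stats(x_out))
round(comparison, 2)
```

The mean jumps from about 14.9 to 75.5 and the SD grows by well over an order of magnitude, while the median only moves from 15 to 15.5 and the IQR changes far less than the SD.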
stripchart(legosets$Pieces)
par.orig <- par(mar=c(1,10,1,1))
stripchart(legosets$Pieces ~ legosets$Availability, las=1)
par(par.orig)
hist(legosets$Pieces)
With highly skewed distributions, it is often helpful to transform the data. The log transformation is a common approach, especially when dealing with salary or similar data.
hist(log(legosets$Pieces))
plot(density(legosets$Pieces, na.rm=TRUE), main='Lego Pieces per Set')
plot(density(log(legosets$Pieces), na.rm=TRUE), main='Lego Pieces per Set (log transformed)')
boxplot(legosets$Pieces)
boxplot(log(legosets$Pieces))
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 1 is not drawn
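The -Inf in this warning arises because the minimum of Pieces is 0 and log(0) is -Inf. Two common ways to handle this are sketched below on a made-up stand-in vector (the values are hypothetical): drop the non-positive values before taking logs, or use log1p(), which computes log(1 + x). Which is appropriate depends on whether zero-piece sets are meaningful for the analysis.

```r
# Made-up stand-in for legosets$Pieces, including a zero and a missing value
pieces <- c(0, 13, 39, 105, 898, 2262, NA)

log(pieces)[1]   # -Inf: the source of the boxplot warning

# Option 1: keep only strictly positive, non-missing values before logging
positive     <- pieces[!is.na(pieces) & pieces > 0]
log_positive <- log(positive)

# Option 2: log(1 + x), which maps 0 to 0 and keeps every observation
log_shifted <- log1p(pieces)
```

Either result can then be passed to hist(), boxplot(), or density(); for density(), na.rm=TRUE is still needed with the second option because the NA is kept.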
plot(legosets$Pieces, legosets$USD_MSRP)
legosets[which(legosets$USD_MSRP >= 400),]
## # A tibble: 4 x 14
##   Item_Number Name   Year Theme Subtheme Pieces Minifigures Image_URL
##   <chr>       <chr> <int> <chr> <chr>     <int>       <int> <chr>
## 1 2000430     Iden…  2013 Seri… ""           NA           6 http://i…
## 2 2000431     Conn…  2013 Seri… ""         2455          NA http://i…
## 3 2000409     Wind…  2010 Seri… ""         4900          NA http://i…
## 4 10179       Ulti…  2007 Star… Ultimat…   5195           5 http://i…
## # ... with 6 more variables: GBP_MSRP <dbl>, USD_MSRP <dbl>,
## #   CAD_MSRP <dbl>, EUR_MSRP <dbl>, Packaging <chr>, Availability <chr>
legosets[which(legosets$Pieces >= 4000),]
## # A tibble: 4 x 14
##   Item_Number Name   Year Theme Subtheme Pieces Minifigures Image_URL
##   <chr>       <chr> <int> <chr> <chr>     <int>       <int> <chr>
## 1 10214       Towe…  2010 Adva… Buildin…   4287          NA http://i…
## 2 2000409     Wind…  2010 Seri… ""         4900          NA http://i…
## 3 10189       Taj …  2008 Adva… Buildin…   5922          NA http://i…
## 4 10179       Ulti…  2007 Star… Ultimat…   5195           5 http://i…
## # ... with 6 more variables: GBP_MSRP <dbl>, USD_MSRP <dbl>,
## #   CAD_MSRP <dbl>, EUR_MSRP <dbl>, Packaging <chr>, Availability <chr>
plot(legosets$Pieces, legosets$USD_MSRP)
bigAndExpensive <- legosets[which(legosets$Pieces >= 4000 | legosets$USD_MSRP >= 400),]
text(bigAndExpensive$Pieces, bigAndExpensive$USD_MSRP, labels=bigAndExpensive$Name)
There is only one pie chart in OpenIntro Statistics (Diez, Barr, & Çetinkaya-Rundel, 2015, p. 48). Consider the following three pie charts that represent the preference of five different colors. Is there a difference between the three pie charts? This is probably difficult to answer.
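A comparison of this kind can be sketched as follows. The three sets of preferences are made-up values chosen to be close to one another (they are not the data behind the original charts), and the color names are placeholders. The top row draws each set as a pie chart and the bottom row draws the same numbers as bar plots.

```r
# Hypothetical preference percentages for five colors in three surveys
prefs <- rbind(survey1 = c(17, 18, 20, 22, 23),
               survey2 = c(20, 20, 19, 21, 20),
               survey3 = c(23, 22, 20, 18, 17))
colnames(prefs) <- c("Red", "Blue", "Green", "Yellow", "Orange")

op <- par(mfrow = c(2, 3))   # 2 rows x 3 columns of panels
for (i in seq_len(nrow(prefs))) {
  pie(prefs[i, ], main = rownames(prefs)[i])
}
for (i in seq_len(nrow(prefs))) {
  barplot(prefs[i, ], main = rownames(prefs)[i], ylim = c(0, 25), las = 3)
}
par(op)                      # restore the previous plotting parameters
```

In the pie charts the slices look nearly identical across the three panels, whereas the bar plots make the increasing, flat, and decreasing patterns easy to see.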
"There is no data that can be displayed in a pie chart that cannot better be displayed in some other type of chart"
John Tukey