R Basics

From simple addition to data frames, graphs, and cleaning data sets

Basic Building Blocks

5+7

x<-5+7 # x is assigned the value of five plus seven
x

y<-x-3
y

Any object that contains data is called a data structure, and numeric vectors are the simplest type of data structure in R.

The easiest way to create a vector is with the c() function, which stands for 'concatenate' or 'combine'.

z<-c(1,2.2,3)
z

# For help, ?function_name opens that function's documentation
?c

#combining the vectors 
c(z,555,z)

#operations on vector 
z*2+100

Other common arithmetic operators are `+`, `-`, `/`, and `^` (where x^2 means 'x squared'). To take the square root, use the sqrt() function and to take the absolute value, use the abs() function.

my_sqrt<-sqrt(z-1)
my_sqrt

my_div<-z/my_sqrt
my_div

When given two vectors of the same length, R simply performs the specified arithmetic operation (`+`, `-`, `*`, etc.) element-by-element. If the vectors are of different lengths, R 'recycles' the shorter vector until it is the same length as the longer vector.

#example
c(1,2,3,4)+c(0,10)

# In case the longer vector's length is not a multiple of the shorter vector's length, R still recycles but issues a warning
c(1,2,3,4)+c(0,10,100)

Warning message in c(1, 2, 3, 4) + c(0, 10, 100):
“longer object length is not a multiple of shorter object length”

Workspace and Files

#Determine which directory your R session is using as its current working directory using getwd().
getwd()

# List all the objects in your local workspace using ls()
ls()

#List all the files in your working directory using list.files() or dir().
list.files()

dir()

#Using the args() function on a function name is also a handy way to see what arguments a function can take.
args(list.files)

function (path = ".", pattern = NULL, all.files = FALSE, full.names = FALSE, 
    recursive = FALSE, ignore.case = FALSE, include.dirs = FALSE, 
    no.. = FALSE) 
NULL

#Use dir.create() to create a directory in the current working directory called "testdir".
dir.create("testdir")

#Set your working directory to "testdir" with the setwd() command.
setwd("testdir")

#Create a file in your working directory called "mytest.R" using the file.create() function.
file.create("mytest.R")

list.files()

#Check to see if "mytest.R" exists in the working directory using the file.exists() function.
file.exists("mytest.R")

#Access information about the file "mytest.R" by using file.info().
file.info("mytest.R")

#You can use the $ operator --- e.g., file.info("mytest.R")$mode --- to grab specific items.
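As a small sketch of the `$` access mentioned above: file.info() returns a data frame, so individual columns can be pulled out by name. The temporary file created with tempfile() is an assumption made for illustration, so that the example does not touch your working directory.

```r
# Create an empty temporary file (a hypothetical stand-in for "mytest.R")
tmp <- tempfile(fileext = ".R")
created <- file.create(tmp)

info <- file.info(tmp)
info_class <- class(info)    # file.info() returns a data frame
file_size <- info$size       # size in bytes; 0 for a newly created empty file
is_directory <- info$isdir   # FALSE for a regular file

print(c(size = file_size, isdir = is_directory))

unlink(tmp)  # clean up the temporary file
```

The same pattern works for the other columns, such as `mode` or `mtime`.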

#Change the name of the file "mytest.R" to "mytest2.R" by using file.rename().
file.rename("mytest.R","mytest2.R")

#Make a copy of "mytest2.R" called "mytest3.R" using file.copy().
file.copy("mytest2.R","mytest3.R")

#Provide the relative path to the file "mytest3.R" by using file.path().
file.path("mytest3.R")

#You can use file.path to construct file and directory paths that are independent of the operating system
#your R code is running on. Pass 'folder1' and 'folder2' as arguments to file.path to make a
# platform-independent pathname.
file.path("folder1","folder2")

# Create a directory in the current working directory called "testdir2" and a subdirectory for it called
# "testdir3", all in one command by using dir.create() and file.path().
dir.create(file.path("testdir2","testdir3"), recursive = TRUE)

# To delete a directory you need to use the recursive = TRUE argument with the function unlink(). If you
# don't use recursive = TRUE, R is concerned that you're unaware that you're deleting a directory and all
# of its contents. R reasons that, if you don't specify that recursive equals TRUE, you don't know that
# something is in the directory you're trying to delete. R tries to prevent you from making a mistake.
unlink("testdir2") # doesn't delete it: directories are skipped unless recursive = TRUE

unlink("testdir2", recursive = TRUE) #works
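To confirm what the two unlink() calls actually do, one option is to check with dir.exists() after each call. The sketch below works inside tempdir() with a hypothetical directory name (`unlink_demo`), so it is independent of the directories created above.

```r
# Hypothetical nested directory inside the session's temporary directory
demo_dir <- file.path(tempdir(), "unlink_demo")
dir.create(file.path(demo_dir, "inner"), recursive = TRUE)

unlink(demo_dir)                          # no recursive = TRUE: the directory is left alone
exists_after_plain <- dir.exists(demo_dir)

unlink(demo_dir, recursive = TRUE)        # removes the directory and its contents
exists_after_recursive <- dir.exists(demo_dir)

print(c(plain = exists_after_plain, recursive = exists_after_recursive))
```

Note that unlink() does not raise an error in the first case, which is why an explicit check is useful.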

getwd()

setwd('/Users/prashanth/DS-14.310x')

getwd()

list.files()

#Delete the 'testdir' directory that you just left (and everything in it)
unlink("testdir", recursive = TRUE)

Sequences of Numbers

#The simplest way to create a sequence of numbers in R is by using the `:` operator. Type 1:20 to see how it works.
1:20

pi:10

15:1

# Operators have documentation too. Pull up the documentation for `:` now.
?`:`

# Often, we'll desire more control over a sequence we're creating than what the `:` operator gives us. The
# seq() function serves this purpose.
seq(1,20)

seq(1,10, by=0.5) #by half increments

my_seq<-seq(5, 10, length.out=30) # 30 evenly spaced values from 5 to 10
my_seq

length(my_seq)

1:length(my_seq)

seq(along.with=my_seq)

seq_along(my_seq)
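The three forms above give the same result for my_seq, but they differ on an empty vector. The following sketch (the variable names are illustrative) shows why seq_along() is usually the safer choice when building loop indices.

```r
empty <- numeric(0)

colon_idx <- 1:length(empty)    # 1:0 counts DOWN, giving the two values 1 and 0
along_idx <- seq_along(empty)   # an empty integer vector

print(colon_idx)
print(along_idx)

# A loop over seq_along() correctly runs zero times for an empty vector
iterations <- 0
for (i in along_idx) iterations <- iterations + 1
```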

#One more function related to creating sequences of numbers is rep(), which stands for 'replicate'. Let's 
#look at a few uses.
rep(0, times= 40)

rep(c(0,1,2), times = 10) # repeat the whole vector 10 times

rep(c(0, 1, 2), each = 10) # repeat each number 10 times

Vectors

The simplest and most common data structure in R is the vector. Vectors come in two different flavors: atomic vectors and lists. An atomic vector contains exactly one data type, whereas a list may contain multiple data types.
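A minimal sketch of that difference, using made-up values: when mixed types are combined with c(), R coerces them to a single type, while list() keeps each element's own type.

```r
# c() builds an atomic vector, so everything is coerced to one type
atomic <- c(1, "a", TRUE)
atomic_type <- typeof(atomic)            # "character"

# list() keeps each element as it is
mixed <- list(1, "a", TRUE)
element_types <- sapply(mixed, typeof)   # "double" "character" "logical"

print(atomic)
print(element_types)
```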

Types of atomic vectors include numeric, logical, character, integer, and complex. Logical vectors can contain the values TRUE, FALSE, and NA (for 'not available'). These values are generated as the result of logical 'conditions'.

num_vect<-c(0.5,55,-10,6)
tf <- num_vect < 1
tf

num_vect>=6

The `<` and `>=` symbols in these examples are called 'logical operators'. Other logical operators include `>`, `<=`, `==` for exact equality, and `!=` for inequality.

If we have two logical expressions, A and B, we can ask whether at least one is TRUE with A | B (logical 'or' a.k.a. 'union') or whether they are both TRUE with A & B (logical 'and' a.k.a. 'intersection'). Lastly, !A is the negation of A and is TRUE when A is FALSE and vice versa.

(3 > 5) & (4 == 4)

(TRUE == TRUE) | (TRUE == FALSE)

((111 >= 111) | !(TRUE)) & ((4 + 1) == 5)

# Create a character vector that contains the following words: "My", "name", "is". Remember to enclose each
# word in its own set of double quotes, so that R knows they are character strings. Store the vector in a
# variable called my_char.
my_char<-c("My","name","is")
my_char

paste(my_char, collapse = " ") # joins the elements of a vector into a single string

my_name=c(my_char,"chika chika slam shady") # appends a new element to the character vector
my_name

paste(my_name, collapse = " ")

paste("Hello", "world!", sep = " ")

paste(1:3,c("X","Y","Z"),sep="") # integers and characters, pasted element by element

#Try paste(LETTERS, 1:4, sep = "-"), where LETTERS is a predefined variable in R 
# containing a character vector of all 26 letters in the English alphabet.
paste(LETTERS, 1:4, sep = "-")

Note: If the lengths of the vectors are not equal, the shorter vector is recycled (here 1:4 repeats until it matches the 26 letters).

Missing Values

Missing values play an important role in statistics and data analysis. Often, missing values must not be ignored, but rather they should be carefully studied to see if there's an underlying pattern or cause for their missingness.

In R, NA is used to represent any value that is 'not available' or 'missing' (in the statistical sense). In this lesson, we'll explore missing values further. Any operation involving NA generally yields NA as the result.

x<-c(44,NA,5,NA)
x*3

y <- rnorm(1000) # vector containing 1000 draws from a standard normal distribution
z<- rep(NA, 1000) # vector of NA's
my_data <- sample(c(y,z), 100) # draw a random sample of 100 values from the two vectors combined
my_na <- is.na(my_data) # TRUE where the value is NA, FALSE otherwise
my_na

my_data == NA # won't work: comparing with NA just returns a vector of NAs of the same length. Careful!

Underneath the surface, R represents TRUE as the number 1 and FALSE as the number 0.

sum(my_na) # counts the TRUE values, i.e. the number of NAs
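Because TRUE counts as 1 and FALSE as 0, the same trick also gives proportions. A small sketch with a hand-written logical vector (`flags` is an illustrative name, not part of the lesson):

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)

as_numbers <- as.numeric(flags)  # 1 0 1 1
how_many <- sum(flags)           # number of TRUE values: 3
share <- mean(flags)             # proportion of TRUE values: 0.75

print(c(count = how_many, proportion = share))
```

Applied to the lesson's data, mean(my_na) would give the fraction of sampled values that are missing.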

#let's look at a second type of missing value -- NaN, which stands for 'not a number'.
0/0

Inf-Inf #Inf stands for infinity
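NA and NaN are related but not the same thing, and the test functions treat them differently. The sketch below uses a small made-up vector to show how is.na(), is.nan(), and is.finite() each classify the special values.

```r
vals <- c(1, NA, NaN, Inf)

na_flags <- is.na(vals)          # TRUE for both NA and NaN
nan_flags <- is.nan(vals)        # TRUE only for NaN
finite_flags <- is.finite(vals)  # TRUE only for ordinary numbers

print(rbind(is.na = na_flags, is.nan = nan_flags, is.finite = finite_flags))
```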

Subsetting Vectors

In this lesson, we'll see how to extract elements from a vector based on some conditions that we specify.

x <- rep(c(NA,2.5,NA,-1),10) #sample vector
x

#The way you tell R that you want to select some particular elements
#(i.e. a 'subset') from a vector is by placing an 'index vector' in
#square brackets immediately following the name of the vector.

x[1:10] #first 10 elements

# Index vectors come in four different flavors -- logical vectors, vectors
# of positive integers, vectors of negative integers, and vectors of
# character strings

x[is.na(x)] # gives all NA's in vector

y<-x[!is.na(x)]
y # '!' negates is.na(), so y holds all the non-NA elements

y[y>0] #all y values where y>0

x[!is.na(x) & x>0] # combination of above commands

Many programming languages use what's called 'zero-based indexing', which means that the first element of a vector is considered element 0. R uses 'one-based indexing', which (you guessed it!) means the first element of a vector is considered element 1.

x[1] #is the 1st element

x[c(3,4,7)] #3rd 4th and 7th element

x[0] # returns an empty vector (numeric(0))

x[3000] # returns NA because the index is out of range, so be careful about the length of the vector

x[c(-2,-10)] # all the elements except the 2nd and 10th

x[-c(2,10)] #similar to above command

vect <- c(foo = 11, bar= 2, norf=NA) # vector with named elements
vect

names(vect) #gives all the names

vect2 <- c(11,2,NA) #creating the vector 
names(vect2) <- c("foo","bar","norf") #assigning the names
vect2

identical(vect,vect2) # checks whether the two vectors are exactly equal

vect["bar"] #selecting based on name

vect[c("foo","bar")] #multiple selection based on name

Matrices and Data Frames

In this lesson, we'll cover matrices and data frames. Both represent 'rectangular' data types, meaning that they are used to store tabular data, with rows and columns.

# The main difference, as you'll see, is that matrices can only contain a
# single class of data, while data frames can consist of many different
# classes of data.

my_vector <- 1:20
my_vector

dim(my_vector) #vector has no dimensions

NULL

length(my_vector) #but it has length

dim(my_vector)<- c(4,5) #assigning dimensions 4 rows and 5 column
my_vector

dim(my_vector)

attributes(my_vector)

class(my_vector) # now it is a matrix (reported as "matrix" "array" in R 4.0 and later)

my_matrix <- my_vector
my_matrix2 = matrix(data=1:20, nrow= 4, ncol= 5) #another way of creating the matrix
my_matrix2

identical(my_matrix,my_matrix2)

patients<- c("Bill","Gina","Kelly","Sean")
cbind(patients,my_matrix) # converts every element of the matrix to a character string, which is bad for working with numbers.
                            # This is called 'implicit coercion', because we didn't ask for it.

# Hence the better way is to use a data frame
my_data <- data.frame(patients,my_matrix)
my_data

# Behind the scenes, the data.frame() function takes any number of
# arguments and returns a single object of class `data.frame` that is
# composed of the original objects.
class(my_data)

cnames <- c("patient","age","weight","bp","rating","test")
colnames(my_data) <- cnames #adding column name to data frame
my_data
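One way to see the difference between the cbind() result and the data frame is to inspect the stored types. This is a minimal sketch with a hypothetical two-patient table (the numbers are invented); it is not the lesson's my_data object.

```r
patients <- c("Bill", "Gina")
scores <- matrix(c(30L, 41L, 150L, 180L), nrow = 2)  # hypothetical measurements

# cbind() must produce a matrix, so the numbers are coerced to strings
combined <- cbind(patients, scores)
matrix_type <- typeof(combined)       # "character"

# data.frame() stores each column separately, so numeric columns stay numeric
df <- data.frame(patients, scores)
column_types <- sapply(df, typeof)

print(matrix_type)
print(column_types)
```

With the numeric columns intact, calls such as mean(df$X1) work directly, which would fail on the character matrix.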

Looking at Data

# Whenever you're working with a new dataset, the first thing you should
# do is look at it! What is the format of the data? What are the
# dimensions? What are the variable names? How are the variables stored?
# Are there missing data? Are there any flaws in the data?

laliga=read.csv("SP1.csv")

ls()

class(laliga) #object type

dim(laliga) #dimensions

nrow(laliga) # number of rows

ncol(laliga) #number of columns

object.size(laliga) # memory the object occupies in the R session (not the size of the file on disk)

177552 bytes

names(laliga) #column names

head(laliga) # first 6 rows by default

head(laliga,10) #first 10 rows

tail(laliga,15) #last 15 rows

summary(laliga) # per-column summary statistics

  Div            Date           HomeTeam         AwayTeam        FTHG      
 SP1:380   12/05/18:  8   Alaves    : 19   Alaves    : 19   Min.   :0.000  
           19/05/18:  6   Ath Bilbao: 19   Ath Bilbao: 19   1st Qu.:0.750  
           01/04/18:  5   Ath Madrid: 19   Ath Madrid: 19   Median :1.000  
           01/10/17:  5   Barcelona : 19   Barcelona : 19   Mean   :1.547  
           03/03/18:  5   Betis     : 19   Betis     : 19   3rd Qu.:2.000  
           05/11/17:  5   Celta     : 19   Celta     : 19   Max.   :7.000  
           (Other) :346   (Other)   :266   (Other)   :266                  
      FTAG       FTR          HTHG             HTAG        HTR    
 Min.   :0.000   A:115   Min.   :0.0000   Min.   :0.0000   A: 93  
 1st Qu.:0.000   D: 86   1st Qu.:0.0000   1st Qu.:0.0000   D:159  
 Median :1.000   H:179   Median :0.0000   Median :0.0000   H:128  
 Mean   :1.147           Mean   :0.6605   Mean   :0.4868          
 3rd Qu.:2.000           3rd Qu.:1.0000   3rd Qu.:1.0000          
 Max.   :6.000           Max.   :5.0000   Max.   :3.0000          
                                                                  
       HS              AS             HST              AST        
 Min.   : 2.00   Min.   : 1.00   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:10.00   1st Qu.: 8.00   1st Qu.: 3.000   1st Qu.: 2.000  
 Median :13.00   Median :10.00   Median : 4.500   Median : 3.000  
 Mean   :13.53   Mean   :10.47   Mean   : 4.758   Mean   : 3.805  
 3rd Qu.:16.00   3rd Qu.:13.00   3rd Qu.: 6.000   3rd Qu.: 5.000  
 Max.   :30.00   Max.   :24.00   Max.   :14.000   Max.   :13.000  
                                                                  
       HF              AF              HC               AC        
 Min.   : 4.00   Min.   : 0.00   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:11.00   1st Qu.:11.00   1st Qu.: 4.000   1st Qu.: 2.000  
 Median :13.00   Median :14.00   Median : 5.000   Median : 4.000  
 Mean   :13.73   Mean   :13.95   Mean   : 5.613   Mean   : 4.192  
 3rd Qu.:17.00   3rd Qu.:17.00   3rd Qu.: 7.000   3rd Qu.: 6.000  
 Max.   :29.00   Max.   :29.00   Max.   :16.000   Max.   :14.000  
                                                                  
       HY              AY              HR               AR         
 Min.   :0.000   Min.   :0.000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:1.000   1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:0.00000  
 Median :2.000   Median :3.000   Median :0.0000   Median :0.00000  
 Mean   :2.339   Mean   :2.676   Mean   :0.1105   Mean   :0.07895  
 3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.:0.0000   3rd Qu.:0.00000  
 Max.   :8.000   Max.   :9.000   Max.   :2.0000   Max.   :2.00000  
                                                                   
     B365H            B365D            B365A             BWH        
 Min.   : 1.050   Min.   : 2.790   Min.   : 1.170   Min.   : 1.050  
 1st Qu.: 1.617   1st Qu.: 3.290   1st Qu.: 2.600   1st Qu.: 1.650  
 Median : 2.075   Median : 3.500   Median : 3.700   Median : 2.100  
 Mean   : 2.777   Mean   : 4.259   Mean   : 5.192   Mean   : 2.744  
 3rd Qu.: 2.790   3rd Qu.: 4.330   3rd Qu.: 5.500   3rd Qu.: 2.750  
 Max.   :17.000   Max.   :15.000   Max.   :34.000   Max.   :14.500  
                                                                    
      BWD              BWA              IWH              IWD        
 Min.   : 2.950   Min.   : 1.180   Min.   : 1.070   Min.   : 3.050  
 1st Qu.: 3.300   1st Qu.: 2.600   1st Qu.: 1.650   1st Qu.: 3.300  
 Median : 3.600   Median : 3.700   Median : 2.100   Median : 3.500  
 Mean   : 4.278   Mean   : 5.204   Mean   : 2.721   Mean   : 4.161  
 3rd Qu.: 4.330   3rd Qu.: 5.500   3rd Qu.: 2.700   3rd Qu.: 4.200  
 Max.   :15.500   Max.   :34.000   Max.   :15.000   Max.   :12.000  
                                                                    
      IWA              LBH              LBD              LBA        
 Min.   : 1.200   Min.   : 1.050   Min.   : 2.900   Min.   : 1.170  
 1st Qu.: 2.600   1st Qu.: 1.610   1st Qu.: 3.250   1st Qu.: 2.575  
 Median : 3.500   Median : 2.050   Median : 3.500   Median : 3.600  
 Mean   : 5.041   Mean   : 2.742   Mean   : 4.152   Mean   : 5.375  
 3rd Qu.: 5.300   3rd Qu.: 2.750   3rd Qu.: 4.200   3rd Qu.: 5.500  
 Max.   :27.000   Max.   :19.000   Max.   :17.000   Max.   :41.000  
                  NA's   :1        NA's   :1        NA's   :1       
      PSH              PSD              PSA              WHH        
 Min.   : 1.050   Min.   : 3.020   Min.   : 1.180   Min.   : 1.060  
 1st Qu.: 1.660   1st Qu.: 3.410   1st Qu.: 2.670   1st Qu.: 1.665  
 Median : 2.120   Median : 3.705   Median : 3.845   Median : 2.100  
 Mean   : 2.857   Mean   : 4.539   Mean   : 5.522   Mean   : 2.738  
 3rd Qu.: 2.850   3rd Qu.: 4.455   3rd Qu.: 5.942   3rd Qu.: 2.750  
 Max.   :19.650   Max.   :20.380   Max.   :36.500   Max.   :17.000  
                                                                    
      WHD              WHA              VCH              VCD        
 Min.   : 2.900   Min.   : 1.170   Min.   : 1.040   Min.   : 3.000  
 1st Qu.: 3.250   1st Qu.: 2.600   1st Qu.: 1.650   1st Qu.: 3.400  
 Median : 3.500   Median : 3.550   Median : 2.100   Median : 3.700  
 Mean   : 4.092   Mean   : 5.041   Mean   : 2.762   Mean   : 4.416  
 3rd Qu.: 4.200   3rd Qu.: 5.500   3rd Qu.: 2.800   3rd Qu.: 4.400  
 Max.   :15.000   Max.   :26.000   Max.   :15.000   Max.   :17.000  
                                                                    
      VCA             Bb1X2           BbMxH            BbAvH       
 Min.   : 1.180   Min.   : 3.00   Min.   : 1.080   Min.   : 1.050  
 1st Qu.: 2.630   1st Qu.:35.00   1st Qu.: 1.700   1st Qu.: 1.640  
 Median : 3.700   Median :37.00   Median : 2.200   Median : 2.090  
 Mean   : 5.472   Mean   :37.71   Mean   : 2.966   Mean   : 2.743  
 3rd Qu.: 5.750   3rd Qu.:40.00   3rd Qu.: 2.882   3rd Qu.: 2.765  
 Max.   :36.000   Max.   :43.00   Max.   :19.650   Max.   :16.300  
                                                                   
     BbMxD            BbAvD            BbMxA            BbAvA       
 Min.   : 3.110   Min.   : 2.940   Min.   : 1.210   Min.   : 1.170  
 1st Qu.: 3.478   1st Qu.: 3.328   1st Qu.: 2.728   1st Qu.: 2.607  
 Median : 3.750   Median : 3.570   Median : 3.920   Median : 3.665  
 Mean   : 4.636   Mean   : 4.261   Mean   : 6.107   Mean   : 5.190  
 3rd Qu.: 4.553   3rd Qu.: 4.272   3rd Qu.: 6.105   3rd Qu.: 5.543  
 Max.   :20.380   Max.   :15.320   Max.   :67.000   Max.   :33.420  
                                                                    
      BbOU          BbMx.2.5        BbAv.2.5       BbMx.2.5.1   
 Min.   : 3.00   Min.   :1.130   Min.   :1.120   Min.   :1.470  
 1st Qu.:31.75   1st Qu.:1.667   1st Qu.:1.617   1st Qu.:1.780  
 Median :34.00   Median :1.960   Median :1.880   Median :2.000  
 Mean   :34.06   Mean   :1.950   Mean   :1.872   Mean   :2.284  
 3rd Qu.:37.00   3rd Qu.:2.203   3rd Qu.:2.120   3rd Qu.:2.402  
 Max.   :42.00   Max.   :3.080   Max.   :2.850   Max.   :7.000  
                                                                
   BbAv.2.5.1         BbAH           BbAHh            BbMxAHH     
 Min.   :1.410   Min.   : 1.00   Min.   :-3.2500   Min.   :1.610  
 1st Qu.:1.718   1st Qu.:17.00   1st Qu.:-0.7500   1st Qu.:1.890  
 Median :1.920   Median :18.00   Median :-0.2500   Median :1.985  
 Mean   :2.162   Mean   :18.16   Mean   :-0.4059   Mean   :1.988  
 3rd Qu.:2.283   3rd Qu.:19.00   3rd Qu.: 0.0625   3rd Qu.:2.070  
 Max.   :5.970   Max.   :24.00   Max.   : 2.0000   Max.   :2.420  
                                                                  
    BbAvAHH         BbMxAHA         BbAvAHA           PSCH       
 Min.   :1.580   Min.   :1.680   Min.   :1.630   Min.   : 1.060  
 1st Qu.:1.840   1st Qu.:1.897   1st Qu.:1.850   1st Qu.: 1.640  
 Median :1.930   Median :1.970   Median :1.930   Median : 2.120  
 Mean   :1.938   Mean   :1.988   Mean   :1.937   Mean   : 2.839  
 3rd Qu.:2.020   3rd Qu.:2.080   3rd Qu.:2.030   3rd Qu.: 2.980  
 Max.   :2.340   Max.   :2.520   Max.   :2.440   Max.   :18.700  
                                                 NA's   :1       
      PSCD             PSCA       
 Min.   : 2.930   Min.   : 1.160  
 1st Qu.: 3.410   1st Qu.: 2.590  
 Median : 3.700   Median : 3.850  
 Mean   : 4.508   Mean   : 5.695  
 3rd Qu.: 4.560   3rd Qu.: 6.095  
 Max.   :18.500   Max.   :46.000  
 NA's   :1        NA's   :1

table(laliga$HomeTeam) #table for column Home Team

     Alaves  Ath Bilbao  Ath Madrid   Barcelona       Betis       Celta 
         19          19          19          19          19          19 
      Eibar     Espanol      Getafe      Girona   La Coruna  Las Palmas 
         19          19          19          19          19          19 
    Leganes     Levante      Malaga Real Madrid     Sevilla    Sociedad 
         19          19          19          19          19          19 
   Valencia  Villarreal 
         19          19

str(laliga) # structure of the data

'data.frame':	380 obs. of  64 variables:
 $ Div       : Factor w/ 1 level "SP1": 1 1 1 1 1 1 1 1 1 1 ...
 $ Date      : Factor w/ 137 levels "01/03/18","01/04/18",..: 75 75 83 83 83 90 90 90 97 97 ...
 $ HomeTeam  : Factor w/ 20 levels "Alaves","Ath Bilbao",..: 13 19 6 10 17 2 4 11 14 15 ...
 $ AwayTeam  : Factor w/ 20 levels "Alaves","Ath Bilbao",..: 1 12 18 3 8 9 5 16 20 7 ...
 $ FTHG      : int  1 1 2 2 1 0 2 0 1 0 ...
 $ FTAG      : int  0 0 3 2 1 0 0 3 0 1 ...
 $ FTR       : Factor w/ 3 levels "A","D","H": 3 3 1 2 2 2 3 1 3 1 ...
 $ HTHG      : int  1 1 1 2 1 0 2 0 0 0 ...
 $ HTAG      : int  0 0 1 0 1 0 0 2 0 0 ...
 $ HTR       : Factor w/ 3 levels "A","D","H": 3 3 2 3 2 2 3 1 2 2 ...
 $ HS        : int  16 22 16 13 9 12 15 12 14 10 ...
 $ AS        : int  6 5 13 9 9 8 3 16 9 13 ...
 $ HST       : int  9 6 5 6 4 2 2 6 3 4 ...
 $ AST       : int  3 4 6 3 6 2 0 8 1 6 ...
 $ HF        : int  14 25 12 15 14 16 16 16 18 16 ...
 $ AF        : int  18 13 11 15 12 15 15 12 14 15 ...
 $ HC        : int  4 5 5 6 7 7 8 4 11 3 ...
 $ AC        : int  2 2 4 0 3 6 0 4 6 7 ...
 $ HY        : int  0 3 3 2 2 1 2 5 1 2 ...
 $ AY        : int  1 3 1 4 4 3 1 1 3 3 ...
 $ HR        : int  0 0 0 0 1 0 0 0 0 0 ...
 $ AR        : int  0 1 0 1 0 1 0 1 0 0 ...
 $ B365H     : num  2.05 1.75 2.38 8 1.62 1.5 1.17 9.5 3.25 2.1 ...
 $ B365D     : num  3.2 3.8 3.25 4.33 4 4 8 5.75 3.25 3.3 ...
 $ B365A     : num  4.1 4.5 3.2 1.45 5.5 7.5 15 1.3 2.3 3.7 ...
 $ BWH       : num  2.05 1.75 2.4 7.5 1.62 1.48 1.18 9.25 3.25 2.15 ...
 $ BWD       : num  3.1 3.9 3.3 4.33 3.9 4.25 7.5 5.75 3.2 3.3 ...
 $ BWA       : num  4.1 4.6 3 1.45 5.75 7 14.5 1.3 2.3 3.5 ...
 $ IWH       : num  2.1 1.75 2.5 7.2 1.55 1.5 1.17 7.5 3.3 2.1 ...
 $ IWD       : num  3.4 3.6 3.3 4.4 4 4.2 7.5 5.5 3.35 3.4 ...
 $ IWA       : num  3.5 4.8 2.85 1.45 6.2 6.5 15 1.35 2.2 3.5 ...
 $ LBH       : num  2.05 1.75 2.35 7.5 1.6 1.5 1.2 9.5 3.25 2.1 ...
 $ LBD       : num  3 3.8 3.25 4 3.9 4 6.5 5.25 3.1 3.1 ...
 $ LBA       : num  4.2 4.33 3 1.5 5.5 7 15 1.3 2.3 3.4 ...
 $ PSH       : num  2.03 1.78 2.44 8.36 1.62 ...
 $ PSD       : num  3.25 4.01 3.4 4.38 4.17 4.37 7.35 5.79 3.24 3.36 ...
 $ PSA       : num  4.52 4.83 3.16 1.49 6.18 7.31 15.5 1.33 2.36 3.49 ...
 $ WHH       : num  2.05 1.8 2.4 8 1.67 1.5 1.22 11 3.1 2.2 ...
 $ WHD       : num  3.1 3.75 3.4 4.2 3.6 4 6 4.5 3.1 3.3 ...
 $ WHA       : num  4 4.2 2.9 1.44 5.5 7 13 1.33 2.4 3.3 ...
 $ VCH       : num  2.05 1.8 2.4 7.5 1.65 1.5 1.2 9.5 3.25 2.15 ...
 $ VCD       : num  3.2 4 3.4 4.3 4 4.2 7 5.75 3.25 3.3 ...
 $ VCA       : num  4.4 4.6 3.13 1.5 5.75 7 13 1.3 2.3 3.5 ...
 $ Bb1X2     : int  35 35 35 35 35 34 35 35 34 34 ...
 $ BbMxH     : num  2.12 1.83 2.5 8.36 1.69 ...
 $ BbAvH     : num  2.03 1.77 2.39 7.53 1.63 1.5 1.19 9.68 3.26 2.18 ...
 $ BbMxD     : num  3.4 4.04 3.5 4.4 4.17 4.4 8 5.86 3.35 3.4 ...
 $ BbAvD     : num  3.15 3.86 3.32 4.17 3.93 4.17 7.11 5.44 3.17 3.26 ...
 $ BbMxA     : num  4.52 4.83 3.2 1.51 6.2 7.5 17 1.35 2.4 3.7 ...
 $ BbAvA     : num  4.17 4.46 3.01 1.48 5.58 ...
 $ BbOU      : int  31 33 34 34 33 32 27 27 32 32 ...
 $ BbMx.2.5  : num  2.84 1.69 2.03 2.2 1.81 2.01 1.44 1.5 2.42 2.25 ...
 $ BbAv.2.5  : num  2.68 1.64 1.98 2.11 1.75 1.94 1.4 1.46 2.36 2.14 ...
 $ BbMx.2.5.1: num  1.53 2.4 1.9 1.8 2.14 1.96 3.1 2.95 1.63 1.76 ...
 $ BbAv.2.5.1: num  1.46 2.27 1.84 1.74 2.09 1.87 2.88 2.64 1.58 1.7 ...
 $ BbAH      : int  18 16 18 16 16 17 17 16 15 17 ...
 $ BbAHh     : num  -0.5 -0.75 -0.25 1.25 -1 -1 -2 1.5 0.25 -0.25 ...
 $ BbMxAHH   : num  2.07 2.05 2.08 1.77 2.12 1.9 2.05 2.03 1.93 1.92 ...
 $ BbAvAHH   : num  2.03 1.97 2.05 1.75 2.06 1.86 2 1.98 1.89 1.88 ...
 $ BbMxAHA   : num  1.9 1.96 1.87 2.25 1.86 2.05 1.91 1.95 2.03 2.04 ...
 $ BbAvAHA   : num  1.86 1.91 1.83 2.16 1.82 2.01 1.86 1.89 1.98 1.99 ...
 $ PSCH      : num  1.98 1.78 2.12 6.93 1.64 1.53 1.2 12.4 3.31 2.2 ...
 $ PSCD      : num  3.35 4.24 3.53 3.83 4.18 4.48 8.25 7 3.32 3.27 ...
 $ PSCA      : num  4.63 4.43 3.74 1.63 5.82 6.91 15.2 1.26 2.4 3.85 ...
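The question "Are there missing data?" can also be answered directly rather than by scanning the summary() output for NA's rows. The sketch below uses a small invented data frame (`toy`) as a stand-in, since SP1.csv may not be available; the same calls would apply to any data frame.

```r
# Hypothetical data frame with a few missing values
toy <- data.frame(
  home = c(1, NA, 3),
  away = c(NA, NA, 2),
  team = c("a", "b", "c")
)

any_missing <- anyNA(toy)                # is there at least one NA anywhere?
na_counts <- colSums(is.na(toy))         # number of NAs per column
complete_rows <- complete.cases(toy)     # TRUE for rows without any NA

print(na_counts)
print(complete_rows)
```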

Base Graphics

data(cars)

head(cars)

options(repr.plot.width=4, repr.plot.height=4) # reduce the size of the plot; otherwise it fills up the screen
plot(cars) # uses the first column for the x-axis and the second for the y-axis

plot(x=cars$speed, y=cars$dist) #specifying the axis

plot(y=cars$speed, x=cars$dist) # switching the axes compared with the call above

plot(x=cars$speed, y=cars$dist, xlab= "Speed") #labelling the x-axis

plot(x=cars$speed, y=cars$dist, xlab= "Speed",ylab = "Stopping Distance") # labeling the y-axis

plot(cars,main="My Plot") #title

plot(cars,sub="My Plot Subtitle") #sub title

plot(cars,col=2) #change color for the points

plot(cars,xlim=c(10,15)) #limiting the x-axis

plot(cars,pch=2) # changing the point symbol

data(mtcars) # loading the mtcars data set

str(mtcars)

'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

head(mtcars)

boxplot(mpg ~ cyl , data=mtcars) #box plot

hist(mtcars$mpg) #histogram
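Both boxplot() and hist() can also return the numbers behind the picture when called with plot = FALSE. The following sketch uses that to inspect what the two plots above are based on; the variable names are illustrative.

```r
data(mtcars)

# Statistics behind the box plot of mpg by number of cylinders
box_stats <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)
group_names <- box_stats$names   # one group per distinct cyl value
group_sizes <- box_stats$n       # number of cars in each group

# Bin counts behind the histogram of mpg
hist_data <- hist(mtcars$mpg, plot = FALSE)
total_counted <- sum(hist_data$counts)  # every car falls into exactly one bin

print(setNames(group_sizes, group_names))
print(hist_data$breaks)
```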

Manipulating Data with dplyr

dplyr is a fast and powerful R package written by Hadley Wickham and Romain Francois that provides a consistent and concise grammar for manipulating tabular data.

library("dplyr") #loading the package

packageVersion("dplyr") #check the version

[1] ‘0.7.4’

mydf=read.csv("SP1.csv") #reading the data set to mydf

cran<-tbl_df(mydf) # "The main advantage to using a tbl_df over a regular data frame is the printing."
                   # (newer dplyr releases deprecate tbl_df() in favor of as_tibble())

cran # the Jupyter notebook doesn't display a tbl_df well

This is how it looks in the R console, for example

Specifically, dplyr supplies five 'verbs' that cover most fundamental data manipulation tasks: select(), filter(), arrange(), mutate(), and summarize().

head(select(cran,HomeTeam,AwayTeam,FTAG,FTHG)) # select the columns needed; note that the order specified is maintained

head(select(cran,HomeTeam:FTR)) #selects all column from HomeTeam to FTR

head(select(cran,FTR:HomeTeam)) #also possible in reverse order

head(select(cran,HomeTeam:FTR, -FTAG)) # a minus sign before a column name excludes that column

cran_sub<-select(cran, -(HS:PSCA), -Div) # removes all the columns from HS to PSCA, and also Div
head(cran_sub)

"How do I select a subset of rows?" That's where the filter() function comes in.¶

head(filter(cran_sub, HomeTeam == "Barcelona")) # only rows where HomeTeam is Barcelona

filter(cran_sub, HomeTeam == "Barcelona", FTR== "D") # rows where HomeTeam is Barcelona and FTR is D (draw at home)

filter(cran_sub, HomeTeam == "Barcelona", FTHG>3) # adding a comparison operator:
                                                    # rows where HomeTeam is Barcelona and FTHG is more than 3
                                                    # (times Barcelona scored more than 3 goals at home)

head(filter(cran_sub, AwayTeam == "Barcelona" | HomeTeam =="Barcelona")) # rows where either the home or the away team is
                                                                        # Barcelona

filter(cran_sub, AwayTeam == "Barcelona", HTHG>HTAG , FTAG>=FTHG ) # Barcelona away games where they trailed at half time
                                                                    # but won or drew by full time

filter(cran_sub, is.na(FTHG)) # returns no rows, because FTHG has no missing values

head(filter(cran_sub, !is.na(FTHG))) # !is.na() keeps only the rows where FTHG is not NA

arrange() is used to sort the rows by the values of one or more columns

head(arrange(cran_sub, FTHG)) # sorts by FTHG in ascending order

head(arrange(cran_sub, desc(FTHG))) # desc() sorts in descending order

head(arrange(cran_sub,  HomeTeam, desc(FTHG))) # sorts by HomeTeam ascending, then by FTHG descending within each team

It's common to create a new variable based on the value of one or more variables already in a dataset. The mutate() function does exactly this.

cran_GD <- mutate(cran_sub, GD = FTHG-FTAG)
head(cran_GD)                               # creates a new column GD = FTHG - FTAG

# Similarly, you can add, subtract, multiply, or divide column values to create new columns

summarize()

summarise(cran_sub, AHG = mean(FTHG)) # collapses the column into a single summary value:
                                     # average home goals

summarise(cran_sub, AAG = mean(FTAG)) # average away goals

summarise(cran_GD, AGD = mean(abs(GD))) # average absolute goal difference

Grouping and Chaining with dplyr

library("dplyr")

mydf=read.csv("SP1.csv")

cran<-tbl_df(mydf)
cran_sub<-select(cran, -(HS:PSCA), -Div)
head(cran_sub)

summarise(cran_sub, count =n())

by_team <- group_by(cran_sub, HomeTeam) # group_by() is a very important function for data analysis
team_sum = summarise(by_team, count =n(), unique = n_distinct(FTHG), avg_hg = mean(FTHG))
head(team_sum)
# every team plays 19 home games; unique is the number of distinct home-goal totals and avg_hg the average home goals

n() gives the number of rows in each group; n_distinct() gives the number of unique values
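To make the difference between n() and n_distinct() concrete, here is a minimal sketch on a tiny invented results table (`games` is hypothetical and is not taken from SP1.csv). It assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical mini results table: team A has three home games, team B has two
games <- data.frame(
  HomeTeam = c("A", "A", "A", "B", "B"),
  FTHG     = c(2, 2, 0, 1, 3)
)

by_home <- group_by(games, HomeTeam)
team_counts <- summarise(
  by_home,
  count  = n(),                # rows per group
  unique = n_distinct(FTHG),   # distinct goal totals per group
  avg_hg = mean(FTHG)          # average home goals per group
)

print(team_counts)
```

Team A has three games but only two distinct goal totals (2 and 0), which is exactly the gap between count and unique.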

# We need the value of 'avg_hg' that splits the teams into
# the top 10% and bottom 90% based on average home goals.
# In statistics, this is called the 0.90, or 90%,
# sample quantile. Use quantile(team_sum$avg_hg, probs = 0.90) to
# determine this number.

quantile(team_sum$avg_hg, probs = 0.90) # the 90th percentile is about 2.5; teams above it are in the top 10%

filter(team_sum, avg_hg >2.5) # only Real Madrid and Barcelona are above the 90th percentile
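The meaning of the 0.90 quantile is easiest to check on numbers where the answer is obvious. The sketch below uses the integers 1 to 101 (an arbitrary choice made for illustration) with R's default quantile method: roughly 90% of the values lie at or below the result and about 10% lie above it.

```r
x <- 1:101

q90 <- quantile(x, probs = 0.90)   # a named value; the name is "90%"
above <- x[x > q90]                # values in the top 10%

print(q90)
print(length(above))               # how many values exceed the 0.90 quantile
print(mean(x <= q90))              # share of values at or below it
```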

View(team_sum) opens the table in a viewer; it is not yet supported in the Jupyter R kernel

arrange(filter(team_sum, avg_hg >2.5), desc(avg_hg)) #sorting

# In this script, we've used a special chaining operator, %>%,
# which was originally introduced in the magrittr R package and
# has now become a key component of dplyr. You can pull up the
# related documentation with ?chain. The benefit of %>% is that
# it allows us to chain the function calls in a linear fashion.
# The code to the right of %>% operates on the result from the
# code to the left of %>%.

The same operation as above, done using %>%.

cran_sub %>% group_by(HomeTeam) %>%
summarise(count =n(), unique = n_distinct(FTHG), avg_hg = mean(FTHG)) %>% 
filter(avg_hg >2.5) %>% 
arrange(desc(avg_hg)) 

#1. group by HomeTeam
#2. summarise the data
#3. filter based on a condition
#4. arrange
# all without saving intermediate variables, in a linear fashion

Div	Date	HomeTeam	AwayTeam	FTHG	FTAG	FTR	HTHG	HTAG	HTR	⋯	BbAv.2.5.1	BbAH	BbAHh	BbMxAHH	BbAvAHH	BbMxAHA	BbAvAHA	PSCH	PSCD	PSCA
SP1	18/08/17	Leganes	Alaves	1	0	H	1	0	H	⋯	1.46	18	-0.50	2.07	2.03	1.90	1.86	1.98	3.35	4.63
SP1	18/08/17	Valencia	Las Palmas	1	0	H	1	0	H	⋯	2.27	16	-0.75	2.05	1.97	1.96	1.91	1.78	4.24	4.43
SP1	19/08/17	Celta	Sociedad	2	3	A	1	1	D	⋯	1.84	18	-0.25	2.08	2.05	1.87	1.83	2.12	3.53	3.74
SP1	19/08/17	Girona	Ath Madrid	2	2	D	2	0	H	⋯	1.74	16	1.25	1.77	1.75	2.25	2.16	6.93	3.83	1.63
SP1	19/08/17	Sevilla	Espanol	1	1	D	1	1	D	⋯	2.09	16	-1.00	2.12	2.06	1.86	1.82	1.64	4.18	5.82
SP1	20/08/17	Ath Bilbao	Getafe	0	0	D	0	0	D	⋯	1.87	17	-1.00	1.90	1.86	2.05	2.01	1.53	4.48	6.91

	Div	Date	HomeTeam	AwayTeam	FTHG	FTAG	FTR	HTHG	HTAG	HTR	⋯	BbAv.2.5.1	BbAH	BbAHh	BbMxAHH	BbAvAHH	BbMxAHA	BbAvAHA	PSCH	PSCD	PSCA
366	SP1	12/05/18	La Coruna	Villarreal	2	4	A	0	3	A	⋯	2.15	17	0.25	2.19	2.11	1.81	1.76	4.71	4.30	1.71
367	SP1	12/05/18	Real Madrid	Celta	6	0	H	3	0	H	⋯	3.94	19	-1.50	1.96	1.91	2.00	1.95	1.25	6.89	11.56
368	SP1	12/05/18	Sociedad	Leganes	3	2	H	2	1	H	⋯	2.04	19	-1.25	2.06	1.99	1.91	1.87	1.40	4.87	9.26
369	SP1	13/05/18	Espanol	Malaga	4	1	H	3	1	H	⋯	1.74	19	-0.75	1.86	1.83	2.07	2.03	1.63	3.97	6.24
370	SP1	13/05/18	Levante	Barcelona	5	4	H	2	1	H	⋯	3.23	18	1.50	2.11	2.06	1.86	1.81	7.70	5.40	1.40
371	SP1	19/05/18	Celta	Levante	4	2	H	2	1	H	⋯	2.82	20	-0.75	2.05	2.01	1.90	1.85	1.60	4.68	5.33
372	SP1	19/05/18	Las Palmas	Girona	1	2	A	1	2	A	⋯	2.36	19	0.25	1.94	1.90	2.00	1.95	3.44	4.01	2.06
373	SP1	19/05/18	Leganes	Betis	3	2	H	1	1	D	⋯	2.19	17	0.00	2.04	1.99	1.91	1.86	2.41	3.76	2.90
374	SP1	19/05/18	Malaga	Getafe	0	1	A	0	0	D	⋯	1.91	19	0.25	1.98	1.92	1.98	1.93	3.26	3.56	2.28
375	SP1	19/05/18	Sevilla	Alaves	1	0	H	1	0	H	⋯	2.86	19	-1.25	1.97	1.91	2.00	1.94	1.32	6.09	9.47
376	SP1	19/05/18	Villarreal	Real Madrid	2	2	D	0	2	A	⋯	3.79	19	0.25	2.05	1.98	1.93	1.87	4.74	5.05	1.62
377	SP1	20/05/18	Ath Bilbao	Espanol	0	1	A	0	1	A	⋯	2.06	17	-0.50	2.06	2.02	1.88	1.85	1.95	3.77	4.05
378	SP1	20/05/18	Ath Madrid	Eibar	2	2	D	1	1	D	⋯	1.98	19	-1.00	2.09	2.04	1.87	1.82	1.47	4.25	8.80
379	SP1	20/05/18	Barcelona	Sociedad	1	0	H	0	0	D	⋯	5.04	19	-2.00	1.94	1.89	2.03	1.97	1.31	6.40	8.60
380	SP1	20/05/18	Valencia	La Coruna	2	1	H	1	0	H	⋯	2.98	19	-1.50	2.01	1.97	1.94	1.89	1.27	6.44	10.71

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360	175	3.15	3.440	17.02	0	0	3	2
Valiant	18.1	6	225	105	2.76	3.460	20.22	1	0	3	1

Date	HomeTeam	AwayTeam	FTHG	FTAG	FTR	HTHG	HTAG	HTR
02/12/17	Barcelona	Celta	2	2	D	1	1	D
11/02/18	Barcelona	Getafe	0	0	D	0	0	D
06/05/18	Barcelona	Real Madrid	2	2	D	1	1	D

Date	HomeTeam	AwayTeam	FTHG	FTAG	FTR	HTHG	HTAG	HTR
09/09/17	Barcelona	Espanol	5	0	H	2	0	H
19/09/17	Barcelona	Eibar	6	1	H	2	0	H
17/12/17	Barcelona	La Coruna	4	0	H	2	0	H
24/02/18	Barcelona	Girona	6	1	H	4	1	H
09/05/18	Barcelona	Villarreal	5	1	H	3	0	H

Date	HomeTeam	AwayTeam	FTHG	FTAG	FTR	HTHG	HTAG	HTR
16/09/17	Getafe	Barcelona	1	2	A	1	0	H
14/10/17	Ath Madrid	Barcelona	1	1	D	1	0	H
14/01/18	Sociedad	Barcelona	2	4	A	2	1	H
31/03/18	Sevilla	Barcelona	2	2	D	1	0	H

Date	HomeTeam	AwayTeam	FTHG	FTAG	FTR	HTHG	HTAG	HTR
21/01/18	Real Madrid	La Coruna	7	1	H	2	1	H
19/09/17	Barcelona	Eibar	6	1	H	2	0	H
13/01/18	Girona	Las Palmas	6	0	H	1	0	H
24/02/18	Barcelona	Girona	6	1	H	4	1	H
18/03/18	Real Madrid	Girona	6	3	H	1	1	D
12/05/18	Real Madrid	Celta	6	0	H	3	0	H

HomeTeam	count	unique	avg_hg
Alaves	19	4	1.105263
Ath Bilbao	19	3	1.000000
Ath Madrid	19	5	1.578947
Barcelona	19	7	2.789474
Betis	19	5	1.842105
Celta	19	5	1.789474

patients
Bill	1	5	9	13	17
Gina	2	6	10	14	18
Kelly	3	7	11	15	19
Sean	4	8	12	16	20

patients	X1	X2	X3	X4	X5
Bill	1	5	9	13	17
Gina	2	6	10	14	18
Kelly	3	7	11	15	19
Sean	4	8	12	16	20

patient	age	weight	bp	rating	test
Bill	1	5	9	13	17
Gina	2	6	10	14	18
Kelly	3	7	11	15	19
Sean	4	8	12	16	20

speed	dist
4	2
4	10
7	4
7	22
8	16
9	10
