2018-12-07

What is R?

R in a nutshell


  • free software for data analysis
  • an interpreted programming language, derived from S (the language also behind S-PLUS)
  • initially developed by R. Ihaka and R. Gentleman (1996)
  • currently developed by the R Core Team (20 people)
  • one of the largest collections of tools for data analysis (1,000s of contributors)

Where can you get it?

What can you do with it?

  • basic statistics: statistical tests, linear modelling, multivariate analysis
  • spatial statistics: GIS, mapping, clustering
  • graph theory: social sciences, network analysis, graph algorithms
  • genetics: phylogenetic trees, genetic markers, genomics
  • epidemiology!
  • much more: see the CRAN task views at cran.r-project.org/web/views/

What does “free” mean?


  • Freedom = ability to make informed decisions
  • you don’t pay for it
  • the code is accessible by anyone
  • anyone can use, modify and share the code

Getting started


  • get R for your system (download from CRAN)
  • get a Graphical User Interface (GUI): RStudio, emacs + ESS, Tinn-R
  • (or at least) get a text editor to write code: notepad++, emacs, vi, Tinn-R, …

And then…

Getting help

Storing data in R

How does R store information?

  • no files: everything is held in RAM (i.e. temporary memory)
  • data, results, functions, etc. are all R objects
  • one object can be saved / loaded using saveRDS/readRDS (output: .rds files)
  • several objects can be saved / loaded using save/load (output: .RData files)
  • an entire session can be saved using save.image
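
A minimal sketch of the save / load functions listed above. The object names (ages, ids) and the temporary files created with tempfile() are placeholders chosen for illustration only:

```r
# Hypothetical objects used only for illustration
ages <- c(10, 54, 3)
ids <- c("p1", "p2", "p3")

# One object: saveRDS / readRDS (.rds file)
rds_file <- tempfile(fileext = ".rds")
saveRDS(ages, rds_file)
ages_copy <- readRDS(rds_file)   # the result must be assigned to a name
identical(ages, ages_copy)
## [1] TRUE

# Several objects: save / load (.RData file)
rdata_file <- tempfile(fileext = ".RData")
save(ages, ids, file = rdata_file)
rm(ages, ids)                    # remove the objects from memory
load(rdata_file)                 # restores them under their original names
exists("ages") && exists("ids")
## [1] TRUE
```

Note the difference: readRDS returns the object, so you choose its name, whereas load recreates the objects under the names they had when saved.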

How to create objects?

General syntax: object_name <- content:

toto <- 1:8
toto # check content: 1, 2, 3, ...
## [1] 1 2 3 4 5 6 7 8
toto <- "some text"
toto # content has changed
## [1] "some text"

Whole numbers: integer

a <- 1:10
a
##  [1]  1  2  3  4  5  6  7  8  9 10
class(a)
## [1] "integer"

Decimal numbers: numeric

b <- c(-0.1, 10.123, pi)
b
## [1] -0.100000 10.123000  3.141593
class(b)
## [1] "numeric"
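
The difference between integer and numeric can be checked directly. A small sketch: a number typed without decimals is still stored as numeric unless it carries the L suffix or comes from the : operator.

```r
class(5)        # a plain number is numeric, even without decimals
## [1] "numeric"
class(5L)       # the L suffix requests an integer
## [1] "integer"
class(1:3)      # the : operator produces integers
## [1] "integer"
class(1:3 / 2)  # division returns numeric values
## [1] "numeric"
is.numeric(5L)  # integers still count as numbers
## [1] TRUE
```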

Text: character

a <- c("hello world", "this is fun", "even more fun coming")
a
## [1] "hello world"          "this is fun"          "even more fun coming"
class(a)
## [1] "character"

Categorical variables: factor

a <- factor(c("red", "blue", "green", "red", "green"))
a
## [1] red   blue  green red   green
## Levels: blue green red
class(a)
## [1] "factor"
levels(a)
## [1] "blue"  "green" "red"
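
By default the levels are sorted alphabetically, as in the output above. A brief sketch of how the order can be set with the levels argument, and how the values are stored internally (the object name colours is arbitrary):

```r
colours <- factor(c("red", "blue", "green", "red", "green"),
                  levels = c("red", "green", "blue"))
levels(colours)      # order given by the levels argument
## [1] "red"   "green" "blue"
nlevels(colours)     # number of categories
## [1] 3
as.integer(colours)  # factors are stored as integer codes
## [1] 1 3 2 1 2
table(colours)       # counts per category
```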

Booleans: logical

The logical type can be TRUE or FALSE:

a <- c(TRUE, FALSE, TRUE, TRUE)
a
## [1]  TRUE FALSE  TRUE  TRUE
class(a)
## [1] "logical"

Vectors

A vector stores several values of the same type as a one-dimensional array:

a <- c(1, 2, 10, -1, 1.123)
a
## [1]  1.000  2.000 10.000 -1.000  1.123
length(a)
## [1] 5
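
Because all values of a vector share one type, mixing types in c() converts everything to the most general type. A short illustration:

```r
class(c(1L, 2.5))   # integer + numeric -> numeric
## [1] "numeric"
class(c(1, "a"))    # number + text -> character
## [1] "character"
c(1, "a")
## [1] "1" "a"
class(c(TRUE, 2))   # logical + number -> numeric
## [1] "numeric"
c(TRUE, 2)
## [1] 1 2
```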

Matrices

A matrix stores several values of the same type as a table:

a <- matrix(sample(1:12), ncol = 4)
a
##      [,1] [,2] [,3] [,4]
## [1,]   12    7    8    4
## [2,]   11   10    2    5
## [3,]    1    6    9    3
class(a)
## [1] "matrix"
dim(a)
## [1] 3 4
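
The example above uses random values, which hides how a matrix is filled. A sketch with ordered values (the names m and m_row are arbitrary) shows that values go in column by column unless byrow = TRUE is given:

```r
m <- matrix(1:6, ncol = 3)                    # filled column by column (default)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
m_row <- matrix(1:6, ncol = 3, byrow = TRUE)  # filled row by row
m_row
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
```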

Data frames

A data.frame is a table where variables (columns) can have different types (equivalent to a spreadsheet):

a <- data.frame(age = c(10, 54, 3), sex = c("m", "f", "m"))
a
##   age sex
## 1  10   m
## 2  54   f
## 3   3   m
class(a)
## [1] "data.frame"
dim(a)
## [1] 3 2
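
Individual columns behave like vectors and can be extracted with $. A small sketch reusing the same data under the arbitrary name patients:

```r
patients <- data.frame(age = c(10, 54, 3), sex = c("m", "f", "m"))
names(patients)        # column names
## [1] "age" "sex"
patients$age           # extract one column with $
## [1] 10 54  3
class(patients$age)
## [1] "numeric"
mean(patients$age)
## [1] 22.33333
# The class of patients$sex depends on the R version: text columns were
# converted to factor by default before R 4.0.0, and are kept as character since.
class(patients$sex)
```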

Lists

A list is a collection of objects of any type and size, stored as different slots:

age <- c(10, 54, 3)
sex <- factor(c("m", "f", "m"))
swab <- matrix(
  sample(c("+", "-"), 10, replace = TRUE), nrow = 2,
  dimnames = list(NULL, paste0("t", 1:5)))
x <- list(age = age, gender = sex, swab_results = swab)

Lists (continued)

x
## $age
## [1] 10 54  3
## 
## $gender
## [1] m f m
## Levels: f m
## 
## $swab_results
##      t1  t2  t3  t4  t5 
## [1,] "+" "+" "-" "+" "+"
## [2,] "+" "+" "+" "-" "+"
class(x)
## [1] "list"
length(x)
## [1] 3

Summary: basic object types

  • integer: integer numbers
  • numeric: decimal numbers
  • character: character strings
  • factor: categorical variables
  • logical: TRUE / FALSE values
  • vector: collection of values (1-dimensional array, same type)
  • matrix: collection of values (table, same type)
  • data.frame: columns can have different types, but same length (spreadsheet)
  • list: collection of elements, no restriction of content
  • ...: other classes for DNA sequences, maps, networks, etc.

Using functions

What is a function?

A set of operations performed on a given input.

Syntax: function_name(argument1, argument2, ...)

Example:

rnorm(8, mean = 5, sd = 3)
## [1]  2.713883  7.373878  9.364017  1.173302  6.436040  5.624428  7.107626
## [8] -1.727200
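
Arguments can be matched by position or by name, and arguments left out take their default values. A sketch using rnorm, with set.seed (not used in the example above) added so that the random draws are repeatable:

```r
set.seed(1)
x1 <- rnorm(8, 5, 3)                  # arguments matched by position
set.seed(1)
x2 <- rnorm(n = 8, sd = 3, mean = 5)  # matched by name, in any order
identical(x1, x2)
## [1] TRUE
length(rnorm(8))                      # mean and sd default to 0 and 1
## [1] 8
```

Naming arguments is usually safer than relying on position, since it does not depend on remembering their order.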

How to use a function?

Read the documentation by typing ?function_name.

Example:

?rnorm
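
Besides the help page (help("rnorm") is equivalent to ?rnorm), the arguments of a function and their defaults can be inspected from the console. A brief sketch:

```r
args(rnorm)             # prints the argument list with default values
names(formals(rnorm))   # argument names only
## [1] "n"    "mean" "sd"
formals(rnorm)$sd       # default value of one argument
## [1] 1
```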

Function results need storing

For example:

rnorm(5, mean = 3, sd = 2) # this result is not stored
## [1] -1.133996  7.153785  3.201488  1.897321  1.427742
toto <- rnorm(5, mean = 3, sd = 2) # store output in toto
toto
## [1] -0.04287389  7.92735044  3.46023461  0.21262429  3.52734135

Functions can be nested

For example:

hist(rnorm(1000, mean = 3, sd = 2), col = rainbow(15))

Handling objects

Subsetting objects


Objects can be subset by index, name, or logical vector, using:

  • object_name[] for a vector
  • object_name[rows, columns] for a matrix / data.frame
  • object_name[[]] for a list

Subsetting a vector: by index

x[foo] where foo is an integer vector

  • positive integers: positions of retained entries
  • negative integers: positions of discarded entries

x <- 10:1
x
##  [1] 10  9  8  7  6  5  4  3  2  1
x[c(1, 2, 5)]
## [1] 10  9  6
letters[2:10]
## [1] "b" "c" "d" "e" "f" "g" "h" "i" "j"
letters[-(1:20)]
## [1] "u" "v" "w" "x" "y" "z"

Subsetting a vector: logicals

x[foo] where foo is a logical vector, returns the values of x where foo is TRUE:

x <- 1:10
x < 5
##  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
x[x < 5]
## [1] 1 2 3 4

Subsetting a vector: names

x[foo] where x is named and foo is a character vector of retained names:

x <- sample(1:100, 10)
names(x) <- letters[1:10]
x
##   a   b   c   d   e   f   g   h   i   j 
##  92 100   8  53   7  22  35  49   2  71
x[c("c", "d", "a", "i")]
##  c  d  a  i 
##  8 53 92  2

Subsetting and replacing values

x[foo] <- new.value where x[foo] are values to be replaced with new.value:

x <- round(rnorm(14), 2)
x
##  [1] -1.32  0.12 -0.78  0.68  1.52 -0.06  0.95 -1.97  1.33 -1.42 -0.40
## [12]  2.35 -0.05  0.88
x[x < 0] <- 0
x
##  [1] 0.00 0.12 0.00 0.68 1.52 0.00 0.95 0.00 1.33 0.00 0.00 2.35 0.00 0.88
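
Replacement works with any of the three ways of subsetting. A sketch on a hypothetical named vector called scores:

```r
scores <- c(a = 5, b = -2, c = 8, d = -7)  # hypothetical named vector
scores[2] <- 0             # replace by position
scores["d"] <- 1           # replace by name
scores
## a b c d 
## 5 0 8 1
scores[scores > 4] <- 4    # replace by condition: cap values at 4
scores
## a b c d 
## 4 0 4 1
```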

Subsetting matrices and data frames

Same principle, with x[rows, columns]:

x <- matrix(1:15, nrow = 3)
x
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    4    7   10   13
## [2,]    2    5    8   11   14
## [3,]    3    6    9   12   15
x[c(3,1), ]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    3    6    9   12   15
## [2,]    1    4    7   10   13
x[2, 4:5]
## [1] 11 14
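
The example above uses a matrix; the same syntax applies to a data.frame, where columns can also be selected by name. A sketch using an illustrative data.frame called patients:

```r
patients <- data.frame(age = c(10, 54, 3), sex = c("m", "f", "m"))
patients[2, ]                  # second row, all columns
##   age sex
## 2  54   f
patients[patients$age < 20, ]  # rows meeting a condition
##   age sex
## 1  10   m
## 3   3   m
patients[, "age"]              # one column, returned as a vector
## [1] 10 54  3
dim(patients[, "age", drop = FALSE])  # drop = FALSE keeps the data.frame
## [1] 3 1
```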

Subsetting lists

x[foo] to return a list, x[[foo]] for a single element:

x <- list(a = rnorm(4), hi = "Hello", stuff = letters[1:10]); x
## $a
## [1] -0.03960258  1.20769218 -0.64288920  1.63578167
## 
## $hi
## [1] "Hello"
## 
## $stuff
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
x[c(1,3)]
## $a
## [1] -0.03960258  1.20769218 -0.64288920  1.63578167
## 
## $stuff
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
x[[3]]
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
x$stuff
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
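
The difference between single and double brackets is easiest to see from the class of the result. A sketch on a similar list (with fixed values instead of random ones):

```r
x <- list(a = 1:4, hi = "Hello", stuff = letters[1:10])
class(x["hi"])      # single brackets: still a list
## [1] "list"
class(x[["hi"]])    # double brackets: the element itself
## [1] "character"
x[["a"]][2]         # subsetting can be chained
## [1] 2
x$new <- TRUE       # assigning to a new name adds a slot
names(x)
## [1] "a"     "hi"    "stuff" "new"
```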

Using logical operations

Logical operations 1/4

a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)
rbind(a,b)
##   [,1]  [,2]  [,3]  [,4]
## a TRUE  TRUE FALSE FALSE
## b TRUE FALSE  TRUE FALSE
a & b # logical 'AND'
## [1]  TRUE FALSE FALSE FALSE
a | b # logical 'OR'
## [1]  TRUE  TRUE  TRUE FALSE
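
These operators are most often used to combine conditions when subsetting. A short sketch:

```r
x <- 1:10
x[x > 2 & x < 8]    # both conditions must hold
## [1] 3 4 5 6 7
x[x < 3 | x > 8]    # at least one condition must hold
## [1]  1  2  9 10
```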

Logical operations 2/4

!a # logical 'NOT'
## [1] FALSE FALSE  TRUE  TRUE
any(a) # at least one TRUE
## [1] TRUE
all(a) # all TRUE
## [1] FALSE
which(a) # indices of the TRUEs
## [1] 1 2

Logical operations 3/4

is.na is useful for spotting missing data (NAs):

a <- c(3,NA,2,5,NA,10)
is.na(a)
## [1] FALSE  TRUE FALSE FALSE  TRUE FALSE
a[!is.na(a)]
## [1]  3  2  5 10
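
Missing values propagate through most calculations, so many summary functions offer an na.rm argument as an alternative to removing NAs by hand. A sketch on the same vector:

```r
a <- c(3, NA, 2, 5, NA, 10)
mean(a)                # NA propagates through the calculation
## [1] NA
mean(a, na.rm = TRUE)  # drop NAs before computing
## [1] 5
sum(is.na(a))          # number of missing values
## [1] 2
a == NA                # comparing with == does not work: every result is NA
## [1] NA NA NA NA NA NA
```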

Logical operations 4/4

Logicals are treated as numbers in arithmetic: TRUE = 1, FALSE = 0

a <- -10:5
sum(a > 0)
## [1] 5
mean(a > 0)
## [1] 0.3125
a * (a > 0)
##  [1] 0 0 0 0 0 0 0 0 0 0 0 1 2 3 4 5