5  Data Structure

This section is still under development

In Chapter 3, we learned about the different data types in R (numbers, text, logical values, etc.). Just as we organize physical items into containers, R needs special containers to organize and store different types of data. These containers are called data structures.

Each data structure in R has specific characteristics:

Before exploring each data structure, we can use these functions to inspect any container:
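
As a minimal sketch, assuming the base R functions class(), typeof(), length(), dim(), and str() are among the inspectors meant here, they can be applied to any object. The name scores is hypothetical.

```r
# Hypothetical example object: a numeric vector
scores <- c(90, 75.5, 60)

class(scores)   # the class of the object: "numeric"
typeof(scores)  # the underlying storage type: "double"
length(scores)  # number of elements: 3
dim(scores)     # dimensions; NULL for a plain vector
str(scores)     # compact summary of the structure
```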

5.1 Atomic Vectors

5.1.1 What is a Vector?

An atomic vector is the simplest data structure in R. Think of it as a single row or column that can hold only one type of data. To create a vector with a single value, assign the value to a name.

# Single Vector
fruit <- "papaya"

fruit
[1] "papaya"

To combine multiple values into one vector, use c().

# Multivalue vector
fruits <- c("papaya", "orange", "apple", "pineapple", "grape",
            "strawberries", "avocado")

fruits
[1] "papaya"       "orange"       "apple"        "pineapple"    "grape"       
[6] "strawberries" "avocado"     

To get the number of elements in an object use length().

length(fruit) # shows the number of elements in fruit
[1] 1
length(fruits) # shows the number of elements in fruits
[1] 7
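
Because an atomic vector holds only one type of data, c() converts mixed values to a single common type. The sketch below illustrates this behaviour; the object names are hypothetical.

```r
mixed_chr <- c(1, "a", TRUE)  # text is present, so every value becomes character
mixed_num <- c(1.5, TRUE)     # number and logical: TRUE becomes 1

mixed_chr          # "1" "a" "TRUE"
typeof(mixed_chr)  # "character"
mixed_num          # 1.5 1.0
typeof(mixed_num)  # "double"
```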

Next, we will look at the operations we can perform on vectors, such as arithmetic, indexing, and modifying elements.

5.1.2 Vector Operations

5.1.2.1 Basic Operation

Vectors support the usual arithmetic operations, such as addition, subtraction, multiplication, division, and the remainder operator %%.

# Arithmetic Operation
x <- 15
y <- 20
y + x
[1] 35
y %% x
[1] 5
y / x
[1] 1.333333

When vectors have more than one element, arithmetic is performed element by element. If the vectors have different lengths, R repeats the elements of the shorter vector, starting again from its first element, until it matches the length of the longer one. This behaviour is called vector recycling. If the length of the longer vector is a whole multiple of the length of the shorter one, recycling happens silently. If it is not, R still completes the operation but warns that the longer object length is not a multiple of the shorter object length.

# Vector recycling
a <- 1:5 # Creates vector of 1, 2, 3, 4, 5
b <- 2
c <- 7:9 # Creates vector of 7, 8, 9

a + b
[1] 3 4 5 6 7
a / b
[1] 0.5 1.0 1.5 2.0 2.5
a / c
Warning in a/c: longer object length is not a multiple of shorter object length
[1] 0.1428571 0.2500000 0.3333333 0.5714286 0.6250000

When operating on vectors of different lengths, R will “recycle” the shorter vector’s values to match the longer vector’s length. This can be useful but also dangerous if not used carefully!
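
The silent case is the one that is easy to miss. The sketch below, with hypothetical object names, shows recycling without a warning and one simple way to check the lengths beforehand.

```r
long  <- 1:6        # six elements
short <- c(10, 20)  # two elements; 6 is a multiple of 2

long + short        # 11 22 13 24 15 26, no warning

# A simple guard: check the lengths before relying on recycling
length(long) %% length(short) == 0  # TRUE
```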

5.1.3 Accessing Vector Elements

To access a value in a vector, use its index. Indexing lets us access or modify specific elements in different data structures. R counts positions from 1, and we use square brackets, [], to specify the position of an element. To get the first element of the fruits vector, papaya, we put its index position, 1, inside the square brackets.

Table 5.1: Index number of fruit vector
Vector:        papaya  orange  apple  pineapple  grape  strawberries  avocado
Index number:  1       2       3      4          5      6             7
fruits
[1] "papaya"       "orange"       "apple"        "pineapple"    "grape"       
[6] "strawberries" "avocado"     
fruits[1]
[1] "papaya"

To return multiple elements, put their index positions inside c() and pass it to the square brackets. You can return a range of elements using :.

# return elements at specific positions
fruits[c(7, 5, 2)] 
[1] "avocado" "grape"   "orange" 
# return the range of elements from position 2 to position 5
fruits[2:5] 
[1] "orange"    "apple"     "pineapple" "grape"    
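
Two details of indexing are worth seeing in a small sketch: the result follows the order of the indices you supply, and asking for a position that does not exist returns NA rather than an error. The vector veg is hypothetical.

```r
veg <- c("carrot", "leek", "kale")  # hypothetical vector

veg[c(3, 1)]  # "kale" "carrot": the result follows the order of the indices
veg[2:3]      # "leek" "kale"
veg[5]        # NA: there is no fifth element
```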

5.1.4 Modifying Vectors

5.1.4.1 Adding Elements to a Vector

There are two ways to add new elements to a vector. First, use the append() function:

fruits <- append(fruits, "banana") # adding new element
length(fruits) # The number of elements has increased by one.
[1] 8
fruits # The new elements
[1] "papaya"       "orange"       "apple"        "pineapple"    "grape"       
[6] "strawberries" "avocado"      "banana"      

Second, assign the new value to a new index position:

fruit 
[1] "papaya"
fruit[2] <- "mango" # add new element to a new index number
fruit
[1] "papaya" "mango" 
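
Both approaches have a detail that the examples above do not show. append() accepts an after argument that sets the insertion position, and assigning to an index beyond the next free position fills the gap with NA. A minimal sketch with a hypothetical vector:

```r
letters_vec <- c("a", "c")                          # hypothetical vector
letters_vec <- append(letters_vec, "b", after = 1)  # insert after position 1
letters_vec                                         # "a" "b" "c"

letters_vec[5] <- "e"  # position 4 is skipped, so R fills it with NA
letters_vec            # "a" "b" "c" NA "e"
```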

5.1.4.2 Altering Elements in a Vector

Vectors are altered using the index number of the element to be changed.

fruits[7] # Position to be changed
[1] "avocado"
fruits[7] <- "tomato" # Replace avocado with tomato
fruits # new elements
[1] "papaya"       "orange"       "apple"        "pineapple"    "grape"       
[6] "strawberries" "tomato"       "banana"      

5.1.4.3 Removing Elements in Vectors

To remove an element from a vector, put a minus sign in front of its index number.

fruits[-1] # Show values left after removing first value.
[1] "orange"       "apple"        "pineapple"    "grape"        "strawberries"
[6] "tomato"       "banana"      
fruits <- fruits[-1] # Reassign to the variable to keep the change.
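
Several elements can be removed at once by negating a vector of positions. The sketch below uses a hypothetical vector and shows that nothing changes until the result is reassigned.

```r
colours <- c("red", "green", "blue", "black")  # hypothetical vector

colours[-c(1, 3)]  # "green" "black": positions 1 and 3 are dropped
length(colours)    # 4: the original is unchanged without reassignment

colours <- colours[-c(1, 3)]  # reassign to keep the change
length(colours)               # 2
```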

To confirm the data structure of an object, use the matching is.*() function: is.vector() for vectors, is.matrix() for matrices, and so on. To convert from one structure to another, use the matching as.*() function.
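
A short sketch of the is.*() and as.*() families, using a hypothetical integer vector:

```r
nums <- 1:6                  # hypothetical integer vector

is.vector(nums)              # TRUE
is.matrix(nums)              # FALSE

nums_mat <- as.matrix(nums)  # converts to a one-column matrix
is.matrix(nums_mat)          # TRUE
dim(nums_mat)                # 6 1

as.character(nums)           # "1" "2" "3" "4" "5" "6"
```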

5.2 Matrices

5.2.1 What is a Matrix?

A matrix is a two-dimensional data structure that holds elements of the same type arranged in rows and columns. Think of it as a table with a single data type. To create a matrix, use the matrix() function. Within the function, specify the number of rows with nrow or the number of columns with ncol.

mat <- matrix(1:6, nrow = 2)
mat
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

The numbers in the matrix are filled in column by column. To fill the matrix row by row instead, set byrow to TRUE.

matrix(1:6, nrow = 2, byrow = TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
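
Specifying ncol instead of nrow works the same way, and R infers the other dimension from the number of values. A minimal sketch with hypothetical object names:

```r
by_col <- matrix(1:6, ncol = 2)                # three rows are inferred
by_row <- matrix(1:6, ncol = 2, byrow = TRUE)  # same shape, filled row by row

dim(by_col)  # 3 2
by_col[, 1]  # 1 2 3
by_row[, 1]  # 1 3 5
```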

We can confirm if the object is a matrix by using is.matrix()

is.matrix(mat) # confirm object is a matrix
[1] TRUE

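You can make a matrix from a vector by passing it to the matrix() function. Here fruits has seven elements, which cannot fill a matrix with four rows evenly, so R issues a warning and recycles the first element to fill the last cell.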

fruit_mat <- matrix(fruits, nrow = 4) # make matrix from fruits vector
Warning in matrix(fruits, nrow = 4): data length [7] is not a sub-multiple or
multiple of the number of rows [4]
fruit_mat
     [,1]        [,2]          
[1,] "orange"    "strawberries"
[2,] "apple"     "tomato"      
[3,] "pineapple" "banana"      
[4,] "grape"     "orange"      

To get the dimensions of a matrix, use the dim() function.

dim(fruit_mat)
[1] 4 2

The result [1] 4 2 is read as 4 by 2, that is, four rows and two columns, so fruit_mat is a four by two matrix. See Figure 5.1 for an example.

Figure 5.1: matrix indexing, source: https://biocore.crg.eu

For example, Figure 5.1 shows indexing for a simple 3 * 2 matrix.
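
The two numbers returned by dim() can also be obtained separately with nrow() and ncol(). A small sketch on a hypothetical 3 by 2 matrix:

```r
m <- matrix(1:6, nrow = 3)  # hypothetical 3 by 2 matrix

dim(m)     # 3 2
nrow(m)    # 3
ncol(m)    # 2
length(m)  # 6, the total number of elements
```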

5.2.2 Matrix Operations

5.2.2.1 Basic Operations

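Like vectors, matrices support arithmetic operations. When a matrix is combined with a single number, the operation is applied to each element of the matrix. When two matrices of the same dimensions are combined, the operation is applied element by element.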

my_mat <- matrix(1:6, nrow = 3)
my_mat
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
my_mat * 5
     [,1] [,2]
[1,]    5   20
[2,]   10   25
[3,]   15   30
my_mat - 15
     [,1] [,2]
[1,]  -14  -11
[2,]  -13  -10
[3,]  -12   -9
my_mat2 <- matrix(2:7, nrow = 3)
my_mat2 / my_mat
         [,1]     [,2]
[1,] 2.000000 1.250000
[2,] 1.500000 1.200000
[3,] 1.333333 1.166667

Matrices with different dimensions cannot be added together.

my_mat3 <- matrix(7:15, nrow = 3)
my_mat3
     [,1] [,2] [,3]
[1,]    7   10   13
[2,]    8   11   14
[3,]    9   12   15
my_mat3 + my_mat
Error in my_mat3 + my_mat: non-conformable arrays
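
One way to avoid this error is to compare the dimensions first. The sketch below uses hypothetical matrices and tryCatch() to capture the error message instead of stopping the script.

```r
m1 <- matrix(1:6, nrow = 3)   # 3 by 2
m2 <- matrix(7:15, nrow = 3)  # 3 by 3

identical(dim(m1), dim(m2))   # FALSE, so m1 + m2 would fail

# Capture the error message instead of stopping the script
msg <- tryCatch(m1 + m2, error = function(e) conditionMessage(e))
is.character(msg)             # TRUE: an error was caught
```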

5.2.2.2 Naming Rows and Columns

A matrix is two-dimensional. In the results printed above, labels such as [1,] and [,1] mark the rows and columns respectively. These are the row and column indices of the matrix, and they can be replaced with names of our choosing. The rownames() and colnames() functions set the row names and column names respectively.

# represent column names with values a and b
colnames(fruit_mat) <- letters[1:2] 

fruit_mat
     a           b             
[1,] "orange"    "strawberries"
[2,] "apple"     "tomato"      
[3,] "pineapple" "banana"      
[4,] "grape"     "orange"      
# represent the row names with uppercase A, B, C, and D
rownames(fruit_mat) <- LETTERS[1:4] 

fruit_mat
  a           b             
A "orange"    "strawberries"
B "apple"     "tomato"      
C "pineapple" "banana"      
D "grape"     "orange"      

5.2.3 Accessing Elements in a Matrix

As with vectors, you can access any particular element in a matrix using square brackets, []. Since a matrix is two-dimensional, we have to specify both the row and the column. The syntax is [row_index, column_index]. For example, let’s access the element in the third row and second column of fruit_mat.

fruit_mat
  a           b             
A "orange"    "strawberries"
B "apple"     "tomato"      
C "pineapple" "banana"      
D "grape"     "orange"      
fruit_mat[3, 2]
[1] "banana"

We can also access more than one element along an axis (rows or columns) by passing in a vector of positions.

fruit_mat[c(1, 3, 2), 1] # This returns row 1, 3, and 2 elements of column 1
          A           C           B 
   "orange" "pineapple"     "apple" 
fruit_mat[2, c(1, 2)]
       a        b 
 "apple" "tomato" 

You can also use : within an axis to access elements within a range

fruit_mat[2:4, c(1, 2)] # returns row 2, 3, and 4 element of column 1 and 2.
  a           b       
B "apple"     "tomato"
C "pineapple" "banana"
D "grape"     "orange"

To return all the rows of a particular column, leave the row position inside the square brackets empty and specify the index of the column you want to return.

fruit_mat[,1]
          A           B           C           D 
   "orange"     "apple" "pineapple"     "grape" 
fruit_mat[,2]
             A              B              C              D 
"strawberries"       "tomato"       "banana"       "orange" 

To return all the columns of a particular row, the column space is left empty.

fruit_mat[3:4, ] # returns element of row 3 and 4 for all the columns.
  a           b       
C "pineapple" "banana"
D "grape"     "orange"

You can also access an element using the row and column names.

fruit_mat["A", "a"]
[1] "orange"
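
Notice in the earlier outputs that selecting a single column returns a named vector rather than a matrix. The sketch below uses the drop argument of the [ operator, which is not covered in the text above, to keep the matrix shape. The matrix m is hypothetical.

```r
m <- matrix(1:6, nrow = 3,
            dimnames = list(c("A", "B", "C"), c("a", "b")))

col_vec <- m[, "a"]                # a named vector
col_mat <- m[, "a", drop = FALSE]  # still a matrix with one column

is.matrix(col_vec)  # FALSE
dim(col_mat)        # 3 1
```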

5.2.4 Adding New Element to a Matrix

To add new columns to a matrix, use cbind(), and to add rows, use rbind(). We’ll create a new matrix of fruits to show this.

new_fruit_list <- matrix(c("raspberry", "blue berries", 
                           "kiwi", "clementine"),
       nrow = 4,
       dimnames = list(letters[1:4], "C"))

new_fruit_list
  C             
a "raspberry"   
b "blue berries"
c "kiwi"        
d "clementine"  
fruit_list <- cbind(fruit_mat, new_fruit_list)
fruit_list
  a           b              C             
A "orange"    "strawberries" "raspberry"   
B "apple"     "tomato"       "blue berries"
C "pineapple" "banana"       "kiwi"        
D "grape"     "orange"       "clementine"  

Ensure the number of rows of the matrices to be joined matches when performing a column bind and the number of columns matches when performing a row bind.
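
The matching requirement can be seen in a small sketch. The matrices are hypothetical, and tryCatch() replaces the error with a short message so the script keeps running.

```r
two_rows   <- matrix(1:4, nrow = 2)
three_rows <- matrix(1:6, nrow = 3)

# Row counts differ, so the column bind fails
result <- tryCatch(
  cbind(two_rows, three_rows),
  error = function(e) "cannot bind: row counts differ"
)
result  # "cannot bind: row counts differ"

# Matching row counts bind without trouble
dim(cbind(two_rows, matrix(5:6, nrow = 2)))  # 2 3
```

The row bind below follows the same rule for columns: the new row has three values, one for each column of fruit_list.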

rbind(fruit_list, c("avocado", "pear", "lemon"))
  a           b              C             
A "orange"    "strawberries" "raspberry"   
B "apple"     "tomato"       "blue berries"
C "pineapple" "banana"       "kiwi"        
D "grape"     "orange"       "clementine"  
  "avocado"   "pear"         "lemon"       

5.2.5 Transposing Matrix

To transpose a matrix, use t(). This turns the rows into columns and the columns into rows.

fruit_mat_transposed <- t(fruit_mat)
fruit_mat_transposed
  A              B        C           D       
a "orange"       "apple"  "pineapple" "grape" 
b "strawberries" "tomato" "banana"    "orange"
dim(fruit_mat_transposed)
[1] 2 4
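
Put differently, the element at row i and column j of the original matrix ends up at row j and column i of the transpose. A quick check on a hypothetical numeric matrix:

```r
m  <- matrix(1:6, nrow = 2)  # hypothetical 2 by 3 matrix
tm <- t(m)

dim(tm)              # 3 2
m[1, 3] == tm[3, 1]  # TRUE: the same element, with row and column swapped
identical(t(tm), m)  # TRUE: transposing twice restores the original
```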

5.3 Data frames

5.3.1 What is a Data Frame

Data frames are two-dimensional like matrices and are the standard way to store tabular data in R. They are similar to Excel spreadsheets. What sets them apart from matrices is that each column can hold a different data type. To create a data frame, use the data.frame() function. Below is a simple data frame that stores the fruit inventory of a store.

fruit_inventory <- data.frame(
  type = c("pineapple", "mango", "apple"), # character data type
  stock = c(5, 3, 0), # double data type
  available = c(TRUE, TRUE, FALSE) # logical data type
)

fruit_inventory
       type stock available
1 pineapple     5      TRUE
2     mango     3      TRUE
3     apple     0     FALSE

Data frames can also combine existing vectors into a table, where each vector becomes a column. The vectors should have the same length. Otherwise, data.frame() stops with an error unless the shorter vector can be recycled a whole number of times.

fruit <- c("orange", "mango", "apple")
stock <- c(5, 3, 0)
available <- c(TRUE, TRUE, FALSE)

fruit_tbl <- data.frame(fruit, stock, available)
fruit_tbl
   fruit stock available
1 orange     5      TRUE
2  mango     3      TRUE
3  apple     0     FALSE
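
To see the column types and the length requirement in action, consider the sketch below. The data frame prices is hypothetical, and stringsAsFactors = FALSE is written out explicitly even though it has been the default since R 4.0.0.

```r
prices <- data.frame(
  item  = c("pen", "book", "bag"),
  price = c(1.5, 12, 30),
  stringsAsFactors = FALSE
)

str(prices)            # compact summary: 3 observations of 2 variables
sapply(prices, class)  # item is "character", price is "numeric"

# Three items but only two prices: data.frame() raises an error
bad <- tryCatch(
  data.frame(item = c("pen", "book", "bag"), price = c(1.5, 12)),
  error = function(e) "lengths differ"
)
bad                    # "lengths differ"
```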

5.3.2 Accessing and Modifying Elements in a Data Frame

Elements in a data frame are accessed in much the same way as in a matrix. Data frames also support the dollar sign, $, to access columns by name.

fruit_tbl$stock
[1] 5 3 0
fruit_tbl$available
[1]  TRUE  TRUE FALSE

A column accessed this way is a vector, so you can apply any vector operation to it. For example, we can check the total stock available.

sum(fruit_tbl$stock)
[1] 8
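
Matrix-style indexing with [row, column] also works on data frames. The sketch below uses a hypothetical data frame, stock_tbl, to show a few common patterns.

```r
stock_tbl <- data.frame(
  fruit = c("orange", "mango", "apple"),
  stock = c(5, 3, 0),
  stringsAsFactors = FALSE
)

stock_tbl[2, "stock"]                         # 3: row 2 of the stock column
in_stock <- stock_tbl[stock_tbl$stock > 0, ]  # keep rows with stock left
in_stock$fruit                                # "orange" "mango"
stock_tbl[["fruit"]]                          # same result as stock_tbl$fruit
```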

Using square brackets, we can remove or add columns in a data frame. A negative index drops the column at that position, and assigning values to a new column name adds a column.

# Removing a column (fruit_tbl itself is unchanged unless the result is reassigned)
fruit_tbl[-1]
  stock available
1     5      TRUE
2     3      TRUE
3     0     FALSE
# Adding a column
fruit_tbl["location"] <- c("online", "on site", "online")
fruit_tbl
   fruit stock available location
1 orange     5      TRUE   online
2  mango     3      TRUE  on site
3  apple     0     FALSE   online
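
To make a removal permanent, reassign the result, as with vectors. The sketch below uses a hypothetical data frame and also shows two alternatives not used above: adding a column with $ and dropping a column by assigning NULL to it.

```r
tbl <- data.frame(id = 1:3, score = c(10, 20, 30))  # hypothetical data

tbl$passed <- tbl$score >= 20  # add a column with $
tbl <- tbl[-1]                 # drop the first column and keep the change
names(tbl)                     # "score" "passed"

tbl$passed <- NULL             # drop a column by name
names(tbl)                     # "score"
```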

5.4 List

5.4.1 What is a List

A list is a versatile data structure that can store elements of different types and sizes, including other data structures like vectors, matrices, data frames, and even other lists. Think of it like a container that can hold different kinds of boxes.

# Basic list creation
my_list <- list(
  fruits,
  fruit_mat,
  c(TRUE, FALSE),
  FALSE, 
  fruit_mat_transposed,
  fruit_tbl
)

my_list
[[1]]
[1] "orange"       "apple"        "pineapple"    "grape"        "strawberries"
[6] "tomato"       "banana"      

[[2]]
  a           b             
A "orange"    "strawberries"
B "apple"     "tomato"      
C "pineapple" "banana"      
D "grape"     "orange"      

[[3]]
[1]  TRUE FALSE

[[4]]
[1] FALSE

[[5]]
  A              B        C           D       
a "orange"       "apple"  "pineapple" "grape" 
b "strawberries" "tomato" "banana"    "orange"

[[6]]
   fruit stock available location
1 orange     5      TRUE   online
2  mango     3      TRUE  on site
3  apple     0     FALSE   online

5.4.2 Naming Objects in a List

The objects in a list can be given names using the names() function.

names(my_list) <- c("fruits", "fruit_mat", "logical_1", "logical_2",
                    "fruit_transposed_matrix", "fruit_dataframe")

When we print the list now, each object appears under its name.

my_list
$fruits
[1] "orange"       "apple"        "pineapple"    "grape"        "strawberries"
[6] "tomato"       "banana"      

$fruit_mat
  a           b             
A "orange"    "strawberries"
B "apple"     "tomato"      
C "pineapple" "banana"      
D "grape"     "orange"      

$logical_1
[1]  TRUE FALSE

$logical_2
[1] FALSE

$fruit_transposed_matrix
  A              B        C           D       
a "orange"       "apple"  "pineapple" "grape" 
b "strawberries" "tomato" "banana"    "orange"

$fruit_dataframe
   fruit stock available location
1 orange     5      TRUE   online
2  mango     3      TRUE  on site
3  apple     0     FALSE   online
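
Names can also be supplied when the list is created, which avoids the separate names() step. A minimal sketch with a hypothetical list:

```r
shop <- list(
  items = c("pen", "book"),
  open  = TRUE,
  sizes = matrix(1:4, nrow = 2)
)

names(shop)   # "items" "open" "sizes"
length(shop)  # 3
```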

5.4.3 Accessing Elements in a List

Single square brackets, [], select elements of a list and return them as a smaller list. To extract the object stored in an element, use double square brackets, [[]]. As with data frames, you can also use the dollar sign, $, followed by the element’s name.

my_list[1]
$fruits
[1] "orange"       "apple"        "pineapple"    "grape"        "strawberries"
[6] "tomato"       "banana"      

Notice the difference when we use [[]]: the name line is gone because we get the object itself rather than a list containing it. The result is the same as using $ to access the object by name.

my_list[[1]]
[1] "orange"       "apple"        "pineapple"    "grape"        "strawberries"
[6] "tomato"       "banana"      
my_list$fruits # same result as my_list[[1]]
[1] "orange"       "apple"        "pineapple"    "grape"        "strawberries"
[6] "tomato"       "banana"      

To access an item inside an extracted object, add another pair of square brackets after it.

my_list[[1]][3]
[1] "pineapple"
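
The difference between [] and [[]] shows up clearly when checking the class of each result. The list shop below is hypothetical.

```r
shop <- list(items = c("pen", "book"), open = TRUE)

class(shop[1])      # "list": a smaller list holding one element
class(shop[[1]])    # "character": the vector itself

shop[["items"]][2]  # "book"
shop$items[2]       # "book"
```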

5.5 Summary

In this chapter, you learned about data structures in R. You saw how to create each data structure, how to access its elements, and how to add and remove items. You were also introduced to checking the properties of each data structure, such as its length and dimensions. Next, you will learn about packages in R and how to install them. After that, you will bring together the knowledge you’ve gathered since chapter one to create R scripts and R projects.