5  Data Structure

This section is still under development

In Chapter 3, we learned about different data types in R such as double, integer, character, and logical. Just as we organize physical items and our personal space, R needs special containers to organize and store different types of data. These containers are called data structures.

Each data structure in R has specific characteristics:

Before exploring each data structure, we can use these functions to inspect any container:
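As a minimal sketch, assuming the functions meant here are the usual base R inspectors class(), typeof(), length(), dim(), and str(), we can try them on any object. The object names scores and grid are made up for illustration.

```r
# Hypothetical objects used only for illustration
scores <- c(4.5, 3.2, 5.0)
grid <- matrix(1:6, nrow = 2)

class(scores)   # the kind of object: "numeric"
typeof(scores)  # the underlying data type: "double"
length(scores)  # number of elements: 3
dim(scores)     # NULL, because a plain vector has no dimensions
dim(grid)       # rows and columns: 2 3
str(grid)       # compact summary of type, dimensions, and first values
```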

5.1 Atomic Vectors

5.1.1 What is an Atomic Vector?

An atomic vector is the simplest data structure in R (Grolemund 2014). Think of it as a one-dimensional container that can hold only one type of data. A single value of any of the data types from Chapter 3 is itself an atomic vector of length one.

# Single Vector
fruit <- "papaya"

fruit
[1] "papaya"

We can make longer vectors by combining values with c() (Wickham 2019).

# Multivalue vector
fruits <- c("papaya", "orange", "apple", "pineapple", "grape",
            "strawberries", "avocado")

fruits
[1] "papaya"       "orange"       "apple"        "pineapple"    "grape"       
[6] "strawberries" "avocado"     

To get the number of elements in an object, use length().

length(fruit) # shows the number of elements in fruit
[1] 1
length(fruits) # shows the number of elements in fruits
[1] 7

Later in this section, we will look at operations such as slicing and subsetting that can be performed on vectors.

Atomic vectors can only hold one data type at a time. In the example below, we create a new object of double data type, then combine it with the fruits object we created earlier. When we check the type of the new object, new_data, only one data type is returned: R has converted the numbers to character so that every element shares the same type.

set.seed(123)
diameter <- round(rnorm(10, mean = 23.5, sd = 0.6), 2)
typeof(diameter)
[1] "double"
new_data <- c(fruits, diameter)
new_data
 [1] "papaya"       "orange"       "apple"        "pineapple"    "grape"       
 [6] "strawberries" "avocado"      "23.16"        "23.36"        "24.44"       
[11] "23.54"        "23.58"        "24.53"        "23.78"        "22.74"       
[16] "23.09"        "23.23"       
typeof(new_data)
[1] "character"
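The conversion seen above follows a fixed order in R: logical gives way to integer, integer to double, and double to character. The short sketch below shows that order with made-up values. The object name mixed is hypothetical.

```r
# Mixing types inside c() forces every element to the most flexible type
typeof(c(TRUE, 2L))        # logical + integer  -> "integer"
typeof(c(2L, 3.5))         # integer + double   -> "double"
typeof(c(3.5, "papaya"))   # double + character -> "character"

# Converting back only works for text that looks like a number
mixed <- c("23.16", "papaya")
as.numeric(mixed)          # 23.16 and NA, with a coercion warning
```

This is why it helps to check typeof() after combining vectors from different sources.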

5.1.2 Vector Operations

5.1.2.1 Basic Operation

Arithmetic such as addition, subtraction, multiplication, division, and modulo can be performed on vectors.

# Arithmetic Operation
x <- 15
y <- 20
y + x
[1] 35
y %% x
[1] 5
y / x
[1] 1.333333

When two vectors have different lengths, R still operates on them element by element. It does this by repeating the elements of the shorter vector, starting again from its first element, until it matches the length of the longer one. This behaviour is called vector recycling. If the length of the longer vector is a multiple of the length of the shorter one, recycling happens silently. If it is not, R still completes the operation but issues a warning that the longer object length is not a multiple of the shorter object length.

# Vector recycling
a <- 1:5 # Creates vector of 1, 2, 3, 4, 5
b <- 2
c <- 7:9 # Creates vector of 7, 8, 9

a + b
[1] 3 4 5 6 7
a / b
[1] 0.5 1.0 1.5 2.0 2.5
a / c
Warning in a/c: longer object length is not a multiple of shorter object length
[1] 0.1428571 0.2500000 0.3333333 0.5714286 0.6250000

When operating on vectors of different lengths, R will “recycle” the shorter vector’s values to match the longer vector’s length. This can be useful but also dangerous if not used carefully!
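One way to stay safe is to check the lengths before relying on recycling, or to spell the recycling out with rep_len(). The sketch below uses made-up vectors named price and discount.

```r
# Hypothetical prices and a shorter vector of discounts
price <- c(10, 20, 30, 40)
discount <- c(1, 2)

# 4 is a multiple of 2, so discount is silently reused as 1, 2, 1, 2
price - discount                          # 9 18 29 38

# A simple guard: is the longer length a multiple of the shorter one?
length(price) %% length(discount) == 0    # TRUE

# rep_len() makes the recycling explicit and easy to read
price - rep_len(discount, length(price))  # 9 18 29 38
```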

5.1.3 Accessing Vector Elements

To access a value in a vector, we use its index. Indexing allows us to access or modify specific elements in different data structures. We use square brackets, [], to give the position of the elements we want. To get the first element in the fruits vector, papaya, we specify its index position, 1, within the square brackets.

Table 5.1: Index number of fruits vector

Vector        papaya  orange  apple  pineapple  grape  strawberries  avocado
Index number  1       2       3      4          5      6             7

fruits
[1] "papaya"       "orange"       "apple"        "pineapple"    "grape"       
[6] "strawberries" "avocado"     
fruits[1]
[1] "papaya"

To return multiple elements, we can put their index positions within c() and pass it to the square brackets. You can return a range of elements using :.

# select elements by specific position
fruits[c(7, 5, 2)] 
[1] "avocado" "grape"   "orange" 
# select the range of elements from position 2 to position 5
fruits[2:5] 
[1] "orange"    "apple"     "pineapple" "grape"    
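A few more indexing behaviours are worth trying out. The sketch below uses a made-up vector named veg so that the fruits vector stays untouched.

```r
veg <- c("carrot", "okra", "yam", "leek")   # hypothetical vector

veg[c(1, 1, 3)]    # positions can repeat: "carrot" "carrot" "yam"
veg[length(veg)]   # last element without hard-coding its position: "leek"
veg[6]             # a position past the end gives NA
veg                # indexing alone never changes the original vector
```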

5.1.4 Modifying Vectors

There are two ways to add new elements to a vector:

  • append() function
  • The square bracket []

Using [] offers more flexibility, as it can be used to add, change, and remove the elements of a vector.

5.1.4.0.1 The append() Function

We can use append() to add elements to a vector. It takes three arguments: x, the vector that you want to add a value to; values, the new value or values you want to add; and after, the index position after which the new values are inserted.

length(fruits) # number of fruits before adding the new element
[1] 7
fruits <- append(x = fruits, values = "banana", after = 5) # add a new element after the fifth element

length(fruits) # The number of elements has increased by one.
[1] 8
fruits # The updated vector
[1] "papaya"       "orange"       "apple"        "pineapple"    "grape"       
[6] "banana"       "strawberries" "avocado"     
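The after argument accepts any position, including 0. The sketch below uses a made-up vector named letters_vec to show two placements.

```r
letters_vec <- c("b", "c", "e")   # hypothetical vector

append(letters_vec, "a", after = 0)   # "a" "b" "c" "e" (placed first)
append(letters_vec, "d", after = 2)   # "b" "c" "d" "e" (after the 2nd element)

letters_vec   # unchanged, because the results above were not reassigned
```

As with fruits above, the result has to be assigned back to the variable for the change to stick.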

If the after argument is omitted, the new values are automatically placed at the end of the vector.

new_fruits <- c("kiwi", "grape", "cherry", "mango", "peach")

extended_fruits <- append(fruits, new_fruits)
extended_fruits
 [1] "papaya"       "orange"       "apple"        "pineapple"    "grape"       
 [6] "banana"       "strawberries" "avocado"      "kiwi"         "grape"       
[11] "cherry"       "mango"        "peach"       

5.1.4.0.2 Using the Square Bracket, []

To add a new element, place the index position it should occupy in the square brackets and assign the value to it.

fruit 
[1] "papaya"
fruit[2] <- "mango" # add new element to a new index number
fruit
[1] "papaya" "mango" 

If you use an index that is more than one position past the end of the vector, the gap is filled with NA.

fruit[5] <- "Clementine"
fruit
[1] "papaya"     "mango"      NA           NA           "Clementine"

To prevent this, you can use a simple trick: pass length(fruit) + 1 into the square brackets when you want to add the new value.

fruit[length(fruit) + 1] <- "Guava"
fruit
[1] "papaya"     "mango"      NA           NA           "Clementine"
[6] "Guava"     

Square brackets are also used to modify the elements of an object using their index position.

fruit[3] # Position to be changed
[1] NA
fruit[3] <- "tomato" # Replace NA with tomato
fruit # new elements
[1] "papaya"     "mango"      "tomato"     NA           "Clementine"
[6] "Guava"     

You can also remove elements of a vector using the square brackets. You do this by passing a negative index.

fruit[-4] # Show the values left after removing the fourth element.
[1] "papaya"     "mango"      "tomato"     "Clementine" "Guava"     
fruit <- fruit[-1] # Reassign to the variable to keep a removal; this one drops the first element.

To check an object's data structure, use the matching is.*() function: is.vector() for vectors, is.matrix() for matrices, and so on. To convert from one structure to another, use the matching as.*() function: as.vector() for vectors, as.matrix() for matrices, and so on.
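The checks and conversions described above can be sketched as follows. The objects v, m, and back are made up for illustration.

```r
v <- 1:4                    # hypothetical integer vector
m <- matrix(v, nrow = 2)    # the same values arranged as a 2 x 2 matrix

is.vector(v)    # TRUE
is.matrix(v)    # FALSE
is.matrix(m)    # TRUE

back <- as.vector(m)        # flattens the matrix column by column
back                        # 1 2 3 4
is.matrix(back)             # FALSE
```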

5.2 Matrices

5.2.1 What is a Matrix?

A matrix is a two-dimensional data structure that holds elements of the same type arranged in rows and columns. Think of it as a table with a uniform data type. To create a matrix, use the matrix() function. Within the function, specify the number of rows, nrow, or the number of columns, ncol.

mat <- matrix(1:6, nrow = 2)
mat
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

The numbers in the matrix are filled in by column. To fill by row instead, set byrow to TRUE.

matrix(1:6, nrow = 2, byrow = TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

We can confirm that the object is a matrix by using is.matrix().

is.matrix(mat) # confirm object is a matrix
[1] TRUE

You can make a matrix from a vector by passing it to the matrix() function.

fruit_mat <- matrix(fruits, nrow = 4) # make matrix from fruits vector
fruit_mat
     [,1]        [,2]          
[1,] "papaya"    "grape"       
[2,] "orange"    "banana"      
[3,] "apple"     "strawberries"
[4,] "pineapple" "avocado"     

To get the dimensions of a matrix, use the dim() function.

dim(fruit_mat)
[1] 4 2

The result [1] 4 2 is read as 4 by 2, i.e. four rows and two columns. The matrix is therefore called a four-by-two matrix. Check Figure 5.1 to see an example.

Figure 5.1: matrix indexing, source: https://biocore.crg.eu

For example, Figure 5.1 shows indexing for a simple 3 * 2 matrix.
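Besides dim(), base R also has nrow() and ncol() to get each dimension on its own. The sketch below builds a made-up matrix named grid from ncol alone to show that R works out the other dimension.

```r
grid <- matrix(1:12, ncol = 4)   # only ncol is given; R works out 3 rows

dim(grid)      # 3 4
nrow(grid)     # 3
ncol(grid)     # 4
length(grid)   # 12, the total number of elements
```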

5.2.2 Matrix Operations

5.2.2.1 Basic Operations

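Like vectors, matrices support arithmetic operations. When a matrix is combined with a single number (a vector of length one), the operation is carried out on each element of the matrix. When two matrices have the same dimensions, the operation pairs up the elements in the same position.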

my_mat <- matrix(1:6, nrow = 3)
my_mat
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
my_mat * 5
     [,1] [,2]
[1,]    5   20
[2,]   10   25
[3,]   15   30
my_mat - 15
     [,1] [,2]
[1,]  -14  -11
[2,]  -13  -10
[3,]  -12   -9
my_mat2 <- matrix(2:7, nrow = 3)
my_mat2 / my_mat
         [,1]     [,2]
[1,] 2.000000 1.250000
[2,] 1.500000 1.200000
[3,] 1.333333 1.166667

Matrices with unequal dimensions cannot be added together.

my_mat3 <- matrix(7:15, nrow = 3)
my_mat3
     [,1] [,2] [,3]
[1,]    7   10   13
[2,]    8   11   14
[3,]    9   12   15
my_mat3 + my_mat
Error in my_mat3 + my_mat: non-conformable arrays
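To avoid stopping a script on this error, one option is to compare the dimensions first or to catch the error with tryCatch(). This is only a sketch. The matrices a and b are made up and mirror the shapes used above.

```r
a <- matrix(1:6, nrow = 3)    # 3 x 2
b <- matrix(7:15, nrow = 3)   # 3 x 3

# Compare the shapes before attempting element-wise arithmetic
identical(dim(a), dim(b))     # FALSE

# tryCatch() captures the error message instead of stopping the script
result <- tryCatch(a + b, error = function(e) conditionMessage(e))
result                        # the message mentions non-conformable arrays
```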

5.2.2.2 Naming Rows and Columns

A matrix is two-dimensional. In the results printed above, [,1] and [1,] label the columns and rows respectively. These are the column and row indices of a matrix, and they can be replaced with names of our choosing using the colnames() and rownames() functions.

# represent column names with values a and b
colnames(fruit_mat) <- letters[1:2] 

fruit_mat
     a           b             
[1,] "papaya"    "grape"       
[2,] "orange"    "banana"      
[3,] "apple"     "strawberries"
[4,] "pineapple" "avocado"     
# represent the row names with uppercase A, B, C, and D
rownames(fruit_mat) <- LETTERS[1:4] 

fruit_mat
  a           b             
A "papaya"    "grape"       
B "orange"    "banana"      
C "apple"     "strawberries"
D "pineapple" "avocado"     

5.2.3 Accessing Elements in a Matrix

As with vectors, you can access any particular element in a matrix using square brackets, []. Since a matrix is two-dimensional, we have to specify both rows and columns. The syntax is [row_index, column_index]. For example, let’s access the element in the third row and second column of fruit_mat.

fruit_mat
  a           b             
A "papaya"    "grape"       
B "orange"    "banana"      
C "apple"     "strawberries"
D "pineapple" "avocado"     
fruit_mat[3, 2]
[1] "strawberries"

We can also access more than one element along an axis (row or column) by passing in a vector.

fruit_mat[c(1, 3, 2), 1] # This returns row 1, 3, and 2 elements of column 1
       A        C        B 
"papaya"  "apple" "orange" 
fruit_mat[2, c(1, 2)]
       a        b 
"orange" "banana" 

You can also use : within an axis to access elements within a range.

fruit_mat[2:4, c(1, 2)] # returns row 2, 3, and 4 element of column 1 and 2.
  a           b             
B "orange"    "banana"      
C "apple"     "strawberries"
D "pineapple" "avocado"     

To return all the rows of a particular column, leave the row position within the square brackets empty, and specify the index of the column you want to return.

fruit_mat[,1]
          A           B           C           D 
   "papaya"    "orange"     "apple" "pineapple" 
fruit_mat[,2]
             A              B              C              D 
       "grape"       "banana" "strawberries"      "avocado" 

To return all the columns of a particular row, leave the column position empty.

fruit_mat[3:4, ] # returns element of row 3 and 4 for all the columns.
  a           b             
C "apple"     "strawberries"
D "pineapple" "avocado"     

You can also access an element using the row and column names.

fruit_mat["A", "a"]
[1] "papaya"
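Notice that selecting a single row or column returns a vector, not a matrix. If we want to keep the matrix shape, the drop argument of [] can be set to FALSE. The sketch below uses a made-up matrix m with hypothetical row and column names.

```r
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("x", "y", "z")))

col_vec <- m[, "y"]                 # simplified to a named vector
col_mat <- m[, "y", drop = FALSE]   # stays a 2 x 1 matrix

is.matrix(col_vec)     # FALSE
dim(col_mat)           # 2 1
m["r2", c("x", "z")]   # names work for several columns too: x = 2, z = 6
```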

5.2.4 Adding New Element to a Matrix

To add new columns to a matrix, use cbind(), and to add rows, use rbind(). We’ll create a new matrix of fruits to show this.

new_fruit_list <- matrix(
  c("raspberry", "blue berries", "kiwi", "clementine"),
  nrow = 4,
  dimnames = list(letters[1:4], "C")
)

new_fruit_list
  C             
a "raspberry"   
b "blue berries"
c "kiwi"        
d "clementine"  
fruit_list <- cbind(fruit_mat, new_fruit_list)
fruit_list
  a           b              C             
A "papaya"    "grape"        "raspberry"   
B "orange"    "banana"       "blue berries"
C "apple"     "strawberries" "kiwi"        
D "pineapple" "avocado"      "clementine"  

Ensure the number of rows of the matrices to be joined matches when performing a column bind and the number of columns matches when performing a row bind.

rbind(fruit_list, c("avocado", "pear", "lemon"))
  a           b              C             
A "papaya"    "grape"        "raspberry"   
B "orange"    "banana"       "blue berries"
C "apple"     "strawberries" "kiwi"        
D "pineapple" "avocado"      "clementine"  
  "avocado"   "pear"         "lemon"       
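The new row above has no row name. One way to label it, sketched below with a made-up matrix m, is to name the argument passed to rbind().

```r
m <- matrix(c("a1", "a2", "b1", "b2"), nrow = 2,
            dimnames = list(c("A", "B"), c("x", "y")))

# Naming the argument gives the new row a label
m2 <- rbind(m, C = c("c1", "c2"))
rownames(m2)   # "A" "B" "C"
dim(m2)        # 3 2

# Checking the length first avoids surprises from a wrong-length row
length(c("c1", "c2")) == ncol(m)   # TRUE
```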

5.2.5 Transposing Matrix

A matrix is transposed using t(). This flips rows into columns and columns into rows.

fruit_mat_transposed <- t(fruit_mat)
fruit_mat_transposed
  A        B        C              D          
a "papaya" "orange" "apple"        "pineapple"
b "grape"  "banana" "strawberries" "avocado"  
dim(fruit_mat_transposed)
[1] 2 4
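A quick way to convince ourselves of what t() does is to compare positions before and after. The sketch below uses a small made-up numeric matrix m.

```r
m <- matrix(1:6, nrow = 2)   # 2 x 3
tm <- t(m)                   # 3 x 2

m[1, 3]               # 5
tm[3, 1]              # 5, the same value with row and column swapped
identical(t(tm), m)   # TRUE, transposing twice restores the original
```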

5.3 Data frames

5.3.1 What is a Data Frame?

Data frames are two-dimensional like matrices and are the standard way to store tabular data in R. They are similar to Excel spreadsheets. What sets them apart from matrices is that each column can hold a different data type. To create a data frame, use the data.frame() function. Below is a simple data frame that stores the fruit inventory of a store.

fruit_inventory <- data.frame(
  type = c("pineapple", "mango", "apple"), # character data type
  stock = c(5, 3, 0), # double data type
  available = c(TRUE, TRUE, FALSE) # logical data type
)

fruit_inventory
       type stock available
1 pineapple     5      TRUE
2     mango     3      TRUE
3     apple     0     FALSE

Data frames can combine vectors into a table where each vector becomes a column. The vectors should have the same length to avoid an error:

fruit <- c("orange", "mango", "apple")
stock <- c(5, 3, 0)
available <- c(TRUE, TRUE, FALSE)

fruit_tbl <- data.frame(fruit, stock, available)
fruit_tbl
   fruit stock available
1 orange     5      TRUE
2  mango     3      TRUE
3  apple     0     FALSE
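To see what happens when the lengths do not match, we can catch the error with tryCatch(). This is a sketch with made-up vectors named item and qty.

```r
item <- c("kiwi", "pear", "plum")   # 3 values
qty <- c(4, 9)                      # only 2 values

msg <- tryCatch(
  data.frame(item, qty),
  error = function(e) conditionMessage(e)
)
msg   # the error message reports the differing number of rows

# Comparing lengths first is a cheap safeguard
length(item) == length(qty)   # FALSE
```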

5.3.2 Accessing and Modifying Elements in a Data Frame

Elements in a data frame are accessed in a similar fashion to a matrix. Data frames also use the dollar sign, $, to access columns by name.

fruit_tbl$stock
[1] 5 3 0
fruit_tbl$available
[1]  TRUE  TRUE FALSE

You can apply any operation you want to a column this way. For example, we can check the total stock available.

sum(fruit_tbl$stock)
[1] 8
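The matrix-style access mentioned at the start of this section can be sketched as follows. The data frame inventory and its columns are made up for illustration.

```r
inventory <- data.frame(
  item = c("kiwi", "pear", "plum"),
  qty = c(4, 0, 9)
)

inventory[2, "qty"]              # one cell: 0
inventory[, "item"]              # one column as a vector
inventory[inventory$qty > 0, ]   # only the rows with something in stock
inventory[["qty"]]               # same result as inventory$qty
```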

Using square brackets, we can remove columns from or add new columns to a data frame. A negative index number drops the column at that position, and assigning to a new column name adds a column.

# Removing a column (the result is only printed; fruit_tbl itself is unchanged)
fruit_tbl[-1]
  stock available
1     5      TRUE
2     3      TRUE
3     0     FALSE
# Adding a column
fruit_tbl["location"] <- c("online", "on site", "online")
fruit_tbl
   fruit stock available location
1 orange     5      TRUE   online
2  mango     3      TRUE  on site
3  apple     0     FALSE   online
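To make a removal permanent, the result has to be assigned back to the variable. Assigning NULL to a column is another common way to drop it. The sketch below uses a made-up data frame named inventory.

```r
inventory <- data.frame(
  item = c("kiwi", "pear", "plum"),
  qty = c(4, 0, 9),
  aisle = c(1, 1, 2)
)

inventory <- inventory[-3]   # reassign to drop the third column for good
names(inventory)             # "item" "qty"

inventory$qty <- NULL        # assigning NULL also removes a column
names(inventory)             # "item"

inventory$in_stock <- c(TRUE, FALSE, TRUE)   # $ can add a column as well
names(inventory)             # "item" "in_stock"
```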

5.4 List

5.4.1 What is a List?

A list is a versatile data structure that can store elements of different types and sizes, including other data structures like vectors, matrices, data frames, and even other lists. Think of it like a container that can hold different kinds of boxes.

# Basic list creation
my_list <- list(
  fruits,
  fruit_mat,
  c(TRUE, FALSE),
  FALSE, 
  fruit_mat_transposed,
  fruit_tbl
)

my_list
[[1]]
[1] "papaya"       "orange"       "apple"        "pineapple"    "grape"       
[6] "banana"       "strawberries" "avocado"     

[[2]]
  a           b             
A "papaya"    "grape"       
B "orange"    "banana"      
C "apple"     "strawberries"
D "pineapple" "avocado"     

[[3]]
[1]  TRUE FALSE

[[4]]
[1] FALSE

[[5]]
  A        B        C              D          
a "papaya" "orange" "apple"        "pineapple"
b "grape"  "banana" "strawberries" "avocado"  

[[6]]
   fruit stock available location
1 orange     5      TRUE   online
2  mango     3      TRUE  on site
3  apple     0     FALSE   online

5.4.2 Naming Objects in a List

The objects in a list can be given names using the names() function.

names(my_list) <- c("fruits", "fruit_mat", "logical_1", "logical_2",
                    "fruit_transposed_matrix", "fruit_dataframe")

When we print the list now, each object is shown with its name.

my_list
$fruits
[1] "papaya"       "orange"       "apple"        "pineapple"    "grape"       
[6] "banana"       "strawberries" "avocado"     

$fruit_mat
  a           b             
A "papaya"    "grape"       
B "orange"    "banana"      
C "apple"     "strawberries"
D "pineapple" "avocado"     

$logical_1
[1]  TRUE FALSE

$logical_2
[1] FALSE

$fruit_transposed_matrix
  A        B        C              D          
a "papaya" "orange" "apple"        "pineapple"
b "grape"  "banana" "strawberries" "avocado"  

$fruit_dataframe
   fruit stock available location
1 orange     5      TRUE   online
2  mango     3      TRUE  on site
3  apple     0     FALSE   online
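Names can also be given when the list is created, which saves a separate names() call. The sketch below uses a made-up list named basket.

```r
basket <- list(
  items = c("kiwi", "pear"),
  counts = matrix(1:4, nrow = 2),
  fresh = TRUE
)

names(basket)    # "items" "counts" "fresh"
length(basket)   # 3 top-level elements, whatever their individual sizes
str(basket)      # compact overview of what each element holds
```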

5.4.3 Accessing Elements in a List

Single square brackets, [], select elements of a list and return them as a smaller list. To extract the object stored inside an element, use double square brackets, [[]]. As with data frames, you can also use the dollar sign, $, with the element's name.

my_list[1]
$fruits
[1] "papaya"       "orange"       "apple"        "pineapple"    "grape"       
[6] "banana"       "strawberries" "avocado"     

Notice the difference when we use [[]]. The result is the same as using $ to access the object within the list.

my_list[[1]]
[1] "papaya"       "orange"       "apple"        "pineapple"    "grape"       
[6] "banana"       "strawberries" "avocado"     
my_list$fruits # same result as my_list[[1]]
[1] "papaya"       "orange"       "apple"        "pineapple"    "grape"       
[6] "banana"       "strawberries" "avocado"     

To access an item inside an extracted object, add another square bracket after it.

my_list[[1]][3]
[1] "apple"
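The difference between [] and [[]] is easiest to see by checking the class of what each returns. The sketch below uses a small made-up list named basket.

```r
basket <- list(items = c("kiwi", "pear"), fresh = TRUE)

class(basket[1])       # "list": single brackets keep the list wrapper
class(basket[[1]])     # "character": double brackets extract the element
basket[["items"]][2]   # "pear"
basket$items[2]        # "pear"

basket$price <- 2.5    # $ can also add a new element
names(basket)          # "items" "fresh" "price"
```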

5.5 Summary

In this chapter, you learned about data structures in R. You saw how to create each data structure and how to access its elements. You also learned how to add and remove items, and how to check properties such as length and dimensions. Next, you will learn about packages in R and how to install them. After that, you will bring together the knowledge you’ve gathered since chapter one to make R scripts and R projects.