3  Data Types in R

One of the most common sources of error when using R, or when performing analysis in any software, is wrong, inappropriate or inconsistent data types. Understanding data types in R is fundamental to effective data analysis and manipulation. Some operations and analyses are only possible with specific types of data, and getting the type wrong gives unexpected results. Researchers in every field are analysts working with various forms of data, such as survey responses, secondary data, and field or lab measurements. These data are usually a combination of numbers, such as measurements; words or text, such as survey participants’ responses; and rankings or categories, such as yes/no responses, income level or marital status. Before proceeding with any analysis, it is crucial to understand the types of data we are working with. In the research process, as seen in Figure 3.1, data collection and analysis are key stages.

Each type of data has its own structure and requirements, which influence the methods used to analyze it. Understanding the distinctions between data types ensures that we can manipulate, analyze and interpret our data effectively. Here, we will explore the data types in R.

Figure 3.1: Stages of Research

This chapter covers six data types in R, five of which will be used frequently in this book. The functions typeof() and class() are used to check the data type of a variable. The data types are:

3.1 Character

Characters represent text and are written in R by enclosing the content in quotation marks. The quotation marks can be either single (') or double ("), and anything can be the content of a string as long as matching quotation marks enclose it. Double quotation marks are more common than single ones.

"tree"
[1] "tree"
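As a small illustration of the quoting rule, the sketch below uses made-up strings (the variable names are only for this example) to show that both quote styles create the same character value, and that one style can enclose the other.

```r
# Single and double quotes create identical character values
single_quoted <- 'tree'
double_quoted <- "tree"
identical(single_quoted, double_quoted)  # TRUE

# One quote style can be placed inside the other
local_name <- "the farmer's oak"
class(local_name)                        # "character"
```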

We can check the data type of the text “tree” with either class() or typeof().

typeof("tree")
[1] "character"
class("tree")
[1] "character"
class("2")
[1] "character"
class("+")
[1] "character"

When we assign characters to a variable and pass the variable to a function, it is the content of the variable that is evaluated, not the variable name itself. As explained in Chapter 2.4, variables are references to values and not values in themselves.

tree <- "Quercus robur"
class(tree)
[1] "character"

In the example below we get an error because no object named oak has been created. So oak and “oak” are different: the former is the name of an object, while the latter is a character value.

class(oak)
Error: object 'oak' not found

Besides class() and typeof(), there is another way to check whether an object is a character. The is.character() function returns TRUE or FALSE depending on whether the object is of the character data type.

is.character("moabi")
[1] TRUE
is.character(2)
[1] FALSE
is.character("2")
[1] TRUE

Do not be surprised if you import some data into R and find that numbers are stored as characters. Always confirm the data types using the is.*() functions you will see in this chapter.
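To make this concrete, here is a minimal sketch with a hypothetical vector, plot_area, standing in for a column as it might arrive from an imported file. The values look like numbers but are stored as text.

```r
# Hypothetical values that look numeric but are stored as text
plot_area <- c("12.5", "14.0", "9.8")

is.character(plot_area)  # TRUE
is.numeric(plot_area)    # FALSE
```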

3.2 Factors

Factors are used to represent categorical data in R. Factors always have levels and can only contain predefined values. Good examples of data that can be represented as factors are survey responses such as educational status, income level and satisfaction rating. There are two types of factors, or categorical data: ordered (ordinal) and unordered (nominal). The function factor() is used to create factors in R.

3.2.1 Nominal Factor

These are unordered categorical data with levels. There is no ranking or degree of distance between the categories of a nominal factor. Examples are the names of states in a country, names of trees, and nursery plantation sites. Below are three examples: one for tree names, another for poultry birds according to use, and the last for fertilizer.

tree_names <- factor(c("Lophira alata", "Triplochiton scleroxylon", "Mansonia altissima",
                       "Celtis africana", "Borassus aethiopum"))
poultry_birds <- factor(c("breeders", "broilers", "layers"))
fertilizer <- factor(c("N", "P", "K"))
tree_names
[1] Lophira alata            Triplochiton scleroxylon Mansonia altissima      
[4] Celtis africana          Borassus aethiopum      
5 Levels: Borassus aethiopum Celtis africana ... Triplochiton scleroxylon
poultry_birds
[1] breeders broilers layers  
Levels: breeders broilers layers
fertilizer
[1] N P K
Levels: K N P

The function c() used above combines values.

When we print these variables, notice that Levels is included in the output, with the categories arranged in alphabetical order. To see the levels of a factor object, pass the object to the function levels().

levels(tree_names)
[1] "Borassus aethiopum"       "Celtis africana"         
[3] "Lophira alata"            "Mansonia altissima"      
[5] "Triplochiton scleroxylon"

We can check the class of these objects and confirm that they are indeed factors.

class(tree_names)
[1] "factor"
class(poultry_birds)
[1] "factor"
class(fertilizer)
[1] "factor"

When you use the typeof() function on these objects, the result is integer, a type discussed in 3.3.1. This is because factors are built on top of integers.

typeof(tree_names)
[1] "integer"

3.2.2 Ordinal Factors

Ordinal factors differ from nominal factors in that their levels have a meaningful order. When the nominal factors above were created, we did not specify the levels argument, so R arranged the levels alphabetically. For an ordinal factor we list the levels in their intended order and set ordered = TRUE, which indicates a rank between the categories. Examples are months, days of the week, satisfaction level and an employee’s position in an organization. Let’s construct a hypothetical survey response on the level of satisfaction with trees being cut down for needs other than conservation:

satisfaction_level <- factor(c("very satisfied", "satisfied", "satisfied",
                               "dissatisfied", "very satisfied", "very dissatisfied",
                               "very satisfied", "dissatisfied"),
                             levels = c("very dissatisfied", "dissatisfied",
                                        "neutral", "satisfied", "very satisfied"),
                             ordered = TRUE)
satisfaction_level
[1] very satisfied    satisfied         satisfied         dissatisfied     
[5] very satisfied    very dissatisfied very satisfied    dissatisfied     
5 Levels: very dissatisfied < dissatisfied < neutral < ... < very satisfied
class(satisfaction_level)
[1] "ordered" "factor" 

On printing the variable we can see that:

  • The levels are not arranged in alphabetical order.
  • The levels are printed in the order specified in the levels argument.
  • The value neutral, which is listed in levels but does not appear in the data collected, is still printed as one of the levels. This indicates that the levels of a factor are predefined, even if no data were collected for a particular level.
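Because the levels of an ordered factor carry a rank, comparisons between values become meaningful. The sketch below uses a small hypothetical factor, rating, built with the same levels as above. It is an illustration only, not part of the survey data.

```r
# A small ordered factor using the same five levels as the survey example
rating <- factor(c("satisfied", "dissatisfied", "very satisfied"),
                 levels = c("very dissatisfied", "dissatisfied",
                            "neutral", "satisfied", "very satisfied"),
                 ordered = TRUE)

# Comparisons follow the order given in `levels`, not the alphabet
rating > "dissatisfied"  # TRUE FALSE TRUE

# The highest-ranked response present in the data
max(rating)              # very satisfied

# Unused levels such as "neutral" still count as levels
nlevels(rating)          # 5
```

The same comparison on an unordered factor is not meaningful; R returns NA with a warning.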

We can perform a quick check on the frequency of each category using the function table().

table(satisfaction_level)
satisfaction_level
very dissatisfied      dissatisfied           neutral         satisfied 
                1                 2                 0                 2 
   very satisfied 
                3 

The result shows neutral with a count of zero, meaning that no respondent chose neutral in this hypothetical survey.

We have been using functions a lot without properly defining what they are. For now, just keep in mind that functions are small programs, or blocks of code, that perform specified actions. They are discussed further in Chapter 4.

To confirm whether a variable is a factor, use the is.factor() function. Let’s create a new survey response, but this time it won’t be wrapped in the factor() function.

satisfaction_level_chr <- c("very satisfied", "satisfied", "satisfied",
                            "dissatisfied", "very satisfied", "very dissatisfied",
                            "very satisfied", "dissatisfied")
is.factor(satisfaction_level_chr)
[1] FALSE
is.factor(satisfaction_level)
[1] TRUE

We can also confirm whether a factor is ordered using the function is.ordered().

is.ordered(satisfaction_level)
[1] TRUE

3.3 Numeric Data Type

Numeric data are numbers. There are two types of numeric data: integer and double.

3.3.1 Integer

Integers are whole numbers and are written as a number followed by an L.

5L
[1] 5
class(5L)
[1] "integer"

When an integer is divided by another integer or by a double using /, the result is a double.

25L/5L
[1] 5
typeof(25L / 5L)
[1] "double"
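Division with / is the exception rather than the rule. As a quick sketch of how other arithmetic operators treat integers (the variable names here are only for illustration), addition, multiplication, integer division and remainder all keep the integer type as long as both operands are integers.

```r
# Arithmetic on two integers usually stays integer
sum_type      <- typeof(5L + 2L)     # "integer"
product_type  <- typeof(5L * 2L)     # "integer"
quotient_type <- typeof(25L %/% 5L)  # "integer" (integer division)
remainder_type <- typeof(25L %% 4L)  # "integer" (remainder)

# Mixing an integer with a double gives a double
mixed_type    <- typeof(5L + 2)      # "double": 2 without L is a double
```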

Just like characters and factors, we can also check if a particular object is an integer using is.integer().

is.integer(5)
[1] FALSE
is.integer(5L)
[1] TRUE

3.3.2 Doubles

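Doubles are also referred to as floats. They are numbers that can hold a decimal point. A number typed without the L suffix is stored as a double even if it has no decimal part, which is why is.integer(5) returned FALSE above.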

5.1
[1] 5.1
typeof(5.1)
[1] "double"

When a mathematical operation is performed on a double, a double is returned.

typeof(1.5/3)
[1] "double"

To check whether an object is a double we use its is.*() variant, in this case is.double().

is.double(2.9)
[1] TRUE
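One practical consequence of working with doubles is that they are stored with limited precision, so comparing them with == can surprise us. The sketch below is a standard illustration (the variable total is only for this example), and all.equal() is the base R function for comparing numbers within a small tolerance.

```r
# A number typed without L is a double
typeof(5)                      # "double"

# Doubles are stored with limited precision
total <- 0.1 + 0.2
total == 0.3                   # FALSE

# Compare doubles within a small tolerance instead
isTRUE(all.equal(total, 0.3))  # TRUE
```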

A sequence of integers or doubles can be created using :, with the start value on the left and the end value on the right of the symbol.

1:15
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
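The type of the sequence depends on the start value. As a brief sketch (variable names are illustrative), a whole-number start gives integers and a decimal start gives doubles. For steps other than one, the base R function seq() accepts a by argument.

```r
# Whole-number start: the sequence is made of integers
whole_steps <- 1:15
typeof(whole_steps)            # "integer"

# Decimal start: the sequence is made of doubles, still in steps of one
half_steps <- 1.5:4.5          # 1.5 2.5 3.5 4.5
typeof(half_steps)             # "double"

# seq() allows a custom step size
quarter_steps <- seq(0, 1, by = 0.25)  # 0.00 0.25 0.50 0.75 1.00
```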

Just as is.integer() returns TRUE for integers and is.double() returns TRUE for doubles, is.numeric() returns TRUE for both types of data.

is.numeric(5L)
[1] TRUE
is.numeric(3.5)
[1] TRUE

3.4 Logical

Logical data are also referred to as Boolean values. The logical values are TRUE, FALSE and NA, which means Not Available.

class(TRUE)
[1] "logical"
class(NA)
[1] "logical"
class(FALSE)
[1] "logical"

In Chapter 2.2, we saw that comparison operators always return a logical value as their result. We can also check the type of a comparison as it is evaluated.

class(5 > 2)
[1] "logical"

5 > 2 is evaluated first and returns TRUE, then class() is applied to that result and returns "logical". In R, the innermost of the nested expressions is evaluated first, and evaluation proceeds outward to the enclosing expression. We can also check for a logical value using is.logical().

is.logical(20 < 3)
[1] TRUE
is.logical(NA)
[1] TRUE
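Since NA is itself a logical value, a comparison involving a missing value returns NA rather than TRUE or FALSE. The sketch below uses a hypothetical vector of heights, height_m, with one missing measurement.

```r
# Hypothetical heights with one missing measurement
height_m <- c(12, NA, 30)

# Comparing a missing value gives NA, not TRUE or FALSE
is_tall <- height_m > 15
is_tall                     # FALSE NA TRUE

# is.na() identifies the missing results
is.na(is_tall)              # FALSE TRUE FALSE

# Count the TRUE values while ignoring the missing one
sum(is_tall, na.rm = TRUE)  # 1
```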

3.5 Complex

We will not use the complex data type in this book, and you are unlikely to analyze this type of data. Complex numbers have an imaginary part, written with the suffix i.

5i
[1] 0+5i
1 + 5i
[1] 1+5i
typeof(3i)
[1] "complex"
typeof(5 + 2i)
[1] "complex"

As with other data types, we can confirm whether a value is complex using is.complex().

is.complex(5i)
[1] TRUE

3.6 Changing Data Types

In R we can change data types using the as.*() functions, which mirror the is.*() functions used to check a data type.

tree_height <- 20:40
tree_height
 [1] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
typeof(tree_height)
[1] "integer"

The above can be changed to a double using as.double(), and back to an integer using as.integer().

tree_height_2 <- as.double(tree_height)
typeof(tree_height_2)
[1] "double"
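Converting in the other direction deserves a little care. As a minimal sketch, as.integer() drops the decimal part of a double rather than rounding it, so round() is needed first if the nearest whole number is wanted.

```r
# as.integer() truncates toward zero; it does not round
as.integer(3.9)          # 3
as.integer(-3.9)         # -3

# round() gives the nearest whole number, but still as a double
rounded <- round(3.9)    # 4
typeof(rounded)          # "double"

# Combine the two to get a rounded integer
rounded_int <- as.integer(round(3.9))  # 4
typeof(rounded_int)      # "integer"
```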

We can also change characters to factors and factors to characters in a similar way.

tree_names <- c("terminalia", "eucalyptus", "iroko", "oak")
class(tree_names)
[1] "character"

To change the above to factor data type we’ll use the as.factor() function.

tree_names_fct <- as.factor(tree_names)

class(tree_names_fct)
[1] "factor"

We can change this back to character by using as.character().

tree_names_chr <- as.character(tree_names_fct)
class(tree_names_chr)
[1] "character"

as.character() can be used to coerce any of these data types to character.

tree_height
 [1] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
class(tree_height)
[1] "integer"
tree_height_chr <- as.character(tree_height)
class(tree_height_chr)
[1] "character"

Characters can be changed back to numeric data using the as.double(), as.integer() or as.numeric() functions. These functions return a number only for elements that contain a valid number; any other element becomes NA, and R issues a warning.

tree_height <- c(10, 15, 20, "25m")
tree_height
[1] "10"  "15"  "20"  "25m"
class(tree_height)
[1] "character"
tree_height <- as.double(tree_height)
Warning: NAs introduced by coercion
tree_height
[1] 10 15 20 NA
class(tree_height)
[1] "numeric"
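If the extra text follows a known pattern, it can be removed before converting so that no value is lost. The sketch below assumes the only stray text is a trailing "m" for metres; sub() is a base R function that replaces the first match of a pattern, and the variable names are only for illustration.

```r
# Heights as text, one of them with a unit attached
raw_height <- c("10", "15", "20", "25m")

# Remove a trailing "m" (assumed to be the only stray text), then convert
clean_height <- as.numeric(sub("m$", "", raw_height))
clean_height         # 10 15 20 25

# Confirm that no NA was introduced
anyNA(clean_height)  # FALSE
```

For messier data it is safer to inspect the values that fail to convert than to strip characters blindly.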

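We can also convert factors to integers. As said in 3.2, factors are built on integers, so the result is the position of each value’s level (here the levels are in alphabetical order: olive, orange, tomato).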

fruits <- factor(c("orange", "tomato", "olive"))
class(fruits)
[1] "factor"
as.integer(fruits)
[1] 2 3 1
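This behaviour matters when the labels of a factor are themselves numbers. In the sketch below, girth_class is a hypothetical factor whose labels are numbers stored as text. as.integer() returns the level positions rather than the numbers, so the usual approach is to convert to character first.

```r
# Hypothetical factor whose labels are numbers stored as text
girth_class <- factor(c("30", "10", "20"))

# as.integer() returns level positions, not the original numbers
as.integer(girth_class)                # 3 1 2

# Convert to character first to recover the numbers
as.numeric(as.character(girth_class))  # 30 10 20
```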

We can also convert other data types to the logical data type. Every number, whether positive or negative, converts to TRUE, except zero, which converts to FALSE.

as.logical(c(1, 0, -1, 3, 0, 0 , 32))
[1]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE

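For characters, T, TRUE, True and true convert to TRUE, and F, FALSE, False and false convert to FALSE. Everything else converts to NA. In the example below the numbers are combined with strings, so c() turns them into characters first and they also convert to NA.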

as.logical(c("TRUE", 1, 0, "FALSE", "F", "T", 3, "False", "man", "true", "tRUE", "True", "t"))
 [1]  TRUE    NA    NA FALSE FALSE  TRUE    NA FALSE    NA  TRUE    NA  TRUE
[13]    NA

Data coercion (conversion) can be either implicit or explicit. Coercion is explicit when we intentionally change the variable type, as with the as.*() functions. Coercion is implicit when R changes the variable type for us automatically.
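Implicit coercion is easiest to see when values of different types are combined with c(). As a brief sketch with made-up values, R converts everything to the most flexible type present, and logical values become numbers when used in arithmetic.

```r
# Mixing types in c(): everything becomes character
mixed <- c(1, "a", TRUE)
mixed                    # "1" "a" "TRUE"
class(mixed)             # "character"

# Integer combined with double: everything becomes double
number_type <- typeof(c(1L, 2.5))  # "double"

# Logical values used in arithmetic: TRUE counts as 1
true_total <- TRUE + TRUE          # 2
```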

3.7 Other Data Types

There are other data types not covered in this chapter, such as:

  • Date
  • POSIXct
  • Raw
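For orientation only, the sketch below creates one value of each of these types using base R functions. The date and time are arbitrary values chosen for the example.

```r
# Date: a calendar date
planting_date <- as.Date("2024-03-15")
class(planting_date)   # "Date"
typeof(planting_date)  # "double": dates are stored as numbers

# POSIXct: a date together with a time
measured_at <- as.POSIXct("2024-03-15 09:30:00", tz = "UTC")
class(measured_at)     # "POSIXct" "POSIXt"

# Raw: a single byte, printed in hexadecimal
byte <- as.raw(255)
byte                   # ff
typeof(byte)           # "raw"
```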

Summary

The data types in R used in this book include character, factor, double, integer, and logical. These data types can be checked using class(), typeof() and the corresponding is.*() functions, and can be converted from one type to another using the as.*() functions.