One of the most common sources of error when using R, or when performing analysis on any platform or software, is wrong, inappropriate, or inconsistent data types. Understanding data types in R is fundamental for effective data analysis and manipulation. Some operations and analyses are only possible with specific types of data, and getting this wrong gives unexpected results. Researchers in every field are analysts working with various forms of data, such as survey responses, secondary data, and field or lab measurements. These data are usually a combination of numbers in the form of measurements, words or text such as survey participants' responses, and rankings or categories such as yes/no responses, income level, or marital status. Before proceeding with any analysis, it is crucial to understand the types of data we are working with. In the research process, as seen in Figure 3.1, data collection and analysis are key stages.
Each type of data has its own structure and requirements, which influence the method used to analyze it. Understanding the distinctions between data types ensures we can manipulate, analyze, and interpret our data effectively. Here, we will explore the data types in R.
There are six different data types in R, five of which will be used frequently in this book. In R, the functions typeof() and class() are used to check the data type of a variable. The data types in R are character, factor, integer, double, logical, and complex.
Characters represent text and are written in R by enclosing the content in quotation marks. The quotation marks can be either single ' or double ", and anything can be the content of a string as long as the quotation marks enclose it. It is more common to use the double " than the single '.
"tree"
[1] "tree"
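The point about quotation marks can be illustrated with a minimal sketch. The variable names here are hypothetical and chosen only for this example.

```r
# Both quote styles create the same character value.
single_quoted <- 'tree'
double_quoted <- "tree"
same_value <- identical(single_quoted, double_quoted)
same_value

# Using one style on the outside lets the other style appear inside the text.
quoted_text <- 'the "iroko" tree'
quoted_text
class(quoted_text)
```

Mixing the two styles this way is one reason both are accepted, even though the double quotation mark is the more common choice.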
We can check the data type of the text “tree” with either class() or typeof().
typeof("tree")
[1] "character"
class("tree")
[1] "character"
class("2")
[1] "character"
class("+")
[1] "character"
When we assign characters to a variable and check its type, the content of the variable is evaluated and not the variable itself. As explained in Chapter 2.4, variables are references and not values in themselves.
tree <- "Quercus robur"
class(tree)
[1] "character"
In the example below, we get an error because oak is an object with no value assigned to it. So oak and “oak” are different: the latter is a character while the former is an object.
class(oak)
Error: object 'oak' not found
Besides class() and typeof(), there is another way to check whether an object is a character. The is.character() function returns TRUE or FALSE depending on whether the object is of the character data type.
is.character("moabi")
[1] TRUE
is.character(2)
[1] FALSE
is.character("2")
[1] TRUE
Do not be surprised when you import some data into R and see that numbers are stored as characters. Confirm the data type using the is._ family of functions you will see in this chapter.
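As a small sketch of that situation, suppose a column of heights arrived as text after an import. The vector below is made up for illustration and is not taken from any dataset in this book.

```r
# Hypothetical values that look like numbers but are stored as text.
imported_heights <- c("12.5", "14", "9.8")

is.character(imported_heights)  # the values are text
is.numeric(imported_heights)    # so they are not numeric

# Convert explicitly before doing any calculation.
heights <- as.numeric(imported_heights)
is.numeric(heights)
mean(heights)
```

Checking with the is._ functions first makes it clear why a calculation fails before any conversion is attempted.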
Factors are used to represent categorical data in R. Factors always have levels and can only contain predefined values. Good examples of data that can be represented as factors are survey responses such as educational status, income level, and satisfaction rating. There are two types of factors or categorical data: ordered (ordinal) and unordered (nominal). The function factor() is used to create factors in R.
Nominal factors are unordered categorical data with levels. There is no degree of distance or ranking between the values of a nominal factor. Examples are the names of states in a country, names of trees, and nursery plantation sites. Below are three examples: one for tree names, another for poultry birds according to use, and the last for fertilizer.
tree_names <- factor(c("Lophira alata", "Triplochiton scleroxylon", "Mansonia altissima",
                       "Celtis africana", "Borassus aethiopum"))
poultry_birds <- factor(c("breeders", "broilers", "layers"))
fertilizer <- factor(c("N", "P", "K"))
tree_names
[1] Lophira alata Triplochiton scleroxylon Mansonia altissima
[4] Celtis africana Borassus aethiopum
5 Levels: Borassus aethiopum Celtis africana ... Triplochiton scleroxylon
poultry_birds
[1] breeders broilers layers
Levels: breeders broilers layers
fertilizer
[1] N P K
Levels: K N P
The function c() used above combines values.
When we print these variables, notice that Levels is included as part of the output, with the categories arranged in alphabetical order. To see the levels of a factor object, pass the object to the function levels().
levels(tree_names)
[1] "Borassus aethiopum" "Celtis africana"
[3] "Lophira alata" "Mansonia altissima"
[5] "Triplochiton scleroxylon"
We can check the class of these objects and confirm that they are indeed factors.
class(tree_names)
[1] "factor"
class(poultry_birds)
[1] "factor"
class(fertilizer)
[1] "factor"
When you use the typeof() function on these objects, the result is integer, which will be discussed in 3.3.1. This is because factors are built on top of integers.
typeof(tree_names)
[1] "integer"
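A short sketch can make the relationship between a factor and its underlying integers more concrete. It recreates the fertilizer factor from above; the names codes and labels_back are introduced only for this illustration.

```r
fertilizer <- factor(c("N", "P", "K"))

# Integer codes: the position of each value within the sorted levels.
codes <- as.integer(fertilizer)
codes

# The levels that those codes point to.
levels(fertilizer)

# Indexing the levels by the codes recovers the original labels.
labels_back <- levels(fertilizer)[codes]
labels_back
```

In other words, a factor stores one integer per value plus a lookup table of levels, which is why typeof() reports integer.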
Ordinal factors are somewhat different from nominal or regular factors. When the nominal factors above were created, we did not specify the levels argument, so R sorted the levels alphabetically. When we specify the levels and set ordered = TRUE, we indicate a rank or order among the categories. Examples are months, days of the week, satisfaction levels, and employees' positions in an organization. Let’s construct a hypothetical survey response on the satisfaction level with having trees cut down for needs other than conservation:
satisfaction_level <- factor(c("very satisfied", "satisfied", "satisfied",
                               "dissatisfied", "very satisfied", "very dissatisfied",
                               "very satisfied", "dissatisfied"),
                             levels = c("very dissatisfied", "dissatisfied",
                                        "neutral", "satisfied", "very satisfied"),
                             ordered = TRUE
)
satisfaction_level
[1] very satisfied satisfied satisfied dissatisfied
[5] very satisfied very dissatisfied very satisfied dissatisfied
5 Levels: very dissatisfied < dissatisfied < neutral < ... < very satisfied
class(satisfaction_level)
[1] "ordered" "factor"
On printing the variable, we can see that the levels are listed in the order we specified, separated by <. We can perform a quick check on the frequency of each category using the function table().
table(satisfaction_level)
satisfaction_level
very dissatisfied      dissatisfied           neutral         satisfied
                1                 2                 0                 2
   very satisfied
                3
The result shows neutral as having zero, meaning that no respondent chose neutral in this hypothetical survey.
We have been using functions a lot without properly defining what functions are. For now, keep in mind that functions are small programs or blocks of code that perform specified actions. More will be discussed in Chapter 4.
To confirm whether a variable is a factor, use is.factor(). Let’s create a new survey response, but this time it won’t be wrapped in the factor() function.
satisfaction_level_chr <- c("very satisfied", "satisfied", "satisfied",
                            "dissatisfied", "very satisfied", "very dissatisfied",
                            "very satisfied", "dissatisfied")
is.factor(satisfaction_level_chr)
[1] FALSE
is.factor(satisfaction_level)
[1] TRUE
We can also confirm whether a factor is ordered using the function is.ordered().
is.ordered(satisfaction_level)
[1] TRUE
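One practical consequence of an ordered factor is that its values can be compared by rank. The sketch below uses a shortened, hypothetical set of responses with the same levels as above.

```r
satisfaction <- factor(
  c("satisfied", "dissatisfied", "very satisfied"),
  levels = c("very dissatisfied", "dissatisfied",
             "neutral", "satisfied", "very satisfied"),
  ordered = TRUE
)

# Comparison operators follow the order of the levels.
above_neutral <- satisfaction > "neutral"
above_neutral

# The highest-ranked response present in the data.
highest <- max(satisfaction)
as.character(highest)
```

These comparisons are only meaningful for ordered factors, where the levels carry a rank.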
Numeric data are numbers. There are two types of numeric data: integer and double.
Integers are whole numbers and are written as a number followed by an L.
5L
[1] 5
class(5L)
[1] "integer"
When an integer is divided by another integer or another numeric data type using /, the result is a double, even when the division is exact.
25L/5L
[1] 5
typeof(25L / 5L)
[1] "double"
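To see how the result type depends on the operation, here is a brief sketch comparing a few arithmetic operators on integers. The variable names are hypothetical.

```r
# Adding or multiplying two integers keeps the integer type.
sum_type <- typeof(5L + 2L)
product_type <- typeof(5L * 2L)

# Integer division with %/% also stays integer.
int_div_type <- typeof(25L %/% 5L)

# Ordinary division, or mixing in a double, gives a double.
div_type <- typeof(25L / 5L)
mixed_type <- typeof(5L + 2)

c(sum_type, product_type, int_div_type, div_type, mixed_type)
```

So it is the / operator, and any operation involving a double, that moves the result from integer to double.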
Just like characters and factors, we can also check if a particular object is an integer using is.integer().
is.integer(5)
[1] FALSE
is.integer(5L)
[1] TRUE
Doubles are also referred to as floats. They are numbers that can have a decimal point. Any number typed without the L suffix is stored as a double.
5.1
[1] 5.1
typeof(5.1)
[1] "double"
When a mathematical operation is performed on a double, a double is returned.
typeof(1.5/3)
[1] "double"
To check if an object is a double we use its is._ variant, in this case is.double().
is.double(2.9)
[1] TRUE
A sequence of integers or doubles can be created using : with the start on the left and the end on the right of the symbol.
1:15
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
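The following sketch shows that the type of a sequence depends on its starting value. It also uses seq(), a base R function not otherwise covered in this chapter, as one possible way to choose a step size.

```r
# Whole-number endpoints give an integer sequence.
int_seq <- 1:5
typeof(int_seq)

# A start value with a decimal part gives a double sequence in steps of 1.
dbl_seq <- 1.5:4.5
dbl_seq
typeof(dbl_seq)

# seq() allows a step other than 1.
step_seq <- seq(0, 1, by = 0.25)
step_seq
```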
While is.integer() returns TRUE only for integers and is.double() returns TRUE only for doubles, is.numeric() returns TRUE for both types of data.
is.numeric(5L)
[1] TRUE
is.numeric(3.5)
[1] TRUE
Logical data are also referred to as Boolean values. Logical values include TRUE, FALSE, and NA, which means Not Available.
class(TRUE)
[1] "logical"
class(NA)
[1] "logical"
class(FALSE)
[1] "logical"
In Chapter 2.2, we saw that comparison operators always return a logical value as their result. We can also check the type of an expression as it is evaluated.
class(5 > 2)
[1] "logical"
5 > 2 is evaluated first, which returns TRUE, then the class() of the result is checked, which returns "logical". In R, the innermost of the nested expressions is evaluated first, and evaluation proceeds outward to the enclosing expression. We can also check for a logical value using is.logical().
is.logical(20 < 3)
[1] TRUE
is.logical(NA)
[1] TRUE
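The inside-out evaluation described above can be sketched by splitting the nested call into separate steps. The intermediate names are hypothetical and exist only to make each step visible.

```r
# Step 1: the innermost expression is evaluated first.
comparison <- 5 > 2
comparison

# Step 2: the outer function receives that result.
comparison_class <- class(comparison)
comparison_class

# The nested form gives the same answer in one line.
same_result <- identical(class(5 > 2), comparison_class)
same_result
```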
We will not use the complex data type, and it is unlikely that you will need to analyze this type of data. Complex numbers have an imaginary term, written with i, added to them.
5i
[1] 0+5i
1 + 5i
[1] 1+5i
typeof(3i)
[1] "complex"
typeof(5 + 2i)
[1] "complex"
Similar to other data types, we can confirm whether a value is complex using is.complex().
is.complex(5i)
[1] TRUE
In R we can change data types using the as._ functions, similar to the is._ functions which are used to check a data type.
tree_height <- 20:40
tree_height
[1] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
typeof(tree_height)
[1] "integer"
The above can be changed to a double using as.double() and back to an integer using as.integer().
tree_height_2 <- as.double(tree_height)
typeof(tree_height_2)
[1] "double"
We can also change characters to factors and factors to characters in a similar way.
tree_names <- c("terminalia", "eucalyptus", "iroko", "oak")
class(tree_names)
[1] "character"
To change the above to the factor data type, we’ll use the as.factor() function.
tree_names_fct <- as.factor(tree_names)
class(tree_names_fct)
[1] "factor"
We can change this back to character by using as.character().
tree_names_chr <- as.character(tree_names_fct)
class(tree_names_chr)
[1] "character"
as.character() can be used to coerce any data type to a character.
tree_height
[1] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
class(tree_height)
[1] "integer"
tree_height_chr <- as.character(tree_height)
class(tree_height_chr)
[1] "character"
Characters can be changed back to a numeric data type using the as.double(), as.integer(), or as.numeric() function. These functions return a number only when the text being converted is a valid number; otherwise NA is returned, and R gives a warning.
tree_height <- c(10, 15, 20, "25m")
tree_height
[1] "10" "15" "20" "25m"
class(tree_height)
[1] "character"
tree_height <- as.double(tree_height)
Warning: NAs introduced by coercion
tree_height
[1] 10 15 20 NA
class(tree_height)
[1] "numeric"
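If the unit is the only thing preventing the conversion, one possible approach is to strip it before converting. The sketch below assumes the unit is always a trailing "m" and uses sub(), a base R function not covered in this chapter; raw_height and the other names are hypothetical.

```r
raw_height <- c("10", "15", "20", "25m")

# Remove a trailing "m" from each value before converting.
clean_height <- sub("m$", "", raw_height)
tree_height_num <- as.double(clean_height)
tree_height_num

# anyNA() confirms that no value was lost in the conversion.
anyNA(tree_height_num)
```

Whether removing the unit is appropriate depends on the data; if values were recorded in different units, they would need to be handled separately.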
We can also convert factors to integers. As said in 3.2, factors are built on integers.
fruits <- factor(c("orange", "tomato", "olive"))
class(fruits)
[1] "factor"
as.integer(fruits)
[1] 2 3 1
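Because as.integer() returns the level codes, a factor whose labels look like numbers needs care. The sketch below uses a made-up factor of diameters and shows one common way to recover the values, by converting to character first.

```r
# Hypothetical diameters stored as a factor.
diameter_fct <- factor(c("30", "10", "20"))

# as.integer() returns the level codes, not the numbers shown.
diameter_codes <- as.integer(diameter_fct)
diameter_codes

# Converting through character first recovers the values.
diameter <- as.numeric(as.character(diameter_fct))
diameter
```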
We can also convert other data types to the logical data type. Every number, whether positive or negative, evaluates to TRUE, except zero, which evaluates to FALSE.
as.logical(c(1, 0, -1, 3, 0, 0 , 32))
[1] TRUE FALSE TRUE TRUE FALSE FALSE TRUE
For characters, everything converts to NA except T, TRUE, True, and true, which convert to TRUE, and F, FALSE, False, and false, which convert to FALSE.
as.logical(c("TRUE", 1, 0, "FALSE", "F", "T", 3, "False", "man", "true", "tRUE", "True", "t"))
[1] TRUE NA NA FALSE FALSE TRUE NA FALSE NA TRUE NA TRUE
[13] NA
Data coercion/conversion can either be implicit or explicit. Coercion is explicit when we intentionally change the variable type. Implicit coercion is when R changes the variable type for us automatically.
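The difference between implicit and explicit coercion can be sketched with a few small vectors. The names below are hypothetical; the behaviour shown is how c() and arithmetic treat mixed types in base R.

```r
# Implicit: c() converts everything to the most flexible type present.
mixed_chr <- c(10, 15, "25m")  # numbers and text become character
mixed_dbl <- c(1L, 2.5)        # integer and double become double
mixed_int <- c(TRUE, 5L)       # logical and integer become integer

# Implicit: logical values are treated as 1 and 0 in arithmetic.
total <- TRUE + TRUE

# Explicit: we request the conversion ourselves.
explicit_chr <- as.character(c(10, 15))

c(typeof(mixed_chr), typeof(mixed_dbl), typeof(mixed_int), typeof(explicit_chr))
```

Implicit coercion explains the earlier example in which combining numbers with "25m" produced a character vector without any warning.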
There are other data types not covered such as:
Chapter 3 of Advanced R by Hadley Wickham (2019) gives an in-depth understanding of the data types in R.
Hands-On Programming with R by Garrett Grolemund explains the concepts of R in a game-like approach using a deck of cards.
Data types in R include character, factor, double, integer, and logical. These data types can be checked using class(), typeof(), and their corresponding is._ functions. They can be converted from one type to another using the as._ functions.