3 Tidy | The Tidyverse Cookbook (2024)

This chapter includes the following recipes:

  1. Create a tibble manually
  2. Convert a data frame to a tibble
  3. Convert a tibble to a data frame
  4. Preview the contents of a tibble
  5. Inspect every cell of a tibble
  6. Spread a pair of columns into a field of cells
  7. Gather a field of cells into a pair of columns
  8. Separate a column into new columns
  9. Unite multiple columns into a single column

What you should know before you begin

Data tidying refers to reshaping your data into a tidy data frame or tibble. Data tidying is an important first step for your analysis because most tidyverse functions expect your data to be stored as tidy data.

Tidy data is tabular data organized so that:

  1. Each column contains a single variable
  2. Each row contains a single observation

Tidy data is not an arbitrary requirement of the tidyverse; it is the ideal data format for doing data science with R. Tidy data makes it easy to extract every value of a variable to build a plot or to compute a summary statistic. Tidy data also makes it easy to compute new variables; when your data is tidy, you can rely on R’s rowwise operations to maintain the integrity of your observations. Moreover, R can directly manipulate tidy data with its fast, built-in vectorised operations, which lets your code run as fast as possible.
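
As a minimal sketch of the vectorised computation described above, assuming the tidyr and dplyr packages are installed, the following adds a new variable to table1, a tidy data set that ships with tidyr. The column name rate and the scaling by 10,000 are illustrative choices, not part of the recipe.

```r
library(tidyr)   # provides the table1 example data
library(dplyr)   # provides mutate()

# Each column is a variable, so one vectorised expression computes the
# new variable for every observation (row) at once.
with_rate <- table1 %>%
  mutate(rate = cases / population * 10000)

with_rate
```

Because every row is one observation, the division pairs each cases value with the population value from the same country and year.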

The definition of Tidy Data isn’t complete until you define variable and observation, so let’s borrow two definitions from R for Data Science:

  1. A variable is a quantity, quality, or property that you can measure.
  2. An observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object).

As you work with data, you will be surprised to realize that what is a variable (or observation) will depend less on the data itself and more on what you are trying to do with it. With enough mental flexibility, you can consider anything to be a variable. However, some variables will be more useful than others for any specific task. In general, if you can formulate your task as an equation (math or code that contains an equals sign), the most useful variables will be the names in the equation.

3.1 Create a tibble manually

You want to create a tibble from scratch by typing in the contents of the tibble.

Solution
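
A minimal sketch of a tribble() call that builds the tibble printed below, assuming the tibble package is installed. The object name letters_tbl is hypothetical.

```r
library(tibble)

# Column names are written as formulas (~name); the values follow row by row.
letters_tbl <- tribble(
  ~number, ~letter, ~greek,
        1,     "a", "alpha",
        2,     "b", "beta",
        3,     "c", "gamma"
)

letters_tbl
```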

## # A tibble: 3 x 3
##   number letter greek
##    <dbl> <chr>  <chr>
## 1      1 a      alpha
## 2      2 b      beta
## 3      3 c      gamma

Discussion

tribble() creates a tibble and tricks you into typing out a preview of the result. To use tribble(), list each column name preceded by a ~, then list the values of the tibble in a rowwise fashion. If you take care to align your columns, the transposed syntax of tribble() becomes a preview of the table.

You can also create a tibble with tibble(), whose syntax mirrors data.frame():
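
For instance, a sketch of the same three-row table built column by column (the object name letters_tbl2 is hypothetical):

```r
library(tibble)

# Each argument is a column name paired with a vector of values,
# just as in data.frame().
letters_tbl2 <- tibble(
  number = c(1, 2, 3),
  letter = c("a", "b", "c"),
  greek  = c("alpha", "beta", "gamma")
)

letters_tbl2
```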

3.2 Convert a data frame to a tibble

You want to convert a data frame to a tibble.

Solution
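
A minimal sketch, assuming the tibble package is installed: as_tibble() converts a data frame to a tibble. The example uses R's built-in iris data frame, which matches the output printed below; the object name iris_tbl is hypothetical.

```r
library(tibble)

# iris is a plain data frame that ships with base R.
iris_tbl <- as_tibble(iris)

iris_tbl
```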

## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>
##  1          5.1         3.5          1.4         0.2 setosa
##  2          4.9         3            1.4         0.2 setosa
##  3          4.7         3.2          1.3         0.2 setosa
##  4          4.6         3.1          1.5         0.2 setosa
##  5          5           3.6          1.4         0.2 setosa
##  6          5.4         3.9          1.7         0.4 setosa
##  7          4.6         3.4          1.4         0.3 setosa
##  8          5           3.4          1.5         0.2 setosa
##  9          4.4         2.9          1.4         0.2 setosa
## 10          4.9         3.1          1.5         0.1 setosa
## # … with 140 more rows

3.3 Convert a tibble to a data frame

You want to convert a tibble to a data frame.

Solution
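
A minimal sketch, assuming the tidyr package is installed so that its table1 example tibble is available: base R's as.data.frame() drops the tibble classes and returns a plain data frame. The object name table1_df is hypothetical.

```r
library(tidyr)   # provides the table1 example tibble

table1_df <- as.data.frame(table1)

class(table1_df)
table1_df
```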

##       country year  cases population
## 1 Afghanistan 1999    745   19987071
## 2 Afghanistan 2000   2666   20595360
## 3      Brazil 1999  37737  172006362
## 4      Brazil 2000  80488  174504898
## 5       China 1999 212258 1272915272
## 6       China 2000 213766 1280428583

Discussion

Be careful to use as.data.frame() and not as_data_frame(), which is an alias for as_tibble().

3.4 Preview the contents of a tibble

You want to get an idea of what variables and values are stored in a tibble.

Solution
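
A minimal sketch, assuming the output below comes from the storms data set that ships with dplyr: typing the name of a tibble prints its preview. The dimensions you see may differ, because the data set has been updated across dplyr releases.

```r
library(dplyr)   # assumed source of the storms tibble

# Calling the tibble by name prints the preview.
storms
```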

## # A tibble: 10,010 x 13
##    name   year month   day  hour   lat  long status category  wind pressure
##    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
##  1 Amy    1975     6    27     0  27.5 -79   tropi… -1          25     1013
##  2 Amy    1975     6    27     6  28.5 -79   tropi… -1          25     1013
##  3 Amy    1975     6    27    12  29.5 -79   tropi… -1          25     1013
##  4 Amy    1975     6    27    18  30.5 -79   tropi… -1          25     1013
##  5 Amy    1975     6    28     0  31.5 -78.8 tropi… -1          25     1012
##  6 Amy    1975     6    28     6  32.4 -78.7 tropi… -1          25     1012
##  7 Amy    1975     6    28    12  33.3 -78   tropi… -1          25     1011
##  8 Amy    1975     6    28    18  34   -77   tropi… -1          30     1006
##  9 Amy    1975     6    29     0  34.4 -75.8 tropi… 0           35     1004
## 10 Amy    1975     6    29     6  34   -74.8 tropi… 0           40     1002
## # … with 10,000 more rows, and 2 more variables: ts_diameter <dbl>,
## #   hu_diameter <dbl>

Discussion

When you call a tibble directly, R will display enough information to give you a quick sense of the contents of the tibble. This includes:

  1. the dimensions of the tibble
  2. the column names and types
  3. as many cells of the tibble as will fit comfortably in your console window

3.5 Inspect every cell of a tibble

You want to see every value that is stored in a tibble.

Solution
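
A minimal sketch, assuming you are working in an interactive session (for example, in RStudio) and using the storms tibble from dplyr as a stand-in for your own data:

```r
library(dplyr)   # assumed source of the storms tibble

# View() opens a viewer window, so only call it when a person is at the console.
if (interactive()) {
  View(storms)
}
```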

Discussion

View() (with a capital V) opens the tibble in R’s data viewer, which will let you scroll to every cell in the tibble.

3.6 Spread a pair of columns into a field of cells

You want to pivot, convert long data to wide, or move variable names out of the cells and into the column names. These are different ways of describing the same action.

For example, table2 contains type, which is a column that repeats the variable names cases and population. To make table2 tidy, you must move the cases and population values into their own columns.

## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

Solution
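
A minimal sketch, assuming the tidyr package is installed (table2 ships with it). The object names wide and wide_pivot are hypothetical. Note that in tidyr 1.0.0 and later, spread() is superseded by pivot_wider(); the second call shows the equivalent form and is an addition to this recipe.

```r
library(tidyr)

# key names the column that holds variable names;
# value names the column that holds the matching values.
wide <- table2 %>%
  spread(key = type, value = count)

# Equivalent call with the newer interface (requires tidyr >= 1.0.0).
wide_pivot <- table2 %>%
  pivot_wider(names_from = type, values_from = count)

wide
```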

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Discussion

To use spread(), assign the column that contains variable names to key. Assign the column that contains the values that are associated with those names to value. spread() will:

  1. Make a copy of the original table
  2. Remove the key and value columns from the copy
  3. Remove every duplicate row in the data set that remains
  4. Insert a new column for each unique variable name in the key column
  5. Fill the new columns with the values of the value column in a way that preserves every relationship between values in the original data set

Since this is easier to see than explain, you may want to study the example above.

Each new column created by spread() will inherit the data type of the value column. If you would like to convert each new column to the most sensible data type given its final contents, add the argument convert = TRUE.
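
To illustrate convert = TRUE, here is a small sketch with made-up data. The table people and its columns field and entry are hypothetical; they are chosen so that the value column has to be character because it mixes numbers and text.

```r
library(tidyr)
library(tibble)

# Hypothetical long table: entry is character because it mixes ages and names.
people <- tribble(
  ~id, ~field, ~entry,
    1, "age",  "31",
    1, "name", "Ann",
    2, "age",  "45",
    2, "name", "Bob"
)

# Without convert, both new columns inherit the character type of entry.
as_is <- spread(people, key = field, value = entry)

# With convert = TRUE, each new column is re-typed from its contents.
converted <- spread(people, key = field, value = entry, convert = TRUE)

class(as_is$age)
class(converted$age)
```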

3.7 Gather a field of cells into a pair of columns

You want to convert wide data to long, reshape a two-by-two table, or move variable values out of the column names and into the cells. These are different ways of describing the same action.

For example, table4a is a two-by-two table with the column names 1999 and 2000. These names are values of a year variable. The field of cells in table4a contains counts of TB cases, which is another variable. To make table4a tidy, you need to move year and case values into their own columns.

## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

Solution
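
A minimal sketch, assuming the tidyr package is installed (table4a ships with it). The object names long and long_pivot are hypothetical. In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(); the second call shows the equivalent form and is an addition to this recipe.

```r
library(tidyr)

# Name the new key and value columns, then identify the columns to gather
# (here by position: the second and third columns).
long <- table4a %>%
  gather(key = "year", value = "cases", 2:3)

# Equivalent call with the newer interface (requires tidyr >= 1.0.0).
# The rows come back in a different order: grouped by country, not by year.
long_pivot <- table4a %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")

long
```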

## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

Discussion

gather() is the inverse of spread(): gather() collapses a field of cells that spans several columns into two new columns:

  1. A column of former “keys”, which contains the column names of the former field
  2. A column of former “values”, which contains the cell values of the former field

To use gather(), pick names for the new key and value columns, and supply them as strings. Then identify the columns to gather into the new key and value columns. gather() will:

  1. Create a copy of the original table
  2. Remove the identified columns from the copy
  3. Add a key column with the supplied name
  4. Fill the key column with the column names of the removed columns, repeating rows as necessary so that each combination of row and removed column name appears once
  5. Add a value column with the supplied name
  6. Fill the value column with the values of the removed columns in a way that preserves every relationship between values and column names in the original data set

Since this is easier to see than explain, you may want to study the example above.

Identify columns to gather

You can identify the columns to gather (i.e., remove) by:

  1. name
  2. index (numbers)
  3. inverse index (negative numbers that specify the columns to retain; all other columns will be removed)
  4. the select() helpers that come in the dplyr package

So for example, the following commands will do the same thing as the solution above:
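
A sketch of such equivalent calls, assuming tidyr is installed. The object names are hypothetical, and the regular expression passed to matches() is just one way to pick out the year columns with a select() helper.

```r
library(tidyr)

# 1. By name (backticks are needed because the names start with a digit)
by_name <- table4a %>% gather(key = "year", value = "cases", `1999`, `2000`)

# 2. By index
by_index <- table4a %>% gather(key = "year", value = "cases", 2:3)

# 3. By inverse index: keep the first column, gather everything else
by_inverse <- table4a %>% gather(key = "year", value = "cases", -1)

# 4. With a select() helper: gather columns whose names are all digits
by_helper <- table4a %>% gather(key = "year", value = "cases", matches("^[0-9]+$"))
```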

By default, the new key column will contain character strings. If you would like to convert the new key column to the most sensible data type given its final contents, add the argument convert = TRUE.
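
A short sketch of the difference, assuming tidyr is installed (the object names are hypothetical):

```r
library(tidyr)

# Default: the former column names are stored as character strings.
year_chr <- table4a %>% gather(key = "year", value = "cases", 2:3)

# convert = TRUE re-types the key column from its contents.
year_int <- table4a %>% gather(key = "year", value = "cases", 2:3, convert = TRUE)

class(year_chr$year)
class(year_int$year)
```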

3.8 Separate a column into new columns

You want to split a single column into multiple columns by separating each cell in the column into a row of cells. Each new cell should contain a separate portion of the value in the original cell.

For example, table3 combines cases and population values in a single column named rate. To tidy table3, you need to separate rate into two columns: one for the cases variable and one for the population variable.

## # A tibble: 6 x 3
##   country      year rate
## * <chr>       <int> <chr>
## 1 Afghanistan  1999 745/19987071
## 2 Afghanistan  2000 2666/20595360
## 3 Brazil       1999 37737/172006362
## 4 Brazil       2000 80488/174504898
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

Solution
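
A minimal sketch, assuming the tidyr package is installed (table3 ships with it). The object name separated is hypothetical. convert = TRUE is included because the output below shows integer columns; without it, both new columns would be character.

```r
library(tidyr)

separated <- table3 %>%
  separate(
    col = rate,                        # column to split
    into = c("cases", "population"),   # names of the new columns
    sep = "/",                         # split at the slash
    convert = TRUE                     # re-type the new columns
  )

separated
```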

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Discussion

To use separate(), pass the name of the column to split to the col argument, and pass a character vector of names for the new columns to the into argument. You should supply one name for each new column that you expect to appear in the result; a mismatch implies that something went wrong.

separate() will:

  1. Create a copy of the original data set
  2. Add a new column for each value of into. The values will become the names of the new columns.
  3. Split each cell of col into multiple values, based on the locations of a separator character.
  4. Place the new values into the new columns in order, one value per column
  5. Remove the col column. Add the argument remove = FALSE to retain the col column in the final result.

Since this is easier to see than explain, you may want to study the example above.

Each new column created by separate() will inherit the data type of the col column. If you would like to convert each new column to the most sensible data type given its final contents, add the argument convert = TRUE.

Control where cells are separated

By default, separate() will use any run of non-alphanumeric characters as a separator. Pass a regular expression to the sep argument to specify a different set of separators. Alternatively, pass an integer vector to the sep argument to split cells at specific character positions:

  • sep = 1 will split each cell between the first and second character.
  • sep = c(1, 3) will split each cell between the first and second character and then again between the third and fourth character.
  • sep = -1 will split each cell between the last and second to last character.
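
The three positional forms above can be sketched as follows. The tibble codes and its values are made up for illustration, as are the names given to the new columns.

```r
library(tidyr)
library(tibble)

codes <- tibble(code = c("AB123", "CD456"))  # hypothetical data

# sep = 1: split between the first and second character
first_split <- codes %>%
  separate(code, into = c("head", "tail"), sep = 1)

# sep = c(1, 3): split after the first character and again after the third
two_splits <- codes %>%
  separate(code, into = c("a", "b", "c"), sep = c(1, 3))

# sep = -1: split between the second-to-last and last character
last_split <- codes %>%
  separate(code, into = c("body", "last"), sep = -1)
```

Note that the length of sep is always one less than the length of into.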

Separate into multiple rows

separate_rows() behaves like separate() except that it places each new value into a new row (instead of into a new column).

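To use separate_rows(), follow a syntax similar to that of separate(): name the column or columns to split and, optionally, supply sep. There is no into argument, because the new values stay in the original column.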
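
A minimal sketch, assuming tidyr is installed and reusing table3 from the recipe above. The object name stacked is hypothetical.

```r
library(tidyr)

# Each rate cell such as "745/19987071" becomes two rows:
# one for the part before the slash and one for the part after it.
stacked <- table3 %>%
  separate_rows(rate, sep = "/")

stacked
```

The other columns (country and year) are repeated for each new row, so the six rows of table3 become twelve.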

3.9 Unite multiple columns into a single column

You want to combine several columns into a single column by uniting their values across rows.

For example, table5 splits the year variable across two columns: century and year. To make table5 tidy, you need to unite century and year into a single column.

## # A tibble: 6 x 4
##   country     century year  rate
## * <chr>       <chr>   <chr> <chr>
## 1 Afghanistan 19      99    745/19987071
## 2 Afghanistan 20      00    2666/20595360
## 3 Brazil      19      99    37737/172006362
## 4 Brazil      20      00    80488/174504898
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

Solution
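
A minimal sketch, assuming the tidyr package is installed (table5 ships with it). The object name united is hypothetical.

```r
library(tidyr)

united <- table5 %>%
  unite(col = "year", century, year, sep = "")  # paste century and year with no separator

united
```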

## # A tibble: 6 x 3
##   country     year  rate
##   <chr>       <chr> <chr>
## 1 Afghanistan 1999  745/19987071
## 2 Afghanistan 2000  2666/20595360
## 3 Brazil      1999  37737/172006362
## 4 Brazil      2000  80488/174504898
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583

Discussion

To use unite(), give the col argument a character string to use as the name of the new column to create. Then list the columns to combine. Finally, give the sep argument a separator character to use to paste together the values in the cells of each column. unite() will:

  1. Create a copy of the original data set
  2. Paste together the values of the listed columns in a vectorized (i.e., rowwise) fashion. unite() will place the value of sep between each value during the paste process.
  3. Append the results as a new column whose name is the value of col
  4. Remove the listed columns. To retain the columns in the result, add the argument remove = FALSE.

Since this is easier to see than explain, you may want to study the example above.

If you do not supply a sep value, unite() will use _ as a separator character. To avoid a separator character, use sep = "".
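
A short sketch of the default behaviour, assuming tidyr is installed and reusing table5 (the object name with_default is hypothetical):

```r
library(tidyr)

# No sep argument, so the pieces are joined with an underscore.
with_default <- table5 %>%
  unite(col = "year", century, year)

with_default$year
```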

FAQs

What is tidy function in tidyverse? ›

Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you'll spend less time fighting with the tools and more time working on your analysis.

What is the best way to learn the tidyverse? ›

The best place to start learning the tidyverse is R for Data Science (R4DS for short), an O'Reilly book written by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. It's designed to take you from knowing nothing about R or the tidyverse to having all the basic tools of data science at your fingertips.

Do tibbles require data to be tidy? ›

No. A tibble can store data in any layout. Data tidying refers to reshaping that data into a tidy data frame or tibble, and it is an important first step for your analysis because most tidyverse functions expect your data to be tidy.

What is the tidyverse package in R? ›

Tidyverse is an R programming package that helps to transform and better present data. It assists with data import, tidying, manipulation, and data visualization. The tidyverse package is open source, meaning that it is freely available to use and is constantly being modified and improved.

What package is tidy() in R? ›

The tidy() function in the broom package takes the messy output of built-in functions in R, such as lm() , and turns them into tidy data frames.

Are tidyverse and tidyr the same? ›

No. tidyr is the tidyverse package for making data frames tidy. Recall that in a tidy data frame each row is a unit of observation and each column is a single piece of information.

What is the purpose of tidy data? ›

Tidy data makes it easy for an analyst or a computer to extract needed variables because it provides a standard way of structuring a dataset. Compare the different versions of the classroom data: in the messy version you need to use different strategies to extract different variables.

How to tell if a dataset is tidy? ›

There are three interrelated rules that make a dataset tidy:
  1. Each variable is a column; each column is a variable.
  2. Each observation is a row; each row is an observation.
  3. Each value is a cell; each cell is a single value.

Why is tidyverse better than base R? ›

For instance, tidyverse is often your best bet for quick and easy data manipulation. Grouping datasets by many variables to create summary statistics is much easier with packages like dplyr than with Base-R functions.

What does %>% mean in tidyverse? ›

Use %>% to emphasise a sequence of actions, rather than the object that the actions are being performed on. Avoid using the pipe when: You need to manipulate more than one object at a time. Reserve pipes for a sequence of steps applied to one primary object.

How many packages are in tidyverse? ›

The tidyverse is a collection of R packages that are designed to work well together. There are about 25 packages in the tidyverse.

What does tidyr do in R? ›

The R package tidyr, developed by Hadley Wickham, provides functions to help you organize (or reshape) your data set into tidy format. It's particularly designed to work in combination with magrittr and dplyr to build a solid data analysis pipeline.

What is a tidy format in R? ›

Definition of a tidy data set

In R, it is easiest to work with data that follow three basic rules: every variable is stored in its own column; every observation is stored in its own row (that is, every row corresponds to a single case); and each value of a variable is stored in a cell of the table.

Is tidytext in tidyverse? ›

tidytext is an R package that applies the principles of the tidyverse to analyzing text. (We will also touch upon the quanteda package, which is good for quantitative tasks like counting the number of words and syllables in a body of text.)

Is Tidymodels included in tidyverse? ›

Modeling with the tidyverse uses the collection of tidymodels packages, which largely replace the modelr package used in R4DS. These packages provide a comprehensive foundation for creating and using models of all types. Visit the Getting Started guide or, for more detailed examples, go straight to the Learn page.
