Joining Multiple Datasets

In real-world data analysis, information is often stored in multiple datasets instead of a single table. For example, one dataset may contain employee details, while another dataset may contain department information. To perform meaningful analysis, these datasets need to be combined. This process is known as joining datasets.

The dplyr package provides several functions that make it easy to join multiple datasets based on a common column, often called a key. These functions allow users to merge data in different ways depending on the analysis requirements.

To begin, the dplyr package must be loaded into the R session.

library(dplyr)

Suppose we have two datasets. The first dataset, employees, contains employee information such as employee ID, name, and department ID. The second dataset, departments, contains department ID and department name.
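As a minimal sketch, the two datasets described above might look like the following. The column names employee_id, name, and department_name, and all of the values, are assumptions made for illustration. Only department_id is taken from the join calls in this tutorial. One employee points to a department that is not listed, and one department has no employees, so that the join types give visibly different results.

```r
library(dplyr)

# Hypothetical employee data: Dev's department_id (40) has no match in departments
employees <- data.frame(
  employee_id   = c(1, 2, 3, 4),
  name          = c("Asha", "Ben", "Chloe", "Dev"),
  department_id = c(10, 20, 10, 40)
)

# Hypothetical department data: department 30 has no employees
departments <- data.frame(
  department_id   = c(10, 20, 30),
  department_name = c("Research", "Quality", "Regulatory")
)

# The shared column name is the key used by the joins
intersect(names(employees), names(departments))
```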

To combine these datasets, dplyr provides several join functions. The most commonly used join functions are left_join(), right_join(), inner_join(), and full_join().

A left join keeps all the rows from the first dataset and adds matching data from the second dataset based on the common column. For example:

employees %>%
  left_join(departments, by = "department_id")

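This command keeps every employee and adds the department name wherever the department ID matches. Employees whose department ID does not appear in departments are still kept, with NA in the department columns.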
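To see this behaviour on concrete data, the sketch below runs the left join on two small illustrative data frames. The values and the columns other than department_id are assumed for the example.

```r
library(dplyr)

# Hypothetical sample data (values are illustrative only)
employees <- data.frame(
  employee_id   = c(1, 2, 3, 4),
  name          = c("Asha", "Ben", "Chloe", "Dev"),
  department_id = c(10, 20, 10, 40)
)
departments <- data.frame(
  department_id   = c(10, 20, 30),
  department_name = c("Research", "Quality", "Regulatory")
)

left_result <- employees %>%
  left_join(departments, by = "department_id")

# All four employees are kept, in their original order
nrow(left_result)

# Dev's department (40) has no match, so department_name is NA for that row
left_result$department_name
```

Department 30 does not appear in the result because no employee refers to it.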

An inner join keeps only the rows that have matching values in both datasets. This means only employees whose department ID appears in departments will be in the result, and departments with no employees are dropped as well.

employees %>%
  inner_join(departments, by = "department_id")
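A short sketch of the inner join on illustrative data follows. The sample values and the columns other than department_id are assumptions for the example.

```r
library(dplyr)

# Hypothetical sample data (values are illustrative only)
employees <- data.frame(
  employee_id   = c(1, 2, 3, 4),
  name          = c("Asha", "Ben", "Chloe", "Dev"),
  department_id = c(10, 20, 10, 40)
)
departments <- data.frame(
  department_id   = c(10, 20, 30),
  department_name = c("Research", "Quality", "Regulatory")
)

inner_result <- employees %>%
  inner_join(departments, by = "department_id")

# Only the three employees with a matching department remain
inner_result$name

# No NA values are introduced, because every kept row has a match
anyNA(inner_result)
```

Because unmatched rows disappear silently, comparing nrow(inner_result) with nrow(employees) is a quick way to check whether any records were lost.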

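A right join keeps all rows from the second dataset and adds matching rows from the first dataset. Here, every department is kept: a department with no employees appears with NA in the employee columns, and employees whose department ID is not in departments are dropped.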

employees %>%
  right_join(departments, by = "department_id")
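The sketch below runs the right join on illustrative data and compares it with a left join written with the datasets swapped. The sample values and the columns other than department_id are assumptions for the example.

```r
library(dplyr)

# Hypothetical sample data (values are illustrative only)
employees <- data.frame(
  employee_id   = c(1, 2, 3, 4),
  name          = c("Asha", "Ben", "Chloe", "Dev"),
  department_id = c(10, 20, 10, 40)
)
departments <- data.frame(
  department_id   = c(10, 20, 30),
  department_name = c("Research", "Quality", "Regulatory")
)

right_result <- employees %>%
  right_join(departments, by = "department_id")

# Every department is present, including 30, which has no employees
sort(unique(right_result$department_id))

# The row for department 30 has NA in the employee columns
sum(is.na(right_result$name))

# The same rows can be obtained with a left join in the other direction;
# the column order (and possibly the row order) differs
swapped_result <- departments %>%
  left_join(employees, by = "department_id")
```

Many analysts prefer to write the swapped left join, since it keeps the "main" table first in the pipeline. That is a matter of style rather than correctness.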

A full join keeps all rows from both datasets. Where a row has no match in the other dataset, the columns that would have come from that dataset are filled with NA.

employees %>%
  full_join(departments, by = "department_id")
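As a final sketch, the full join on illustrative data shows unmatched rows from both sides. The sample values and the columns other than department_id are assumptions for the example.

```r
library(dplyr)

# Hypothetical sample data (values are illustrative only)
employees <- data.frame(
  employee_id   = c(1, 2, 3, 4),
  name          = c("Asha", "Ben", "Chloe", "Dev"),
  department_id = c(10, 20, 10, 40)
)
departments <- data.frame(
  department_id   = c(10, 20, 30),
  department_name = c("Research", "Quality", "Regulatory")
)

full_result <- employees %>%
  full_join(departments, by = "department_id")

# Four employees plus one department without employees gives five rows
nrow(full_result)

# One row lacks a department name (Dev), one row lacks an employee (department 30)
sum(is.na(full_result$department_name))
sum(is.na(full_result$name))
```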

Joining datasets is an essential step in data preparation because it allows analysts to combine related information from different sources. Choosing the right join type matters: it determines which unmatched rows are kept and where NA values appear, which in turn affects the completeness and accuracy of any later analysis.