Grouping and Summarizing Data
In data analysis, it is often necessary to calculate summary statistics such as totals, averages, counts, or maximum and minimum values. These summaries become more meaningful when they are calculated for specific groups within the data. The dplyr package provides simple and efficient functions to group data and compute summary statistics.
Grouping data means dividing the dataset into categories based on one or more variables. After grouping, summary calculations can be performed separately for each group. In dplyr, the group_by() function is used to create groups, and the summarise() function is used to compute summary statistics.
To begin, the dplyr package must be loaded into the R session.
library(dplyr)
Suppose we have a dataset called employees that contains the columns name, department, age, and salary.
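To make the examples concrete, here is a minimal sketch of what such a dataset could look like. The names and figures are invented purely for illustration; only the four column names come from the description above.

```r
library(dplyr)

# Hypothetical sample data: every value here is made up for illustration
employees <- tibble(
  name       = c("Asha", "Ben", "Chen", "Dina", "Eli", "Farah"),
  department = c("Sales", "Sales", "IT", "IT", "IT", "HR"),
  age        = c(29, 41, 35, 52, 26, 38),
  salary     = c(42000, 51000, 60000, 75000, 48000, 45000)
)

# Quick look at the column names and types
glimpse(employees)
```

Any data frame with these columns would work the same way, whether it is built by hand like this or read from a file.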
If we want to calculate the average salary of all employees, we can use the summarise() function:
employees %>%
  summarise(average_salary = mean(salary))
This command calculates the average value of the salary column for the entire dataset.
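As a small worked sketch, assuming a made-up employees table (the values are hypothetical), the ungrouped summary collapses the whole dataset into a single row:

```r
library(dplyr)

# Hypothetical sample data, invented for illustration
employees <- tibble(
  name       = c("Asha", "Ben", "Chen", "Dina", "Eli", "Farah"),
  department = c("Sales", "Sales", "IT", "IT", "IT", "HR"),
  age        = c(29, 41, 35, 52, 26, 38),
  salary     = c(42000, 51000, 60000, 75000, 48000, 45000)
)

# Without group_by(), summarise() returns one row for the whole table
overall <- employees %>%
  summarise(average_salary = mean(salary))

print(overall)
```

With these invented figures the single value is 53500. If the salary column could contain missing values, mean(salary, na.rm = TRUE) ignores them; without na.rm = TRUE the result would be NA.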
If we want to calculate the average salary for each department, we first group the data by the department column and then apply summarise():
employees %>%
  group_by(department) %>%
  summarise(average_salary = mean(salary))
This code splits the dataset into one group per department and calculates the average salary separately for each group, so the result contains one row per department.
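Grouping by more than one variable works the same way: group_by() accepts several columns, and summarise() then returns one row per combination. The sketch below assumes made-up data and a helper column, age_band, which is not part of the original dataset and is created only for this example. The .groups = "drop" argument (available in dplyr 1.0.0 and later) returns an ungrouped result.

```r
library(dplyr)

# Hypothetical sample data, invented for illustration
employees <- tibble(
  name       = c("Asha", "Ben", "Chen", "Dina", "Eli", "Farah"),
  department = c("Sales", "Sales", "IT", "IT", "IT", "HR"),
  age        = c(29, 41, 35, 52, 26, 38),
  salary     = c(42000, 51000, 60000, 75000, 48000, 45000)
)

by_dept_and_age <- employees %>%
  # age_band is a helper column created only for this example
  mutate(age_band = if_else(age < 40, "under 40", "40 and over")) %>%
  # One group per (department, age_band) combination present in the data
  group_by(department, age_band) %>%
  summarise(average_salary = mean(salary), .groups = "drop")

print(by_dept_and_age)
```

Only combinations that actually occur in the data appear in the result, so this toy table yields five rows rather than every possible pairing of department and age band.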
Multiple summary calculations can also be performed at the same time. For example, we can calculate the average salary, total salary, and number of employees in each department:
employees %>%
  group_by(department) %>%
  summarise(
    average_salary = mean(salary),
    total_salary = sum(salary),
    employee_count = n()
  )
The n() function takes no arguments and returns the number of rows in the current group, so employee_count shows how many records belong to each department.
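When the row count is the only summary needed, dplyr also offers count() as a shorthand for grouping and then calling n(). The sketch below, using made-up data, compares the two forms; the variable names counts_explicit and counts_short are chosen here only for illustration.

```r
library(dplyr)

# Hypothetical sample data, invented for illustration
employees <- tibble(
  name       = c("Asha", "Ben", "Chen", "Dina", "Eli", "Farah"),
  department = c("Sales", "Sales", "IT", "IT", "IT", "HR")
)

# Explicit form: group the rows, then count them with n()
counts_explicit <- employees %>%
  group_by(department) %>%
  summarise(n = n())

# Shorthand: count() groups, counts, and ungroups in one step
counts_short <- employees %>%
  count(department)

print(counts_short)
```

Both forms give one row per department with the count in a column named n. The explicit form is the better fit when other summaries, such as mean(salary), are computed in the same step.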
Grouping and summarizing data are essential steps in data analysis. They help transform raw data into meaningful insights by showing patterns, trends, and comparisons across different groups within the dataset.
