Selecting and Filtering Data
Selecting and filtering data are two of the most common tasks performed during data analysis. Before analyzing or visualizing data, it is often necessary to focus only on the relevant columns and rows. The dplyr package provides simple and readable functions to perform these operations efficiently.
Selecting data refers to choosing specific columns from a dataset. This is useful when a dataset contains many variables, but only a few of them are needed for analysis. Filtering data refers to extracting only those rows that meet certain conditions, such as values greater than a threshold, matching categories, or falling within a range.
In dplyr, the select() function chooses columns, and the filter() function keeps the rows that satisfy one or more conditions. Both functions take a data frame as input and return a new data frame, leaving the original unchanged, which keeps the code clear and easy to read.
To begin, the dplyr package must be loaded into the R session.
library(dplyr)
Suppose we have a dataset called employees that contains the columns name, age, department, and salary.
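To make the examples reproducible, a small stand-in for employees can be built by hand. This is only a sketch: the four rows and all of their values are invented for illustration and are not taken from any real dataset.

```r
library(dplyr)

# Hypothetical sample data; names and values are made up for illustration.
employees <- data.frame(
  name       = c("Asha", "Ben", "Chen", "Dina"),
  age        = c(28, 35, 42, 31),
  department = c("Sales", "Sales", "HR", "IT"),
  salary     = c(42000, 55000, 61000, 58000)
)

# Inspect the column names and types before selecting or filtering.
str(employees)
```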
Selecting specific columns can be done using the select() function. For example, if only the name and salary columns are needed:
employees %>%
  select(name, salary)
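select() can also drop columns or take a contiguous range of them. The sketch below assumes a small made-up employees table (the values are hypothetical) and stores each result under an illustrative name so the original table can be compared afterwards.

```r
library(dplyr)

# Hypothetical sample data; values are made up for illustration.
employees <- data.frame(
  name       = c("Asha", "Ben", "Chen", "Dina"),
  age        = c(28, 35, 42, 31),
  department = c("Sales", "Sales", "HR", "IT"),
  salary     = c(42000, 55000, 61000, 58000)
)

# Keep only the named columns.
name_salary <- employees %>% select(name, salary)

# Drop a column by prefixing it with a minus sign.
without_age <- employees %>% select(-age)

# Keep a contiguous range of columns, from name through department.
first_three <- employees %>% select(name:department)

# select() returns a new data frame; employees still has all four columns.
names(employees)
```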
Filtering rows is done using the filter() function. For example, to see employees older than 30 years:
employees %>%
  filter(age > 30)
Multiple conditions can also be applied. Conditions separated by commas are combined with a logical AND, so the following returns only employees who are older than 30 and work in Sales:
employees %>%
  filter(age > 30, department == "Sales")
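Other kinds of conditions follow the same pattern. The sketch below, again using a made-up employees table with hypothetical values, shows an explicit AND, an OR, a match against several categories with %in%, and a range check with dplyr's between(), which includes both endpoints. The result names are illustrative only.

```r
library(dplyr)

# Hypothetical sample data; values are made up for illustration.
employees <- data.frame(
  name       = c("Asha", "Ben", "Chen", "Dina"),
  age        = c(28, 35, 42, 31),
  department = c("Sales", "Sales", "HR", "IT"),
  salary     = c(42000, 55000, 61000, 58000)
)

# AND written explicitly with `&`; equivalent to separating conditions with a comma.
sales_over_30 <- employees %>% filter(age > 30 & department == "Sales")

# OR: keep rows that satisfy at least one of the conditions.
over_40_or_sales <- employees %>% filter(age > 40 | department == "Sales")

# Matching categories: keep rows whose department is in the given set.
sales_or_hr <- employees %>% filter(department %in% c("Sales", "HR"))

# Range: keep salaries from 50000 to 60000, endpoints included.
mid_salary <- employees %>% filter(between(salary, 50000, 60000))
```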
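Selecting and filtering can also be combined in a single pipeline. Here filter() runs first, while the age column is still available, and select() then keeps only the name and salary columns:
employees %>%
  filter(age > 30) %>%
  select(name, salary)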
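The order of the two steps matters when the filter condition uses a column that select() would drop. The following sketch, based on a made-up employees table with hypothetical values, contrasts the two orders; tryCatch() is used only so that the failing pipeline does not stop the script, and the returned message is an illustrative placeholder.

```r
library(dplyr)

# Hypothetical sample data; values are made up for illustration.
employees <- data.frame(
  name       = c("Asha", "Ben", "Chen", "Dina"),
  age        = c(28, 35, 42, 31),
  department = c("Sales", "Sales", "HR", "IT"),
  salary     = c(42000, 55000, 61000, 58000)
)

# Works: age is still present when filter() runs.
older_pay <- employees %>%
  filter(age > 30) %>%
  select(name, salary)

# Fails: select() has already dropped age, so filter() cannot find it.
wrong_order <- tryCatch(
  employees %>%
    select(name, salary) %>%
    filter(age > 30),
  error = function(e) "error: age is no longer available"
)
```

Filtering first is usually the safer default whenever the condition refers to columns that are not needed in the final result.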
These operations are essential in data manipulation because they allow analysts to focus only on the relevant parts of the dataset. They are usually the first step in preparing data for deeper analysis, visualization, or modeling.
