Selecting and Filtering Data

Selecting and filtering are two of the most common tasks in data analysis. Before analyzing or visualizing data, it is often necessary to narrow the dataset to the relevant columns and rows. The dplyr package provides simple, readable functions that perform both operations efficiently.

Selecting data refers to choosing specific columns from a dataset. This is useful when a dataset contains many variables, but only a few of them are needed for analysis. Filtering data refers to extracting only those rows that meet certain conditions, such as values greater than a threshold, matching categories, or falling within a range.

In dplyr, the select() function chooses columns, and the filter() function keeps the rows that satisfy one or more conditions. Both take a data frame as their first argument and return a data frame, which keeps the code clear and easy to read.

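To begin, load the dplyr package into the R session. Loading it also makes the pipe operator %>% available, which passes the result on its left as the first argument of the function on its right.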

library(dplyr)

Suppose we have a dataset called employees that contains the columns name, age, department, and salary.
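To make the examples reproducible, a small employees data frame can be built by hand. This is only a sketch: the column names come from the description above, but every name and value in the rows is invented for illustration.

```r
library(dplyr)

# Hypothetical sample data: the rows below are invented for illustration only.
employees <- data.frame(
  name       = c("Asha", "Ben", "Carla", "Dev", "Elena"),
  age        = c(28, 35, 42, 31, 25),
  department = c("Sales", "Sales", "HR", "IT", "Sales"),
  salary     = c(42000, 55000, 61000, 58000, 39000)
)

# Inspect the column names and types before selecting or filtering
str(employees)
```

With a data frame like this in the session, the snippets on this page can be run as written.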

Selecting specific columns can be done using the select() function. For example, if only the name and salary columns are needed:

employees %>%
  select(name, salary)
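Besides naming the columns to keep, select() also accepts a minus sign to drop a column and a colon to keep a run of adjacent columns. The sketch below shows the three forms side by side, using an invented employees data frame that is an assumption of this example.

```r
library(dplyr)

# Hypothetical sample data, invented for illustration only.
employees <- data.frame(
  name       = c("Asha", "Ben", "Carla", "Dev", "Elena"),
  age        = c(28, 35, 42, 31, 25),
  department = c("Sales", "Sales", "HR", "IT", "Sales"),
  salary     = c(42000, 55000, 61000, 58000, 39000)
)

# Keep only the named columns
name_salary <- employees %>%
  select(name, salary)

# Drop one column and keep everything else
no_salary <- employees %>%
  select(-salary)

# Keep every column from name through department, in data frame order
name_to_dept <- employees %>%
  select(name:department)

names(name_salary)
names(no_salary)
names(name_to_dept)
```

In all three cases the number of rows is unchanged; select() only affects which columns are returned.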

Filtering rows is done using the filter() function. For example, to keep only the employees older than 30:

employees %>%
  filter(age > 30)

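Multiple conditions can also be applied. Conditions separated by commas are combined with a logical AND, so a row is kept only if it satisfies all of them. For example, employees older than 30 and working in Sales: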

employees %>%
  filter(age > 30, department == "Sales")
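The other kinds of condition mentioned earlier, matching categories and falling within a range, can be written in a similar way. The following is a minimal sketch using %in% for category matching, dplyr's between() for an inclusive range, and | for a logical OR. The employees data and the chosen thresholds are invented for illustration.

```r
library(dplyr)

# Hypothetical sample data, invented for illustration only.
employees <- data.frame(
  name       = c("Asha", "Ben", "Carla", "Dev", "Elena"),
  age        = c(28, 35, 42, 31, 25),
  department = c("Sales", "Sales", "HR", "IT", "Sales"),
  salary     = c(42000, 55000, 61000, 58000, 39000)
)

# Matching categories: keep rows whose department is one of several values
sales_or_it <- employees %>%
  filter(department %in% c("Sales", "IT"))

# Falling within a range: between() includes both endpoints
mid_salary <- employees %>%
  filter(between(salary, 40000, 60000))

# Either condition may hold: | means OR (a comma would mean AND)
young_or_hr <- employees %>%
  filter(age < 30 | department == "HR")

nrow(sales_or_it)  # 4
nrow(mid_salary)   # 3
nrow(young_or_hr)  # 3
```

Note that filter() only keeps rows where the condition evaluates to TRUE, so rows where it evaluates to NA are dropped along with the FALSE ones.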

Selecting and filtering can also be combined:

employees %>%
  filter(age > 30) %>%
  select(name, salary)

These operations are essential in data manipulation because they let analysts focus only on the relevant parts of the dataset. They are usually among the first steps in preparing data for deeper analysis, visualization, or modeling.