[Part 4] - How to subset records and add new columns in a dataframe in R using RStudio

[Part 4] - How to subset records and add new columns in a dataframe in R using RStudio

Slicing and dicing with data on a dataframe is a key skill in the data wrangling process. After reading this article you will be able to

  1. Select the columns that will be used in the analysis
  2. Add new columns to a dataset
  3. Choose or filter records from a dataset using predefined criteria.

We'll use the songs dataset for all illustrations. You can download the song dataset by clicking here.

#  read the dataset
Songs_DF <-  read.csv("Hindi_Songs.csv")

image.png

Subset columns

The dataset contains many columns, but we are only interested in a few of them, like Song title, Singer, and Views.

Subset_DF = Songs_DF[, c("Song.Title" , "Singer", "Views" )]

The dataframe argument is divided into two parts separated by a comma. The first section is used to filter rows or records, while the second section is used to select columns. Subset_Data.png

Create new columns

In most cases, raw data is not very useful for meeting business requirements. It requires transformation. In the Song dataset, for example, we might be interested in knowing views per like for all songs or total views in million because raw numbers are difficult to interpret. These columns must be computed.

1. Create new column views per like
Songs_DF['View_Per_Like'] <-  Songs_DF['Views']/Songs_DF['Likes']

View_per_llike.png

2. Create new column Views in million
Songs_DF['Views_in_Million'] <-  Songs_DF['Views']/1000000
3. Create a new column with Title in a capital case

In this case, we have to apply a function that converts the title of each row into the upper case. To do this we have to create a function and apply this function to each row of the title column

# Create a function 
to_uppper_case <- function(x){
  return(toupper(x))
}
# test this function
to_uppper_case ("this is small")

# lapply will apply the function to_uppper_case  to all elements 
# of Songs_DF["Song.Title"]

#Apply this function
Songs_DF['Title2'] <- lapply(Songs_DF["Song.Title"], to_uppper_case)

Filter records from the datasets

Apart from the creation or deletion of columns, transformation also requires filtering of records. Many time it happens that are a lot of unnecessary records which is not required for analysis. For example, if in the song dataset, If I only want to analyze the song of Kishore Kumar, then we take a subset of data with Kishore Kumar records. Let's see how we can do that in R


#Get index of records with Singer as Kishore Kumar 
index <- Songs_DF$Singer == "Kishore Kumar"

#use that index to create Subset_DF
Subset_DF <- Songs_DF[index, ]

#The blank after, indicates all columns should be selected
#Select Singer and View Columns
Subset_DF <- Songs_DF[index, c("Singer", "Views") ]

Summary of Learning

  1. Selection of columns
  2. Creation of new columns
  3. Filter records