[Part 5] - Grouping and aggregation of data in R using RStudio

A summary of the data is usually the first thing a leader or senior stakeholder looks at. Likewise, when we first see a dataset, the first questions that come to mind are typically: what is the average value, how many records are there, and so on.

After reading this article, you will be able to

  1. Group data on some pre-defined criteria
  2. Calculate an aggregate value or summary based on the group

We'll use the songs dataset for all illustrations. You can download the song dataset by clicking here.

# load dplyr for %>%, group_by() and summarise()
library(dplyr)
# read the dataset
Songs_DF <- read.csv("Hindi_Songs.csv")

Group by

To aggregate data, the first step is to group the rows by some pre-defined criterion. The next step is to calculate a statistic for each group. If the column is numerical, the statistic can be the mean, sum, minimum, maximum, etc. If the column is non-numerical, the statistic can be a count, a count of unique values, etc.
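
To make the two steps concrete, here is a minimal sketch on a small made-up data frame. The `toy` data and its values are invented for illustration and are not taken from the songs dataset. It uses base R's `split()` and `sapply()` so that the grouping step and the aggregation step are visible separately.

```r
# Hypothetical toy data (not part of the songs dataset)
toy <- data.frame(
  Singer = c("A", "A", "B", "B", "B"),
  Views  = c(10, 30, 2, 4, 9),
  stringsAsFactors = FALSE
)

# Step 1 (group): split the Views column into one vector per singer
views_by_singer <- split(toy$Views, toy$Singer)

# Step 2 (aggregate): compute one statistic per group
avg_views <- sapply(views_by_singer, mean)
max_views <- sapply(views_by_singer, max)
n_songs   <- sapply(views_by_singer, length)

print(avg_views)
```

Each result is a named vector with one entry per singer, which is the same shape of answer that a grouped summary produces as a table.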

In this example, we'll group the data by singer and calculate the average, minimum, and maximum views for each singer. We'll also count the lead-actor entries for each singer (one per row) and the number of unique lead actors each singer has worked with.

Songs_DF %>%  
  group_by(Singer) %>%
  summarise(Avg_views = mean(Views), 
            Min_view =  min(Views), 
            Max_View = max(Views), 
            Actor_Count = length(Lead.Actor), 
            Unique_Actor = length(unique(Lead.Actor))) %>% View()

length(Lead.Actor) returns the number of Lead.Actor entries in each group, that is, the number of rows (songs) for that singer, while length(unique(Lead.Actor)) returns the number of distinct lead actors in that group.
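
As a small check of the difference between the two counts, the sketch below runs the same kind of summary on a made-up data frame (the `toy` data is hypothetical, not the songs dataset). It assumes the dplyr package is installed. It also shows dplyr's `n()` and `n_distinct()` helpers, which give the same results here as `length(x)` and `length(unique(x))`.

```r
library(dplyr)

# Hypothetical toy data: singer A has three songs with two distinct actors
toy <- data.frame(
  Singer     = c("A", "A", "A", "B"),
  Lead.Actor = c("X", "X", "Y", "Z"),
  stringsAsFactors = FALSE
)

counts <- toy %>%
  group_by(Singer) %>%
  summarise(Actor_Count    = length(Lead.Actor),         # rows per singer
            Unique_Actor   = length(unique(Lead.Actor)), # distinct actors
            Actor_Count_n  = n(),                        # same as length()
            Unique_Actor_n = n_distinct(Lead.Actor))     # same as length(unique())

print(counts)
```

For singer A the row count is 3 but the unique count is 2, because actor X appears twice.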

Output

[Screenshot: image.png, the resulting summary table with one row per singer]

Summary of Learning

  1. group_by()
  2. summarise()
  3. The %>% (pipe) operator