ANLY 501 Portfolio: Data Cleaning

Fall 2020, Bo Yang

Cleaning Data Using R

The YouTube data was cleaned using R, and the Bilibili data was cleaned using Python. The first step was reading in the data: the raw dataset has 16 columns and 40,949 rows. Next, the column names were printed to identify any columns whose distributions would be interesting to examine.
Since the project aims to find which type of video gets the most views and likes, histograms of those two columns were plotted to check whether they are normally distributed. Because the raw counts of views and likes span several orders of magnitude, the log of the values was taken before plotting to get a clearer visual. Below are the histograms of views and likes.
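The log-transform-then-bin step can be sketched in Python (the original work used R, but the idea is identical). The column names `views` and `likes` and the toy values here are illustrative, not taken from the real dataset:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the trending-video data; real values are much larger
df = pd.DataFrame({"views": [1200, 56000, 3400000, 89000],
                   "likes": [30, 1500, 120000, 2100]})

# Log-transform before binning so heavy right skew doesn't crush the histogram
log_views = np.log(df["views"])
counts, edges = np.histogram(log_views, bins=5)
```

In R the equivalent would be a call like `hist(log(df$views))`; plotting `counts` against `edges` (e.g., with matplotlib) reproduces the figures shown above.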


Checking for NA values is always important in data cleaning, so a for loop was used to iterate through all the columns of the dataset and report any NAs found in each. Fortunately, this dataset has no NA values at all.
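The column-by-column NA check reads like this in Python (the original was an R for loop over columns); the toy frame and its column names are illustrative:

```python
import pandas as pd

# Toy frame standing in for the YouTube data
df = pd.DataFrame({"views": [100, 250, 90], "likes": [5, 12, 3]})

# Loop over every column and report how many NA values it contains
na_counts = {col: int(df[col].isna().sum()) for col in df.columns}
for col, n in na_counts.items():
    print(f"{col}: {n} missing")
```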

In the original dataset, the category_id column contains only numeric values representing the ID of each category, which makes it hard to tell which category a video belongs to. Using the category chart from the API website, every category ID was therefore replaced with its category name.
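The ID-to-name replacement amounts to a dictionary lookup. The sketch below uses a small illustrative subset of the mapping (the full chart comes from the YouTube categories API):

```python
import pandas as pd

# Illustrative subset of the id-to-name chart from the categories API
category_map = {10: "Music", 24: "Entertainment", 25: "News & Politics"}

df = pd.DataFrame({"category_id": [10, 24, 25, 24]})

# Replace numeric ids with readable category names
df["category"] = df["category_id"].map(category_map)
```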

I also wanted to know what percentage each category contributes to these trending videos, so I made a pie chart of the category counts.
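The percentages behind the pie chart are just normalized category counts; a minimal sketch with made-up categories:

```python
import pandas as pd

# Toy category column; the real one comes from the mapped category names
df = pd.DataFrame({"category": ["Music", "Music", "Entertainment", "Comedy"]})

# Share of each category as a percentage of all trending videos
shares = df["category"].value_counts(normalize=True) * 100
```

Calling `shares.plot.pie()` (with matplotlib available) would then draw the chart.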

I don't need the video id, title, tags, publish time, link, and description columns, so I removed them. The dataset also has three columns flagging whether a video has comments disabled, ratings disabled, or has been removed; I dropped every row where any of those flags is true, and then removed the three flag columns themselves. Last but not least, since there are too many observations, I decided to narrow the data down and analyze only the 2018 videos.
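These filtering steps can be sketched as follows (the original was done in R). The column names such as `comments_disabled` and the date format are assumptions based on the description, and the toy rows are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "video_id": ["a", "b", "c"],
    "trending_date": ["17.14.11", "18.02.03", "18.09.20"],
    "comments_disabled": [False, True, False],
    "ratings_disabled": [False, False, False],
    "video_error_or_removed": [False, False, False],
    "views": [100, 200, 300],
})

# Drop rows where any of the three flags is set
keep = ~(df["comments_disabled"] | df["ratings_disabled"]
         | df["video_error_or_removed"])
df = df[keep]

# Remove the flag columns and the id column, then keep only 2018 rows
df = df.drop(columns=["comments_disabled", "ratings_disabled",
                      "video_error_or_removed", "video_id"])
df = df[df["trending_date"].str.startswith("18.")]
```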
That completes the initial data cleaning for my YouTube dataset. I redrew the pie chart and the histograms after the cleaning process. I began with 40,949 rows and 16 columns, and ended with 30,381 rows and 5 columns.





Cleaning Data in Python

I cleaned my Bilibili data using Python. I first read in the txt file and used pandas to turn it into a data frame. Printing the data frame's info showed 12 columns and 2,858,844 rows. I then printed the column names to look for columns that should be removed, and decided to drop video id, age, and uploaderID. As with the YouTube data, I checked for NA values and removed all rows containing them; because each video is unique, replacing missing values with the mean or median would not make sense. Finally, to keep the Bilibili dataset on the same level as the YouTube dataset, I filtered the video type to only video uploads and dropped the unknown videos.
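The Bilibili steps can be sketched like this; the column names (`videoId`, `type`, `title`), the type label `"video upload"`, and the toy rows are assumptions for illustration, and the real read would be something like `pd.read_csv("videos.txt", sep="\t")` with whatever separator the file actually uses:

```python
import pandas as pd

# Toy stand-in for the frame built from videos.txt
df = pd.DataFrame({
    "videoId": [1, 2, 3, 4],
    "type": ["video upload", "unknown", "video upload", "video upload"],
    "title": ["a", "b", "c", None],
})

df = df.drop(columns=["videoId"])        # drop the unhelpful id column
df = df.dropna()                         # videos are unique, so drop NA rows outright
df = df[df["type"] == "video upload"]    # keep only ordinary uploads
```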


With the cleaning process nearly finished, I made a bar plot to check whether the cleaning was sufficient: I aggregated the videos by type, counted the number in each type, and plotted a bar graph.
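The aggregation behind the bar graph is a simple per-type count; again the type labels here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"type": ["video upload", "video upload", "video upload",
                            "repost", "repost"]})

# Number of videos in each type, sorted descending
counts = df["type"].value_counts()
```

Calling `counts.plot.bar()` (with matplotlib available) would render the bar graph described above.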


Below are the links to my complete datasets.
https://bellayang.georgetown.domains/USvideos.csv
https://bellayang.georgetown.domains/videos.txt

You can find all my code below.
https://bellayang.georgetown.domains/Assignment2.py
https://bellayang.georgetown.domains/Assignment2R.Rmd