ANLY 501 Portfolio Decision Trees

Fall 2020, Bo Yang

Using R for Record Data

The YouTube dataset was used to build the decision trees. Because decision trees require labeled record data, several columns were selected to form the record dataset. The next step was to check the type of each column so that the later methods would run on correctly typed data.

After the dataset was formatted, it was split into a training set and a test set. The labels were removed from the test set and stored separately for later evaluation. Below is a look at the test set without labels.
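The split-and-hide-labels step can be sketched as follows. This is a minimal illustration in plain Python, not the R code used for this section. The toy rows, the label name "category", the helper name split_and_hide_labels, and the 30% test fraction are all assumptions made for the example.

```python
import random

def split_and_hide_labels(rows, label_key, test_fraction=0.3, seed=42):
    """Shuffle the rows, split them, and pull the labels out of the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)      # reproducible shuffle
    n_test = int(round(len(shuffled) * test_fraction))
    test_rows, train_rows = shuffled[:n_test], shuffled[n_test:]
    # Keep the true labels aside so predictions can be scored later.
    test_labels = [row[label_key] for row in test_rows]
    # The test set that the model sees has no label column.
    test_unlabeled = [
        {key: value for key, value in row.items() if key != label_key}
        for row in test_rows
    ]
    return train_rows, test_unlabeled, test_labels

# Made-up toy records (hypothetical values, not taken from the real dataset).
videos = [
    {"likes": 1200 + 100 * i, "views": 50000 + 5000 * i,
     "comment_count": 300 + 10 * i,
     "category": "Entertainment" if i % 2 == 0 else "Music"}
    for i in range(10)
]

train, test_unlabeled, test_labels = split_and_hide_labels(videos, "category")
print(len(train), len(test_unlabeled), test_labels)
```

Storing the labels in a separate list keeps them out of the prediction step while preserving row order, so they can be compared with the predictions to build the confusion matrix.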

Three decision trees with three different roots were created: likes, views, and comment count. The trees show that the dataset does not form a very good decision tree. However, the three trees share some common features: videos with more likes, views, and comments tend to be classified as entertainment videos, and the rest as music videos. Consistent with this, the confusion matrix attached below shows that most videos are typed as entertainment or music videos.








Using Python for Text Data

The YouTube dataset was also used to create the text data. The chosen columns were title, channel title, and tags. After the data was read in, a TF-IDF vectorizer was used to vectorize the text, and the result was stored in a labeled pandas DataFrame. After the data was split into testing and training sets at a ratio of 3:7, the confusion matrix was created and the top features, which were tags, were printed out.
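The pipeline described above can be sketched roughly as follows. This is a hedged illustration with scikit-learn and pandas, not the code in Assignment5.py. The toy texts, the label column name "LABEL", the entropy criterion, and the random seeds are assumptions for the example.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-ins for the combined title, channel title, and tags text.
texts = [
    "official music video pop song", "live music concert rock band",
    "new music album single release", "acoustic music cover guitar",
    "music lyrics video remix", "late night show celebrity interview",
    "comedy show funny sketch", "talk show host monologue",
    "reality show episode highlights", "award show red carpet",
]
labels = ["Music"] * 5 + ["Entertainment"] * 5

# Vectorize the text and store it in a labeled DataFrame.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)
# Order the terms by column index (works across scikit-learn versions).
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
df = pd.DataFrame(tfidf.toarray(), columns=terms)
df["LABEL"] = labels   # uppercase, so it cannot collide with a lowercased term

# Split testing and training sets at a 3:7 ratio.
X, y = df.drop(columns="LABEL"), df["LABEL"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(
    y_test, clf.predict(X_test), labels=["Entertainment", "Music"]
)
print(matrix)

# Terms ranked by how much they reduce impurity in the fitted tree.
top_terms = pd.Series(clf.feature_importances_, index=X.columns).nlargest(5)
print(top_terms)
```

On a corpus this small the tree usually needs only one or two terms to separate the classes, so most importances will be zero; the real dataset would give a longer ranking.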


For the visuals, plots showing entropy and a plot of the decision tree were created. The decision tree performed very poorly on this dataset, although pure child nodes were found directly below the root, since the entropy of both is zero.
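To show why zero entropy means a pure node, here is a small sketch of the Shannon entropy of the class labels in a node. The function name node_entropy and the example label lists are assumptions for illustration only.

```python
import math
from collections import Counter

def node_entropy(labels):
    """Shannon entropy (in bits) of the class labels that reach a tree node."""
    total = len(labels)
    probabilities = [count / total for count in Counter(labels).values()]
    # A class with probability 1 contributes 0, so a pure node has entropy 0.
    return sum(-p * math.log2(p) for p in probabilities)

pure_node = ["Music"] * 4                       # only one class present
even_node = ["Music", "Entertainment"] * 2      # 50/50 split
mixed_node = ["Music"] * 3 + ["Entertainment"]  # 75/25 split

for name, node in [("pure", pure_node), ("even", even_node), ("mixed", mixed_node)]:
    print(name, round(node_entropy(node), 4))
```

Entropy is highest when the classes are evenly mixed and drops to zero when every sample in the node has the same label, which is the situation described for the two child nodes.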




Below are the links to my complete datasets.
https://bellayang.georgetown.domains/textdata.csv
https://bellayang.georgetown.domains/Usvideos.csv

You can find all my code below.
https://bellayang.georgetown.domains/Assignment5.Rmd
https://bellayang.georgetown.domains/Assignment5.py