ANLY 501 Portfolio: Naive Bayes and Support Vector Machine

Fall 2020, Bo Yang

Naive Bayes and SVM Using Python

Text Data

For my text data, I selected several columns from my raw YouTube dataset: tags, title, and channel title. I saved each observation in each column as a text file, producing three folders called tags, title, and channel title; that is my final corpus. After reading in the data, I vectorized the text along with the labels, then split it into training and testing sets with a 1:3 ratio. Below is a plot of my actual labels versus the labels predicted by Naive Bayes. From the plot, you can see that the predictions are actually not bad: for title and channel title, the predictions do really well, while for tags there are several wrong predictions. However, the overall results are pretty good.
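The pipeline described above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not my exact code: the toy documents and labels below are hypothetical stand-ins for the real tags/title/channel-title files, and I assume a simple bag-of-words count vectorizer.

```python
# Minimal sketch: vectorize a small toy corpus, split 1:3 (25% test),
# and fit a multinomial Naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for the real corpus files and labels
docs = [
    "funny cat video", "cute cat compilation", "music live concert",
    "official music video", "game play walkthrough", "speedrun game tips",
    "cat plays piano", "concert highlights music",
]
labels = ["pets", "pets", "music", "music", "gaming", "gaming", "pets", "music"]

vectorizer = CountVectorizer()          # bag-of-words counts
X = vectorizer.fit_transform(docs)      # document-term matrix

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0)  # 1:3 test:train ratio

nb = MultinomialNB()
nb.fit(X_train, y_train)
pred = nb.predict(X_test)
print(pred)
```

On the real corpus, the same steps apply with the text files read from the three folders in place of `docs`.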


I also printed out the confusion matrix comparing the NB predictions against the actual labels to get a sense of the results.
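A confusion-matrix check like the one above can be sketched as follows; the actual/predicted vectors here are small hypothetical examples in place of the real ones.

```python
# Sketch of printing a confusion matrix for predicted vs. actual labels.
from sklearn.metrics import confusion_matrix

# Hypothetical stand-ins for the real label vectors
actual    = ["pets", "music", "gaming", "pets", "music", "gaming"]
predicted = ["pets", "music", "gaming", "music", "music", "gaming"]

label_order = ["gaming", "music", "pets"]
cm = confusion_matrix(actual, predicted, labels=label_order)
print(cm)  # rows = actual class, columns = predicted class
```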


For the SVM on my text data, I ran the SVM twice, with C = 1 and C = 10. The two results differed in only one prediction among the 451 test observations. Below are the confusion matrices for these two predictions along with the actual labels. If you would like to see the prediction labels themselves, you can download my code at the bottom of this page, since the prediction label vectors are quite large.
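Comparing the two regularization settings can be sketched like this; the random toy data stands in for the real vectorized text, and I assume a linear kernel as in the rest of the write-up.

```python
# Sketch: fit the same linear SVM with C = 1 and C = 10 and count how
# many test predictions differ between the two models.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_train = rng.rand(60, 5)                       # toy training features
y_train = (X_train[:, 0] > 0.5).astype(int)     # toy binary labels
X_test = rng.rand(20, 5)                        # toy test features

pred_c1 = SVC(kernel="linear", C=1).fit(X_train, y_train).predict(X_test)
pred_c10 = SVC(kernel="linear", C=10).fit(X_train, y_train).predict(X_test)

n_diff = int((pred_c1 != pred_c10).sum())
print("predictions that differ:", n_diff)
```

A small `n_diff` (one disagreement out of 451 in my case) suggests the model is not very sensitive to C on this data.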

Record Data

For my record data, I used my labeled YouTube dataset, which contains dislikes, views, and comment_count, all numeric columns. Below is a quick overview of what the dataset looks like.

After reading in the data, I split it with the same 1:3 ratio as my text data. After running Naive Bayes, I printed out the actual labels and the corresponding confusion matrix. I also included a prediction probability matrix in my code, and from that matrix I can tell that the prediction works pretty well. Moreover, I renamed my labels as 0, 1, and 2 and plotted a scatter plot of the PCA results.
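The record-data steps can be sketched as below. This is an illustration under assumptions: random numbers stand in for the real dislikes/views/comment_count columns, and I use Gaussian Naive Bayes, a natural choice for continuous numeric features (the write-up does not name the exact NB variant).

```python
# Sketch: 1:3 split, Gaussian Naive Bayes with prediction probabilities,
# and a 2-D PCA projection for a scatter plot.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(120, 3)               # stand-in for dislikes, views, comment_count
y = rng.randint(0, 3, size=120)    # labels renamed to 0, 1, 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)       # 1:3 test:train ratio

nb = GaussianNB().fit(X_train, y_train)
proba = nb.predict_proba(X_test)   # prediction probability matrix
print(proba.shape)

coords = PCA(n_components=2).fit_transform(X)   # 2-D points for the scatter plot
print(coords.shape)
```

Each row of `proba` gives the class probabilities for one test observation, which is what lets you judge how confident the predictions are.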



For SVM, I ran my code with three kernels: linear, polynomial, and rbf. From the confusion matrices, you can see that there is no significant difference between the three kernels.
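Looping over the three kernels and collecting a confusion matrix for each can be sketched as follows, with toy numeric data in place of the real record dataset.

```python
# Sketch: compare linear, polynomial, and rbf SVM kernels by their
# confusion matrices on the same toy train/test split.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

rng = np.random.RandomState(1)
X_train, y_train = rng.rand(90, 3), rng.randint(0, 3, size=90)
X_test, y_test = rng.rand(30, 3), rng.randint(0, 3, size=30)

cms = {}
for kernel in ("linear", "poly", "rbf"):
    pred = SVC(kernel=kernel).fit(X_train, y_train).predict(X_test)
    cms[kernel] = confusion_matrix(y_test, pred)
    print(kernel)
    print(cms[kernel])
```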



Naive Bayes and SVM Using R

Record Data

The record dataset is the same one I used in Python, so I will not describe it again. However, below are several screenshots of the dataset and the training set as a reminder.



After running Naive Bayes with category_id as the target and the other three columns as independent variables, the graph below shows the results. From the table of actual labels versus predicted labels, I can tell that the result is not as good as what my Python NB gave me. I also included a histogram showing my prediction results.




I also ran the SVM with a linear kernel and plotted the results. After plotting, I calculated the misclassification rate, which is 0.47. That is quite a bit higher than my Python result, so I think the model still needs some improvement.
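The misclassification rate above is just the share of test predictions that disagree with the actual labels. A quick sketch of the calculation (in Python, with hypothetical label vectors in place of the real R results):

```python
# Sketch: misclassification rate = fraction of predictions that
# disagree with the actual labels. Toy vectors stand in for real output.
actual    = [0, 1, 2, 0, 1, 2, 0, 1]
predicted = [0, 2, 2, 1, 1, 0, 2, 1]

errors = sum(a != p for a, p in zip(actual, predicted))
rate = errors / len(actual)
print(rate)  # 4 mismatches out of 8 -> 0.5
```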

Text Data

For the text corpus, I selected the tags corpus as my text data. After vectorizing the data and converting it into document-term matrix (DTM) format, I computed the word frequencies for this corpus.

In order to run Naive Bayes and SVM, I normalized my dataset. Below you can see the original data matrix and the normalized data matrix. After splitting into training and testing sets, I ran Naive Bayes and SVM with a linear kernel. The prediction results didn't work as well as in Python, but you can still see that "love" is the most popular word across all tags.
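One common way to normalize a document-term matrix is to divide each row by its total so the counts become relative frequencies; the exact normalization in my R code may differ, but the idea can be sketched (in Python) as:

```python
# Sketch: row-normalize a toy document-term matrix so each document's
# term counts become relative frequencies summing to 1.
import numpy as np

dtm = np.array([[2, 1, 1],
                [0, 3, 1],
                [1, 1, 2]], dtype=float)   # toy counts: 3 docs x 3 terms

row_sums = dtm.sum(axis=1, keepdims=True)
normalized = dtm / row_sums
print(normalized)
```

Normalizing this way keeps long and short documents comparable, which matters for both Naive Bayes and SVM.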






Below are the links to my complete datasets.
https://bellayang.georgetown.domains/recorddata.csv
Corpus Datasets

You can find all my code below.
https://bellayang.georgetown.domains/Assignment6.Rmd
https://bellayang.georgetown.domains/Assignment6.py