Tran Le
12/02/2020
The data are downloaded from https://www.kaggle.com/bistaumanga/usps-dataset
The dataset has 7291 train and 2007 test images. The images are 16x16 grayscale pixels.
We use the USPS data, a database of scanned images of handwritten digits from US Postal Service envelopes. The handwritten digits were fitted into a 16x16 rectangular box in a grayscale of 256 values, and each pixel of each image was then scaled to a value between -1 and 1.
The goal is to recognize the digits given their 16x16 pixel values.
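A minimal loading sketch, assuming the Kaggle download is the usps.h5 file with train/test groups holding data and target arrays (the file layout and the object names x_train, y_train, x_test, y_test are assumptions; adjust if the download differs):

```r
# Loading sketch (assumed layout of usps.h5: "train"/"test" groups with
# "data" and "target"); rhdf5 is one way to read HDF5 files in R.
library(rhdf5)
x_train <- h5read("usps.h5", "train/data")
y_train <- h5read("usps.h5", "train/target")
x_test  <- h5read("usps.h5", "test/data")
y_test  <- h5read("usps.h5", "test/target")
# Make sure rows are images (7291 x 256 and 2007 x 256); transpose if needed.
if (nrow(x_train) != 7291) x_train <- t(x_train)
if (nrow(x_test)  != 2007) x_test  <- t(x_test)
```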
Apply ML methods to the full data and to data reduced in dimension with PCA. Compare the accuracy and running time of those methods.
Investigate PCA applied to the data set.
Apply ML methods to the full and dimension-reduced data.
We choose to create two dimension-reduced data sets (a PCA sketch follows this list):
The first one uses 81 PCs, which explain more than 90% of the variance.
The second one uses 64 PCs, which explain more than 80% of the variance.
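A minimal sketch of this step, assuming x_train and x_test are the pixel matrices introduced above (the object names are assumptions):

```r
# PCA sketch: fit the rotation on the training pixels, choose the number of
# PCs by cumulative explained variance, then project both train and test.
pca     <- prcomp(x_train, center = TRUE, scale. = FALSE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n90 <- which(cum_var > 0.90)[1]   # 81 PCs according to this report
n80 <- which(cum_var > 0.80)[1]   # 64 PCs according to this report
train_pca81 <- predict(pca, x_train)[, 1:n90]
test_pca81  <- predict(pca, x_test)[, 1:n90]
train_pca64 <- predict(pca, x_train)[, 1:n80]
test_pca64  <- predict(pca, x_test)[, 1:n80]
```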
This is a supervised classification problem; we can use these methods to classify the 10 digit groups:
k nearest neighbors
Classification tree
Random Forest
Support Vector Machine
Neural network
Each method is applied before and after applying PCA, to compare the accuracy and the running time on the full and dimension-reduced data.
The procedure for each model (a kNN timing sketch follows these steps):
Fit the model using the training set.
Evaluate the accuracy on the test set.
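A minimal sketch of this fit/evaluate/time pattern, shown for kNN with the class package (k = 3 is an assumption; the report does not state which k was used). The other models below follow the same pattern:

```r
# Fit on the training set, predict the test set, and record accuracy and
# elapsed time (the same pattern is repeated for every model in the tables).
library(class)
t0   <- Sys.time()
pred <- knn(train = x_train, test = x_test, cl = factor(y_train), k = 3)
t1   <- Sys.time()
accuracy <- mean(as.character(pred) == as.character(y_test))
run_time <- t1 - t0   # a difftime, printed in secs/mins as in the tables
```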
Model_name | Accuracy_full | Run_time_full | Accuracy_PCA81 | Run_time_PCA81 | Accuracy_PCA64 | Run_time_PCA64 |
---|---|---|---|---|---|---|
knn | 0.9367215 | 20.21794 secs | 0.1913303 | 3.272247 secs | 0.1893373 | 0.1878426 secs |
Model_name | Accuracy_full | Run_time_full | Accuracy_PCA81 | Run_time_PCA81 | Accuracy_PCA64 | Run_time_PCA64 |
---|---|---|---|---|---|---|
rpart | 0.7249626 | 4.64187 secs | 0.7229461 | 3.357013 secs | 0.7229461 | 2.732178 secs |
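A hedged sketch of the classification tree with rpart at default settings, reusing the object names assumed above (the data-frame wrapping and column names are assumptions):

```r
# Classification tree (rpart, default controls). rpart uses a formula
# interface, so the pixel matrices are wrapped in data frames with
# matching column names.
library(rpart)
colnames(x_train) <- colnames(x_test) <- paste0("px", 1:ncol(x_train))
train_df <- data.frame(label = factor(y_train), x_train)
test_df  <- data.frame(x_test)
fit  <- rpart(label ~ ., data = train_df, method = "class")
pred <- predict(fit, newdata = test_df, type = "class")
mean(as.character(pred) == as.character(y_test))
```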
Model_name | Accuracy_full | Run_time_full | Accuracy_PCA81 | Run_time_PCA81 | Accuracy_PCA64 | Run_time_PCA64 |
---|---|---|---|---|---|---|
RandomForest | 0.9407075 | 1.536948 mins | 0.2705531 | 22.28762 secs | 0.245142 | 8.512539 secs |
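A hedged sketch of the random forest fit with the randomForest package at defaults (ntree = 500), using the assumed object names from above:

```r
# Random forest (randomForest, default ntree = 500), matrix interface.
library(randomForest)
fit  <- randomForest(x = x_train, y = factor(y_train))
pred <- predict(fit, newdata = x_test)
mean(as.character(pred) == as.character(y_test))
```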
Model_name | Accuracy_full | Run_time_full | Accuracy_PCA81 | Run_time_PCA81 | Accuracy_PCA64 | Run_time_PCA64 |
---|---|---|---|---|---|---|
SVM | 0.941704 | 35.33119 secs | 0.1908321 | 14.17465 secs | 0.1908321 | 12.27648 secs |
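A hedged sketch of the SVM fit with the e1071 package at defaults (radial kernel, default cost and gamma), again using the assumed object names:

```r
# SVM (e1071::svm; a factor response gives C-classification with a radial kernel).
library(e1071)
fit  <- svm(x = x_train, y = factor(y_train))
pred <- predict(fit, newdata = x_test)
mean(as.character(pred) == as.character(y_test))
```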
Run the neural network with one hidden layer (a sketch follows this description).
For the input layer, the full data gives 256 nodes, the PCA81 data gives 81 nodes and the PCA64 data gives 64 nodes.
The hidden layer keeps 90% of the number of nodes of the input layer.
The output layer has 10 nodes.
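A minimal sketch of such a single-hidden-layer network using the nnet package (the package choice, the iteration count, and MaxNWts are assumptions; the report does not state which implementation was used):

```r
# One hidden layer with 90% of the input nodes and a 10-node softmax output.
library(nnet)
hidden  <- round(0.9 * ncol(x_train))          # 230 for the full 256-pixel data
targets <- class.ind(factor(y_train))          # one-hot target matrix for softmax
fit  <- nnet(x_train, targets, size = hidden, softmax = TRUE,
             MaxNWts = 100000, maxit = 100, trace = FALSE)
prob <- predict(fit, x_test)                   # class-probability matrix
pred <- levels(factor(y_train))[max.col(prob)] # pick the most probable digit
mean(pred == as.character(y_test))
```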
Model_name | Accuracy_full | Run_time_full | Accuracy_PCA81 | Run_time_PCA81 | Accuracy_PCA64 | Run_time_PCA64 |
---|---|---|---|---|---|---|
NeuralNetwork | 0.9367215 | 17.57725 secs | 0.1923269 | 1.729314 secs | 0.1788739 | 1.759174 secs |
The tree method does not show much difference in accuracy between the full data and the dimension-reduced data.
The SVM and random forest are the winners in terms of accuracy.
The SVM is the winner in terms of both accuracy and running time.
Many of the methods use default arguments; further tuning could improve the accuracy.
The accuracy on the training set can be computed to analyze the properties of each method.
The number of hidden layers in the neural network and the proportion of nodes kept in each hidden layer can be changed to improve the accuracy.