Tran Le
12/02/2020
The data are downloaded from https://www.kaggle.com/bistaumanga/usps-dataset
The dataset has 7291 train and 2007 test images. The images are 16x16 grayscale pixels.
We use the USPS data, a database of scanned images of handwritten digits from US Postal Service envelopes. The handwritten digits were fitted into a 16x16 rectangular box in a grayscale of 256 values, and each pixel of each image was then scaled to a value between -1 and 1.
The goal is to recognize the digits given their 16x16 pixel values.
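A minimal loading sketch, assuming the Kaggle download is the usps.h5 file with train/test groups holding data and target arrays (the file layout and the object names x_train, y_train, x_test, y_test are assumptions; adjust if the download differs):

```r
# Loading sketch (assumed layout of usps.h5: "train"/"test" groups with
# "data" and "target"); rhdf5 is one way to read HDF5 files in R.
library(rhdf5)
x_train <- h5read("usps.h5", "train/data")
y_train <- h5read("usps.h5", "train/target")
x_test  <- h5read("usps.h5", "test/data")
y_test  <- h5read("usps.h5", "test/target")
# Make sure rows are images (7291 x 256 and 2007 x 256); transpose if needed.
if (nrow(x_train) != 7291) x_train <- t(x_train)
if (nrow(x_test)  != 2007) x_test  <- t(x_test)
```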
Apply ML methods to the full data and to data reduced in dimension with PCA. Compare the accuracy and running time of those methods.
Investigate PCA applied to the data set.
Apply ML methods to the full and dimension-reduced data.
We choose to create two dimension-reduced data sets (a PCA sketch follows this list):
The first one uses 81 PCs, which explain more than 90% of the variance.
The second one uses 64 PCs, which explain more than 80% of the variance.
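A minimal sketch of this step, assuming x_train and x_test are the pixel matrices introduced above (the object names are assumptions):

```r
# PCA sketch: fit the rotation on the training pixels, choose the number of
# PCs by cumulative explained variance, then project both train and test.
pca     <- prcomp(x_train, center = TRUE, scale. = FALSE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n90 <- which(cum_var > 0.90)[1]   # 81 PCs according to this report
n80 <- which(cum_var > 0.80)[1]   # 64 PCs according to this report
train_pca81 <- predict(pca, x_train)[, 1:n90]
test_pca81  <- predict(pca, x_test)[, 1:n90]
train_pca64 <- predict(pca, x_train)[, 1:n80]
test_pca64  <- predict(pca, x_test)[, 1:n80]
```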
This is a supervised classification problem; we can use these methods to classify the 10 digit groups:
k nearest neighbors
Classification tree
Random Forest
Support Vector Machine
Neural network
Each method is applied before and after applying PCA, to compare the accuracy and the running time on the full and dimension-reduced data.
The procedure for each model (a kNN timing sketch follows these steps):
Fit the model using the training set.
Evaluate the accuracy on the test set.
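A minimal sketch of this fit/evaluate/time pattern, shown for kNN with the class package (k = 3 is an assumption; the report does not state which k was used). The other models below follow the same pattern:

```r
# Fit on the training set, predict the test set, and record accuracy and
# elapsed time (the same pattern is repeated for every model in the tables).
library(class)
t0   <- Sys.time()
pred <- knn(train = x_train, test = x_test, cl = factor(y_train), k = 3)
t1   <- Sys.time()
accuracy <- mean(as.character(pred) == as.character(y_test))
run_time <- t1 - t0   # a difftime, printed in secs/mins as in the tables
```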
Model_name | Accuracy_full | Run_time_full | Accuracy_PCA81 | Run_time_PCA81 | Accuracy_PCA64 | Run_time_PCA64 |
---|---|---|---|---|---|---|
knn | 0.9367215 | 20.21794 secs | 0.1913303 | 3.272247 secs | 0.1893373 | 0.1878426 secs |
Model_name | Accuracy_full | Run_time_full | Accuracy_PCA81 | Run_time_PCA81 | Accuracy_PCA64 | Run_time_PCA64 |
---|---|---|---|---|---|---|
rpart | 0.7249626 | 4.64187 secs | 0.7229461 | 3.357013 secs | 0.7229461 | 2.732178 secs |
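A hedged sketch of the classification tree with rpart at default settings, reusing the object names assumed above (the data-frame wrapping and column names are assumptions):

```r
# Classification tree (rpart, default controls). rpart uses a formula
# interface, so the pixel matrices are wrapped in data frames with
# matching column names.
library(rpart)
colnames(x_train) <- colnames(x_test) <- paste0("px", 1:ncol(x_train))
train_df <- data.frame(label = factor(y_train), x_train)
test_df  <- data.frame(x_test)
fit  <- rpart(label ~ ., data = train_df, method = "class")
pred <- predict(fit, newdata = test_df, type = "class")
mean(as.character(pred) == as.character(y_test))
```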
Model_name | Accuracy_full | Run_time_full | Accuracy_PCA81 | Run_time_PCA81 | Accuracy_PCA64 | Run_time_PCA64 |
---|---|---|---|---|---|---|
RandomForest | 0.9407075 | 1.536948 mins | 0.2705531 | 22.28762 secs | 0.245142 | 8.512539 secs |
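A hedged sketch of the random forest fit with the randomForest package at defaults (ntree = 500), using the assumed object names from above:

```r
# Random forest (randomForest, default ntree = 500), matrix interface.
library(randomForest)
fit  <- randomForest(x = x_train, y = factor(y_train))
pred <- predict(fit, newdata = x_test)
mean(as.character(pred) == as.character(y_test))
```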
Model_name | Accuracy_full | Run_time_full | Accuracy_PCA81 | Run_time_PCA81 | Accuracy_PCA64 | Run_time_PCA64 |
---|---|---|---|---|---|---|
SVM | 0.941704 | 35.33119 secs | 0.1908321 | 14.17465 secs | 0.1908321 | 12.27648 secs |
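A hedged sketch of the SVM fit with the e1071 package at defaults (radial kernel, default cost and gamma), again using the assumed object names:

```r
# SVM (e1071::svm; a factor response gives C-classification with a radial kernel).
library(e1071)
fit  <- svm(x = x_train, y = factor(y_train))
pred <- predict(fit, newdata = x_test)
mean(as.character(pred) == as.character(y_test))
```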
Run the neural network with one hidden layer (a sketch follows this description).
For the input layer, the full data gives 256 nodes, the PCA81 data gives 81 nodes and the PCA64 data gives 64 nodes.
The hidden layer keeps 90% of the number of nodes of the input layer.
The output layer has 10 nodes.
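A minimal sketch of such a single-hidden-layer network using the nnet package (the package choice, the iteration count, and MaxNWts are assumptions; the report does not state which implementation was used):

```r
# One hidden layer with 90% of the input nodes and a 10-node softmax output.
library(nnet)
hidden  <- round(0.9 * ncol(x_train))          # 230 for the full 256-pixel data
targets <- class.ind(factor(y_train))          # one-hot target matrix for softmax
fit  <- nnet(x_train, targets, size = hidden, softmax = TRUE,
             MaxNWts = 100000, maxit = 100, trace = FALSE)
prob <- predict(fit, x_test)                   # class-probability matrix
pred <- levels(factor(y_train))[max.col(prob)] # pick the most probable digit
mean(pred == as.character(y_test))
```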
Model_name | Accuracy_full | Run_time_full | Accuracy_PCA81 | Run_time_PCA81 | Accuracy_PCA64 | Run_time_PCA64 |
---|---|---|---|---|---|---|
NeuralNetwork | 0.9367215 | 17.57725 secs | 0.1923269 | 1.729314 secs | 0.1788739 | 1.759174 secs |
The tree method does not show much difference in accuracy between the full data and the dimension-reduced data.
The SVM and random forest are the winners in terms of accuracy.
The SVM is the winner in terms of both accuracy and running time.
Many of the methods use default arguments; further tuning could improve the accuracy.
The accuracy on the training set can be computed to analyze the properties of each method.
The number of hidden layers in the neural network and the proportion of nodes kept in each hidden layer can be changed to improve the accuracy.