Zero to Kaggle in 30 Minutes

June 24th, 2015

If you are not already familiar with it, Kaggle is a data science competition platform and community. Companies and researchers provide their datasets in hopes that the competing contestants will produce robust and accurate models that can be integrated into their business or research operations. In this post, we’ll walk through the steps for competing in Kaggle’s “Digit Recognizer” contest using SQL-based machine learning tools to identify hand-written digits.

MNIST database numbers

The Digit Recognizer contest makes use of the well-known MNIST database of handwritten digits, collected from American Census Bureau employees and high school students. The machine learning community has long used this dataset as a standard training set and benchmark for improving digit recognizers. Use cases for improved digit recognizers include — perhaps most notably — U.S. Postal Service mail sorting by ZIP code. If sorting machines with digit recognizer software were widely deployed, the potential savings could be more than $1 million a day: the USPS processes more than 21 million pieces of mail each hour, and 10,000 manual sorters at $15 per hour amount to $150,000 in labor for every hour of sorting.

The Tools

SQL (Structured Query Language) has been the lingua franca of data querying since the 1980s. It is an expressive, declarative language well suited to analytics tasks, backed by decades of database tooling, and still the interface most software applications choose to expose. However, petabyte-scale NoSQL datastores (notably those built on the Hadoop ecosystem) use other programming paradigms (e.g., MapReduce) to execute queries. While these systems scale easily, their queries are harder to develop and comprehend. What if you could write queries in standard SQL but tap the scalability and capacity of Hadoop?

Pivotal HAWQ is a SQL engine that sits on top of Hadoop and HDFS, allowing complex SQL queries to run in massively parallel fashion. This puts it in stark contrast to traditional SQL warehouses, in which the backing storage consists of local file systems tied to the query engine, and scaling efforts involve master-slave replication or sharding schemes—often resulting in solutions that are hard to maintain or evolve. HAWQ gives you the best of both worlds: SQL on the frontend and massive scale on the backend. It’s like the mullet of data querying.

Pivotal Hawq

The MNIST dataset is only about 100 MB: tiny by Hadoop standards. It’s easy to forget that HAWQ and other tools like HBase comfortably scale to many terabytes and beyond. HAWQ takes care of storing the data in distributed fashion quietly behind the scenes, exposing a familiar SQL interface to the user.

MADlib

Identifying written characters is often approached as a machine learning problem: a classifier is trained on a sample of data (e.g., my handwritten “8”) and used to classify new data (e.g., your handwritten “8”). MADlib is an analytics package that lets HAWQ users perform many of the statistical and machine learning techniques normally done in R or Python directly in SQL; image classification is one common example. The MADlib approach stands in contrast to a “typical” model-building workflow, in which data is exported from the database to an external machine learning environment for model training and the results are imported back.

The Plan

We will use MADlib on HAWQ to build a k-means clustering model on the Digit Recognizer data and submit our predictions to Kaggle to see how accurately we identify the digits. (Unlike some other contests, the score in this contest is based purely on the provided test data.)

At a high level, we will:

  • Create a “training_data” table and load the training data into it.
  • Create a “test_data” table and load into it all of the test data plus a portion of the labeled training data. Because the cluster IDs produced by k-means are arbitrary, these labeled rows are what let us map each cluster back to a digit once clustering is complete.
  • Run k-means clustering with a MADlib SQL query on the training table to compute the cluster centroids. Each cluster then takes the label of the digit that dominates its membership, e.g., a cluster made up mostly of “9”s is labeled “9”.
  • Assign each unlabeled test row to the nearest centroid (from the training step above) and adopt that cluster’s label.

The instructions below are aided by the scripts in my github repo. The contest’s data page provides two .csv files: the training data (about 74 MB) and the test data (about 49 MB).

Import the training dataset into HAWQ

Here’s what the training dataset from Kaggle looks like (everything before the $ is our Bash shell prompt):

[gpadmin@slave1 demo]$ head -1 1-train.csv
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,188...0,0

Convert the training dataset into a form that is easier to import into a table: fields delimited by “|”, with the pixel values enclosed in curly braces so they load as an array. Script 2 produces the following file, where field 1 is the digit label and field 2 holds the pixel values (a sketch of the conversion follows the sample below):

[gpadmin@slave1 demo]$ head -1 1-psql_loadable.txt
1|{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,188...0,0}
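
The conversion itself takes only a few lines of code. Here is a minimal Python sketch of what Script 2 does (an approximation based on the listings above, assuming the Kaggle header row has already been stripped; it is not the repo's exact code):

# Approximate sketch of the Script 2 conversion; see the github repo for the real script.
# Assumes the Kaggle header row has already been removed from 1-train.csv.
with open('1-train.csv') as src, open('1-psql_loadable.txt', 'w') as dst:
    for line in src:
        label, _, pixels = line.rstrip('\n').partition(',')
        # field 1: the digit label; field 2: the 784 pixel values wrapped in
        # curly braces so that COPY can load them as a smallint[] array
        dst.write('%s|{%s}\n' % (label, pixels))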

Create “training_data” table in HAWQ to store training data:

gpadmin=# create table training_data (label smallint, val smallint[]);
gpadmin=# \d training_data;
Append-Only Table "public.training_data"
 Column |    Type    | Modifiers
--------+------------+-----------
 label  | smallint   |
 val    | smallint[] |
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: f
Distributed by: (label)

Insert training data:

copy training_data from '/home/gpadmin/demo/1-psql_loadable.txt' DELIMITER '|' CSV;

Verify 42000 rows of training data inserted:

gpadmin=# select count(*) from training_data;
 count
-------
 42000
(1 row)

Import the test dataset into HAWQ

We now create the test_data table using script 7. In addition to a unique test_id column identifying each digit image, the table also has a “label” column, so that the labeled training rows we add can be used to map clusters back to digits, as described in the plan above.

Original test dataset in .csv format from Kaggle:

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,17,17,17,17,81,180,180...0,0

Here is the test dataset converted for psql loading. ‘|’ is again the delimiter; the first field is the label, set to -1 to mark the row as test data, and the second field is the test_id.

-1|1|{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,17,17,17,17,81,180,180...0,0}

We also prepare some labeled data to add to the test_data table: 20000 rows from the training dataset, converted the same way, which will be used to label the clusters after the fitting is done. For these rows the first field carries the actual digit and the test_id is set to -1:

1|-1|{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,188,255,94,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,191,250,253,93,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,123,248,253,167...0,0}

Insert the test dataset and the labeled data into the test table, beginning with the test_data table definition:

gpadmin=# create table test_data (label smallint, test_id smallint, val smallint[]);
gpadmin=# copy test_data from '/home/gpadmin/demo/5-psql_loadable_test.txt' DELIMITER '|' CSV;
gpadmin=# copy test_data from '/home/gpadmin/demo/1-test_label.txt' DELIMITER '|' CSV;

gpadmin=# \d test_data;
Append-Only Table "public.test_data"
 Column  |    Type    | Modifiers
---------+------------+-----------
 label   | smallint   |
 test_id | smallint   |
 val     | smallint[] |
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: f
Distributed by: (label)

Query the test_data table to make sure all 28000 test images were imported:

gpadmin=# select count(*) from test_data where label = '-1';
 count
-------
 28000
(1 row)

Also verify that the 20000 labeled training rows made it into the test_data table:

gpadmin=# select count(*) from test_data where test_id = '-1';
 count
-------
 20000
(1 row)

Conduct k-means clustering

Before we run MADlib’s k-means clustering query, here’s a quick overview of how it works.

K-means clustering groups instances into k clusters (k is specified by the user) based on their feature similarity. In this case the features are simply the pixel values of each 28×28 image, but they could also be derived quantities (e.g., the ratio of white to black pixels). The algorithm proceeds iteratively: in each step, every item is assigned to its nearest cluster “mean,” or centroid, by Euclidean distance, and then each cluster’s mean is recalculated from the points assigned to it. The process terminates when the assignments no longer change.
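
MADlib runs all of this inside the database, but for intuition here is a toy Python sketch of the algorithm just described (random seeding, assignment by Euclidean distance, and centroid updates until the assignments stop changing); it is purely illustrative and plays no role in the workflow below:

import random

def kmeans(points, k, max_iter=100):
    # Toy k-means (Lloyd's algorithm) for intuition only; MADlib's kmeans_random
    # does the equivalent work in-database and in parallel.
    centroids = random.sample(points, k)          # random seeding
    assignments = None
    for _ in range(max_iter):
        # assignment step: attach each point to its nearest centroid (squared L2 distance)
        new_assignments = [
            min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            for p in points
        ]
        if new_assignments == assignments:        # assignments stopped changing: done
            break
        assignments = new_assignments
        # update step: move each centroid to the mean of the points assigned to it
        for i in range(k):
            members = [p for p, c in zip(points, assignments) if c == i]
            if members:
                centroids[i] = [sum(dim) / float(len(members)) for dim in zip(*members)]
    return centroids, assignments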

In this animation, the four green points on the left are the initial cluster means. Points are assigned to the nearest mean, and then new means are calculated in each iteration until the assignments no longer change and the centroids are final. (Right click and choose “reset image” to replay the animation.)

K-means animation

The MADlib query below computes centroids from the training_data table, starting with random centroid seeding (i.e., random initial centroid positions).

k-means Training Function documentation for random seeding
kmeans_random( rel_source - name of table containing input data,
               expr_point - name of column with point coordinates,
               k - number of centroids to calculate,
               fn_dist - distance function to use,
               agg_centroid - aggregate function for determining centroids 
             )

We know there are 10 digits to identify, so we set k=10 as the number of clusters to calculate. The query below returns the centroids, using madlib.dist_norm2 (Euclidean distance) as the distance function and madlib.avg as the aggregate for computing centroids:

select centroids from madlib.kmeans_random('training_data', 'val', 10, 'madlib.dist_norm2', 'madlib.avg');

We can now run madlib.closest_column to assign each row of the test dataset to its nearest centroid and store the results (note that the query below recomputes the centroids in a subquery rather than reading them from a saved table):

copy (
  select data.label,
         data.test_id,
         (madlib.closest_column(centroids, val)).column_id as cluster_id
  from test_data as data,
       (select centroids
        from madlib.kmeans_random('training_data', 'val', 10,
                                  'madlib.dist_norm2', 'madlib.avg')) as centroids
  order by data.label
) to '/home/gpadmin/demo/11-testout.txt';

The assignment query produces the following assignments:

label | test_id | cluster_id
-1      2       7
-1      3       9
-1      4       5
-1      5       8
-1      6       1
-1      7       7
-1      8       8
-1      9       7
-1      10      8
-1      11      3
-1      12      1
-1      13      9
-1      14      7
...
-1      27983   6
-1      27984   7
-1      27985   5
-1      27986   8
-1      27987   0
-1      27988   0
-1      27989   6
-1      27990   2
-1      27991   1
-1      27992   2
-1      27993   6
-1      27994   9
-1      27995   1
-1      27996   9
-1      27997   1
-1      27998   8
-1      27999   9
-1      28000   0

At the top of the output are the test rows, each with its assigned cluster_id.

Note that the cluster_id isn’t the digit itself; each cluster_id needs to be mapped back to its corresponding digit using the labeled rows that follow the test data rows:

0       -1      7
0       -1      7
0       -1      7
0       -1      7
0       -1      7
0       -1      7
0       -1      3
0       -1      3
0       -1      7
0       -1      4
0       -1      7
0       -1      7
0       -1      2
...
1       -1      5
1       -1      6
1       -1      6
1       -1      6
1       -1      6
1       -1      6
1       -1      4
1       -1      6
1       -1      6
1       -1      6
...
2       -1      2
2       -1      2
2       -1      0
2       -1      0
2       -1      0
2       -1      0
2       -1      0
2       -1      0
2       -1      0
2       -1      5
2       -1      0
2       -1      0
...
3       -1      4
3       -1      8
3       -1      8
3       -1      8
3       -1      9
3       -1      8
3       -1      8
3       -1      8
3       -1      8
3       -1      9
3       -1      8
...
skipping digits 4 to 7
...
8       -1      4
8       -1      6
8       -1      4
8       -1      4
8       -1      4
8       -1      4
8       -1      4
8       -1      4
8       -1      6
8       -1      4
8       -1      8
...
9       -1      9
9       -1      9
9       -1      9
9       -1      0
9       -1      9
9       -1      9
9       -1      6
9       -1      5
9       -1      9
9       -1      9
9       -1      1
9       -1      9
9       -1      9

The fit is good enough that a quick look at the labeled rows reveals a clear majority cluster for each digit—for example, cluster 7 for digit 0, cluster 6 for digit 1, cluster 0 for digit 2, cluster 8 for digit 3, and so on.

Prepare the submission file

To prepare the submission file, we determine the majority cluster_id for each digit from the labeled rows and then replace each test row’s cluster_id with its corresponding digit, as sketched below.
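
A small script in the repo handles this step; the sketch below is a rough Python equivalent. It assumes the whitespace-delimited layout of the 11-testout.txt COPY output shown above, and the submission.csv file name is arbitrary:

from collections import Counter, defaultdict

votes = defaultdict(Counter)   # digit label -> counts of the clusters its labeled rows landed in
test_rows = []                 # (test_id, cluster_id) pairs for the actual Kaggle test rows

with open('11-testout.txt') as f:
    for line in f:
        label, test_id, cluster_id = line.split()
        if label == '-1':                  # unlabeled Kaggle test row
            test_rows.append((int(test_id), cluster_id))
        else:                              # labeled training row, used only for the mapping
            votes[label][cluster_id] += 1

# Majority cluster for each digit, inverted into a cluster_id -> digit map.
# This assumes each digit's majority cluster is distinct, which holds here.
cluster_to_digit = {counts.most_common(1)[0][0]: digit
                    for digit, counts in votes.items()}

with open('submission.csv', 'w') as out:
    out.write('"ImageId","Label"\n')
    for test_id, cluster_id in sorted(test_rows):
        out.write('%d,"%s"\n' % (test_id, cluster_to_digit[cluster_id]))

The same majority vote could also stay in the database as a GROUP BY over the labeled rows; the file-based version is sketched here only because the assignments were already copied out to disk.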

Here’s the resulting cluster-to-digit mapping (keys are cluster_ids, values are digits):

{'1': '7', '0': '2', '3': '5', '2': '6', '5': '4', '4': '8', '7': '0', '6': '1', '9': '9', '8': '3'}

After parsing the stored 11-testout.txt, keeping only the test data lines, and replacing each cluster_id with its digit, the submission file looks like this:

"ImageId","Label"
1,"2"
2,"0"
3,"9"
4,"4"
5,"3"
6,"7"
7,"0"
8,"3"
9,"0"
10,"3"
11,"5"
12,"7"
13,"9"
14,"0"
15,"4"
16,"0"
17,"3"
18,"1"

Each line gives the ImageId (the Nth row in the test dataset) and its predicted digit.

Upon submission to Kaggle, our ranking in the competition is determined immediately. The clustering accuracy comes out to about 80%, which isn’t great. But it’s not bad at all for a dozen lines of SQL that could already save the Postal Service some real money!

Kaggle submission window

Conclusion

Hopefully this tutorial served as a helpful introduction. Keep in mind that this is only a glimpse of what MADlib and HAWQ have to offer. Data analytics tasks are often iterative processes that require frequent tweaking of model parameters. In the case of k-means clustering here, we may wish to experiment with different initial groupings, distance functions, seeding ratios, and so on. Or we may wish to try a different technique altogether, such as k-Nearest Neighbors or Support Vector Machines—both supervised classifiers that have proven to be excellent digit recognizers, and both possible with almost no change to the data, simply by running the relevant queries.

The takeaway is that a workflow requiring minimal or no data movement frees the team to focus on solving the problem at hand. And if your project involves analyzing large amounts of data and SQL is already part of your workflow, then HAWQ and MADlib may be just the right tools for you.