Format Wars: From VHS and Beta to Avro and Parquet

Data Formats

You can choose from several data formats when loading your data into the Hadoop Distributed File System (HDFS). Each format has its own strengths and weaknesses, and understanding the trade-offs will help you choose one that fits your system and goals.

On this page we’ll provide a quick overview of some of the most popular data formats and share the results of a series of tests we performed to compare them.

HOW TO CHOOSE A DATA FORMAT?

Choosing a data format is not always black and white. It depends on several characteristics, including:

Size and characteristics of the data
Project infrastructure
Use case scenarios
We discuss these considerations in more detail in our blog post.

DATA FORMATS TO THE TEST

Background

We created a set of tests to compare the data formats in terms of how fast a file can be written and read. You can recreate the tests on your own system with the code used in this blog post, which can be found here: SVDS Data Formats Repository.

Disclaimer: We didn’t design these tests to act as benchmarks. In fact, all of our tests ran on different platforms with different cluster sizes. Therefore, you might see faster or slower times depending on your own system’s configuration. The main takeaway is how the data formats perform relative to one another within the same system.

The Setup

We performed tests on the Hortonworks, Cloudera, Altiscale, and Amazon EMR distributions of Hadoop. For the writing tests, we measured how long it took Hive to write a new table in the specified data format. For the reading tests, we used Hive and Impala to run queries and recorded the execution time of each query. We used Snappy compression for most of the data formats, with the exception of Avro, where we additionally used deflate compression.
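As a rough illustration of the write test, the sketch below builds a Hive CREATE TABLE ... AS SELECT statement for a given format and times how long a statement takes to run. It is not the code from our repository: the function names, the table names, and the `run_statement` callable (a stand-in for whatever client you use to submit statements to Hive) are all assumptions made for this example, and compression settings are left out because they differ by format.

```python
import time

# Hive "STORED AS" keywords for two of the formats discussed here.
# Extend this mapping with any other format you want to test.
STORAGE_CLAUSES = {
    "avro": "AVRO",
    "parquet": "PARQUET",
}


def build_ctas(source_table, target_table, fmt):
    """Return a CREATE TABLE ... AS SELECT statement for the given format."""
    clause = STORAGE_CLAUSES[fmt]
    return (
        f"CREATE TABLE {target_table} STORED AS {clause} "
        f"AS SELECT * FROM {source_table}"
    )


def time_statement(run_statement, statement):
    """Run one statement through the supplied callable; return elapsed seconds.

    run_statement is hypothetical: pass in any function that submits a
    statement to your Hive or Impala client and blocks until it finishes.
    """
    start = time.perf_counter()
    run_statement(statement)
    return time.perf_counter() - start
```

The same `time_statement` helper could be reused for the read tests by passing it a query instead of a CREATE TABLE statement.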

The queries we ran to measure read speed were of the form:

SELECT COUNT(*) FROM TABLE WHERE …

Query 1 includes no additional conditions.
Query 2 includes 5 conditions.
Query 3 includes 10 conditions.
Query 4 includes 20 conditions.
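
To make the shape of the four queries concrete, here is a minimal sketch that generates them. The table name, the column names, and the predicates are hypothetical placeholders, not the ones we used. Joining the conditions with AND, and giving Query 1 no WHERE clause at all, are also assumptions made for this example.

```python
def build_count_query(table, conditions):
    """Return a SELECT COUNT(*) query with the given conditions ANDed together."""
    query = f"SELECT COUNT(*) FROM {table}"
    if conditions:
        query += " WHERE " + " AND ".join(conditions)
    return query


# Hypothetical predicates on hypothetical columns col_1 .. col_20.
ALL_CONDITIONS = [f"col_{i} IS NOT NULL" for i in range(1, 21)]

# Number of conditions in Query 1 through Query 4.
CONDITION_COUNTS = [0, 5, 10, 20]

queries = [
    build_count_query("my_table", ALL_CONDITIONS[:n])
    for n in CONDITION_COUNTS
]

for number, query in enumerate(queries, start=1):
    print(f"Query {number}: {query}")
```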