Railroad Modeling at Hadoop Scale

June 2nd, 2014

Data Scientist Tatsiana Maskalevich and CTO John Akred presented at this year’s Hadoop Summit in San Jose, showing how we brought together video, audio and social media to analyze an issue close to everyone’s heart in Silicon Valley: whether the Caltrain is running on time.

Hadoop is a flexible platform for storing disparate types of data, the variety dimension of the famous “3 Vs”. We will discuss how we use Hadoop to combine audio, video, and social media data sources to analyze Caltrain activity and provide real-time insight into deviations from the regular schedule.

This is an instance of the general problem of combining disparate data sources to reason about the current operational state of a business system. We will cover how we store raw sensor and social media data in HDFS, then use various processing frameworks to refine that data and store it in HBase. We will walk through specific examples of using Hive, Flume, Python, SerDes, and Avro to take data from various inputs and perform the transformations needed to make it suitable for analysis.

Finally, we will discuss how we develop and integrate the analytical components using Python’s NumPy, scikit-learn, OpenCV, and pandas libraries. The goal of the analyses is to recognize train sounds in audio streams, detect trains in video streams, and combine those detections with data from social media. Ultimately, we aim to determine which train is where, and how it is running relative to the schedule.
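
As a rough illustration of that last step, here is a minimal sketch of how detections from several sensors might be fused and compared against a timetable. It is not the implementation presented in the talk: the schedule, the train numbers, the function names `fuse_detections` and `match_to_schedule`, and the 90-second and 20-minute thresholds are all assumptions made up for the example, and it uses only the Python standard library.

```python
from datetime import datetime, timedelta

# Hypothetical timetable for a single station: train number -> scheduled time.
SCHEDULE = {
    "101": datetime(2014, 6, 2, 8, 0),
    "103": datetime(2014, 6, 2, 8, 30),
    "105": datetime(2014, 6, 2, 9, 0),
}


def fuse_detections(timestamps, window=timedelta(seconds=90)):
    """Collapse detections that fall within `window` of the start of an
    event into that event, so that an audio hit and a video hit for the
    same passing train are counted once."""
    events = []
    for ts in sorted(timestamps):
        if events and ts - events[-1] <= window:
            continue  # same train pass, already recorded
        events.append(ts)
    return events


def match_to_schedule(detected_at, schedule, max_gap=timedelta(minutes=20)):
    """Return (train_id, delay_in_minutes) for the scheduled train nearest
    in time to the detection, or None if nothing is within `max_gap`.
    A positive delay means the train was observed after its scheduled time."""
    train_id, scheduled = min(
        schedule.items(), key=lambda item: abs(detected_at - item[1])
    )
    gap = detected_at - scheduled
    if abs(gap) > max_gap:
        return None
    return train_id, gap.total_seconds() / 60.0


# Assumed example input: two sensors pick up the same train, then a later one.
audio_hits = [datetime(2014, 6, 2, 8, 4, 10), datetime(2014, 6, 2, 8, 31, 0)]
video_hits = [datetime(2014, 6, 2, 8, 4, 40)]

for event in fuse_detections(audio_hits + video_hits):
    print(event, match_to_schedule(event, SCHEDULE))
```

Nearest-time matching is the simplest possible choice and cannot tell a very late train from an early one. A real system would presumably also use direction of travel, train type, or reports from social media to resolve that ambiguity.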