How Do You Build a Data Product?

November 13th, 2015

We had a great time as part of the Datapalooza festival in San Francisco, a tech conference-meets-hackathon event where attendees learn data science and team up to build a complete data product over three days. With our Caltrain Rider app as an example of a data product, we were happy to share some of our stories.

What makes a data product?

Data products are the reason data scientists are lately treated like rockstars. They incorporate data science into the operation of a product or service, using data in smart ways to provide value. It’s more than just analysis: it’s putting insight into production. Every day we use the archetypal data product, Google search, and our every interaction with the service makes it better. Another famous early data product is LinkedIn’s “people you may know” feature, helping you locate people in your social networks.

Along the way at SVDS, we’ve learned a few things about data products, which we shared as we told the story of the Caltrain Rider app.

Data products can use data from their own usage to improve

By observing how users interact with your product, you can learn a lot. Through instrumenting the user interface, analyzing logs, or other ways of deriving data from users, you can gain extra signals that help improve your data modeling. In the case of Caltrain Rider, we are using GPS data from users’ phones to better understand the movement of trains within the system.

Data products are bootstrapped and then evolve

Good data products are rarely “done.” Through usage and continued investigation, you start to better understand the problem that you’re trying to solve. One of the characteristics of working with data is that it’s best to work in an agile way: often you don’t even know the right question to ask until you’ve explored the problem space. Get a product in use early, then learn, adapt, and evolve the product.

Data products are best built with nimble, multifunctional teams

The rapid cycle of product evolution is best served by a multifunctional team of data scientists, engineers, product managers, and architects. To move fast with data, data scientists need to get the data from engineers, and insights and discoveries from data science inform product direction. If these people are in disconnected departments, product development moves slowly and can be defeated by poor communication.

Multi-source data, because “GIGO” still applies

Every student learns that “garbage-in, garbage-out” is true of computer systems, and data products aren’t any different. If you don’t have good data going in, you won’t get a good result. However, that doesn’t mean you throw weak data away. Instead, by using as many diverse data sources as possible, you can create models that are robust in the face of any of the individual sources failing or being erroneous. With Caltrain Rider, we’re bringing in audio, video, GPS, and social signals in addition to schedule and API data.
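
To make the idea of robustness concrete, here is a minimal sketch of one common way to combine several noisy sources: take the median of whatever estimates are available. This is an illustration only, not how Caltrain Rider actually fuses its signals. The function name, the source labels, and the delay values are all hypothetical.

```python
from statistics import median
from typing import Dict, Optional


def fuse_delay_estimates(estimates: Dict[str, Optional[float]]) -> Optional[float]:
    """Combine per-source train delay estimates (in minutes) into one value.

    A source that failed to report is represented by None and is skipped.
    The median is used so that a single wildly wrong source cannot drag
    the combined estimate far from what the other sources agree on.
    """
    available = [value for value in estimates.values() if value is not None]
    if not available:
        return None  # every source failed, so there is nothing to report
    return median(available)


# Hypothetical readings: one source is missing and one is clearly off.
readings = {
    "schedule": 0.0,
    "gps": 4.0,
    "audio": 5.0,
    "social": 45.0,  # an erroneous outlier
    "video": None,   # this source failed to report
}

fused = fuse_delay_estimates(readings)
print(fused)  # 4.5
```

A plain average of the same four readings would give 13.5 minutes, dominated by the one bad source. The median stays close to the values the other sources agree on, which is the kind of robustness that diverse inputs make possible.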

Data products can learn things from a system that’s otherwise closed

One of the most exciting aspects of data science is that we can use observed data signals to predict the behavior of a system that we can’t directly access or comprehend. Without understanding the semantic import of every web page, search engines can still figure out which is most useful. With Caltrain Rider, we’re working to predict the behavior of the train system, without any special access to the system itself. This opens a world of opportunity for innovation and entrepreneurialism. If you can use data, you can crack pretty much any problem area you want. That’s why Silicon Valley companies are challenging the grocery, taxi, and entertainment industries, to name just a few.

Data products solve a real problem that people have

Technology is important, for sure. It can often make new things possible, and transform whole industries. But for successful products and companies, it’s always the problem that comes first. A great data product focuses relentlessly on solving the problem that the user has, using whatever data and techniques will help. With today’s proliferating options of platforms, tools, and languages, the only practical way to navigate these options is with a laser focus on how they can help solve the human and business problems at hand.

Diving deeper

If you’d like to look further at some of the analyses we’ve run in developing Caltrain Rider, we’ve made these available in a GitHub repository for Datapalooza. Using IPython notebooks, you can step interactively through analyzing video frames to identify trains and their directionality, and through running sentiment analysis on tweets from train riders.

Thanks to the Silicon Valley Data Science crew that supported Datapalooza: Stephen O’Sullivan, Chloe Mawer, Eric White, Ben Everson, Jeffrey Yau, and Harrison Mebane.