Data Architecture Reading List
July 15th, 2014
Databases sure ain’t what they used to be: it takes more than a relational database to put together a modern data architecture. The web and mobile have driven a host of scaling and robustness considerations for databases, with further demands around the corner thanks to the internet of things. The rise of data science and analytics also creates new data use cases, which have a material effect on how you design a data architecture.
These new demands have led to an explosion in data technology. It seems that a new database system is announced weekly! New features for well-known relational databases; new arrivals to the NoSQL conglomerate of column stores, key-value stores, document stores, and graph databases; novel message queues and log processors. And of course, the rapid growth of the white-hot center of big data technology, Hadoop.
So how do you make sense of it? Thankfully, best practice is emerging, and much of it through the writings of experienced engineers at the forefront of distributed data systems. To help you navigate the new world of data architecture, here is a collection of some favorite readings.
Understanding New Tools
NoSQL Distilled: A Brief Guide To The Emerging World Of Polyglot Persistence
Sadalage and Fowler (Addison Wesley, 2012).
This book is a good reference for managing multiple data stores for different use cases across a business. It covers where each type of database fits best, including key-value stores, document stores, relational databases, and graph databases. An example chapter is available online: Introduction to Polyglot Persistence: Using Different Data Storage Technologies for Varying Data Storage Needs.
A Modern Data Architecture with Apache Hadoop
Hortonworks, Inc. White paper.
This vendor white paper describes some of the major changes in approach that Hadoop brings, such as “schema on read,” multi-use and multi-workload processing, and low-cost data storage. At a high level, it explores how Hadoop will coexist with other enterprise technologies in the long term.
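To make “schema on read” concrete, here is a minimal sketch, not taken from the white paper. It assumes raw records are stored untouched as JSON lines, and that each consumer applies its own schema only when reading. The names RAW_LINES, read_with_schema, billing_schema, and audit_schema are hypothetical.

```python
import json

# Raw records are stored exactly as they arrived; nothing is validated on write.
# (Hypothetical sample data for illustration.)
RAW_LINES = [
    '{"user": "ada", "amount": "12.50", "ts": "2014-07-15"}',
    '{"user": "bob", "amount": "3", "extra": "ignored"}',
]


def read_with_schema(lines, schema):
    """Apply a schema (field name -> cast function) while reading.

    Fields missing from a record come back as None, and fields that are
    not in the schema are simply skipped.
    """
    for line in lines:
        raw = json.loads(line)
        yield {
            field: (cast(raw[field]) if field in raw else None)
            for field, cast in schema.items()
        }


# Two consumers read the same raw data through different schemas.
billing_schema = {"user": str, "amount": float}
audit_schema = {"user": str, "ts": str}

billing_rows = list(read_with_schema(RAW_LINES, billing_schema))
audit_rows = list(read_with_schema(RAW_LINES, audit_schema))
```

The point of the sketch is that the stored data never changes: a new use case only needs a new schema, not a reload.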
Software Engineering
MAD Skills: New Analysis Practices for Big Data
Cohen, Dolan, Dunlap, Hellerstein, and Welton (VLDB ’09).
This paper captures the essence of how data science in business is different from the worlds of business intelligence and data warehousing that went before. Drawing from work done at Fox Audience Network, the paper describes database design methodologies that support an agile working style for data scientists, and considers database system features that enable agile design and flexible algorithm development using both SQL and MapReduce.
Software Engineering Advice from Building Large-Scale Distributed Systems
Dean. Talk given as a Stanford CS295 class lecture.
Jeff Dean has a singular reputation in distributed systems, and has been at the leading edge of Google engineering for many years. Much of his published work has since become a key part of the big data toolbag, most notably MapReduce and Bigtable. Dean’s Stanford talk provides great context for the working engineer on building scalable distributed systems. Drawing from hard-won experience, Dean talks about performance, scalability, and reliability, along with design patterns that help protect against failure scenarios.
Lambda and Log Architectures
How to beat the CAP theorem
Marz (nathanmarz.com, 2011).
Nathan Marz has been a prominent voice in the rethinking of data architecture for distributed systems. This influential post upends the traditional RDBMS-centric view of data. As Marz says, “a lot of people want a scalable relational database. What I hope you’ve realized in this post is that you don’t want that at all!” The alternative put forward by Marz is what has become known as the “Lambda architecture,” with different pathways for real-time and batch-processed data.
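The two pathways can be pictured with a toy page-view counter. This is a rough sketch under simplifying assumptions, not Marz’s implementation: everything lives in memory, the batch path recomputes its view from the full event record, and the real-time path only covers events that arrived since the last batch run. The class and method names (LambdaCounter, ingest, run_batch, query) are hypothetical.

```python
from collections import Counter


class LambdaCounter:
    """Toy page-view counter with a batch path and a real-time path."""

    def __init__(self):
        self.master_log = []            # append-only record of every event
        self.batch_view = Counter()     # rebuilt from scratch by the batch path
        self.realtime_view = Counter()  # incremental counts since the last batch run

    def ingest(self, page):
        # Every event is recorded for the batch path and also counted
        # immediately on the real-time path.
        self.master_log.append(page)
        self.realtime_view[page] += 1

    def run_batch(self):
        # The batch path recomputes its view from all recorded events,
        # after which the real-time counts it now covers are discarded.
        self.batch_view = Counter(self.master_log)
        self.realtime_view.clear()

    def query(self, page):
        # Queries merge the two views.
        return self.batch_view[page] + self.realtime_view[page]


counter = LambdaCounter()
for page in ["home", "home", "about"]:
    counter.ingest(page)
before_batch = counter.query("home")   # served by the real-time view only
counter.run_batch()
counter.ingest("home")
after_batch = counter.query("home")    # batch view plus one real-time event
```

In a real deployment the batch run and the hand-off between views happen on separate systems and are not instantaneous; the sketch ignores that to keep the merge step visible.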
The Log: What every software engineer should know about real-time data’s unifying abstraction
Kreps (engineering.linkedin.com, 2013).
Jay Kreps asserts that “You can’t fully understand databases, NoSQL stores, key value stores, replication, Paxos, Hadoop, version control, or almost any software system without understanding logs; and yet, most software engineers are not familiar with them.” He shows the utility of a log-centric data processing architecture, as implemented at LinkedIn, and proposes an infrastructural model that puts the log at its core.
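For readers new to the idea, a log in this sense is an append-only, ordered sequence of records, each identified by its position. The sketch below is an in-memory illustration under that assumption only; it is not LinkedIn’s implementation, and the Log and Consumer classes are hypothetical names.

```python
class Log:
    """Append-only, totally ordered sequence of records."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # Records are only ever added at the end; the return value is
        # the record's offset (its position in the log).
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset):
        # Return every record from the given offset onward, in order.
        return self._records[offset:]


class Consumer:
    """Reads the log at its own pace by remembering its own offset."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        records = self.log.read(self.offset)
        self.offset += len(records)
        return records


log = Log()
search_indexer = Consumer(log)
cache_updater = Consumer(log)

log.append("user 1 created")
log.append("user 1 renamed")
first_poll = search_indexer.poll()
log.append("user 2 created")
second_poll = search_indexer.poll()
late_poll = cache_updater.poll()  # a slower consumer still sees every record, in order
```

Because each consumer tracks its own offset, downstream systems can fall behind or restart without affecting each other, and all of them see the same records in the same order.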
Big Data: Principles and best practices of scalable realtime data systems
Marz and Warren (Manning, November 2014 est.).
A book-length exposition of the Lambda Architecture, a good summary of which can be found in the first chapter, available for free online.
Applying the Big Data Lambda Architecture
Hausenblas (Dr. Dobb’s, 2013).
A worked example of the Lambda Architecture, using Hadoop, Hive, and HBase.
Google Cloud Implementation of Lambda Architecture
Google Cloud Platform Team (Hangout recording, 2014).
An extended discussion on setting up a Lambda Architecture using Google’s Cloud Platform.
Questioning the Lambda Architecture
Kreps (O’Reilly Radar, 2014).
In this critique of the Lambda Architecture, Kreps does a good job of explaining the architecture itself. While acknowledging its strengths, he suggests that in the long term, maintaining two separate pathways for streaming and batch computation creates serious software development difficulties. He proposes an evolution of the architecture that removes the batch processing pathway and folds its work into a single stream processing pathway.
Transactions in Distributed Data
Returning Transactions to Distributed Data Stores
Rosenthal and Pimentel (O’Reilly Strata Blog, 2013).
We recommend this paper to anyone considering distributed solutions for transactional applications. It also discusses CAP (Consistency, Availability, Partition tolerance) trade-offs and local ACID (Atomicity, Consistency, Isolation, Durability) transactions. Recent advances in distributed databases, such as Google’s Spanner, point to an emerging new pattern in transactional database design.
Life beyond Distributed Transactions: an Apostate’s Opinion
Helland (CIDR 2007).
Noting the inherent difficulty of building architectures that support distributed transactions, Helland explores and names some of the practical approaches used to implement large-scale, mission-critical applications in a world that rejects distributed transactions.
Want More Resources?
My colleagues Richard Williamson, Stephen O’Sullivan, and John Akred will be giving a tutorial called Building a Data Platform at the Strata Conference + Hadoop World in New York this October. Registration is open now, with early bird pricing until July 31.