When to Use Hadoop Instead of a Relational Database Management System (RDBMS)
With "big data" being such a buzz word these days, vendors are quickly attempting to tell you why their platform is better than all the rest. But in this blog, I am not going to be talking about vendors nor will I focus on their products. The basic question that I want to answer is one that I get asked the most by my colleagues, which is, “When should I use Hadoop instead of an RDBMS?”
So where did this question come from in the first place? We live in the age of data: everything we use in this digital age generates data, and collectively a massive amount of it. In its 2011 report, IDC estimated the size of the digital universe to be 1.8 zettabytes. A zettabyte is 10^21 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. That’s roughly the same order of magnitude as having one disk drive for every person in the world. Much of this data is locked up in the largest web properties, such as search engines, and in scientific and financial institutions. All of this event-oriented data being generated every day is what we call big data. With the increasing democratization of data, the amount of public data being made available also grows every year. This matters because the future success of organizations will depend not only on how well they use their internally generated data, but also on how well they can extract value from public data and data made available by other organizations.
The simple answer to the question, “When do I use Hadoop instead of an RDBMS?” is: use Hadoop when you can no longer achieve the desired results with your RDBMS. Let’s look back at how these two technologies evolved and the kinds of problems they are able to solve.
RDBMS technologies have evolved from the 1970s to the present day, with broad use in financial reporting, enterprise resource planning, transaction processing and supply chain management for businesses across the world. The development of the RDBMS started at a time when disks were expensive and CPUs were slow. Because of this, optimal and efficient use of available storage was very important. This is one of the reasons why data is normalized: to retain integrity, remove redundancy and thereby minimize storage.
Further, online transaction processing (OLTP) use cases were significant drivers for early relational databases, so RDBMS engines were built to adhere to ACID principles in order to be considered reliable. And while adherence to ACID principles is required by many mission-critical operations, this capability comes at a price (e.g. managing a transaction log to support rollbacks). Relational databases have been optimized to capture and analyze line items like retail orders, drugs that a patient is taking, supply chain records and other tangible entities that map naturally to something in the real world. They are good for point queries or updates, where the dataset has been indexed to deliver fast retrieval and update times, which is essential for systems that require low-latency access. Most reporting systems today use some kind of RDBMS as a data store because of this ability to provide low-latency access to data.
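To make the contrast concrete, below is a minimal sketch of the kind of workload an RDBMS excels at: a low-latency point lookup against an indexed key, expressed here through plain JDBC. The connection URL, table and column names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PointQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection and schema; any RDBMS with an index on order_id behaves similarly.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/retail", "app", "secret");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT status, total FROM orders WHERE order_id = ?")) {
            ps.setLong(1, 1001L); // point lookup on the indexed primary key
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    System.out.println(rs.getString("status") + " " + rs.getBigDecimal("total"));
                }
            }
        }
    }
}
```

An index walk answers this in milliseconds regardless of how large the table grows, which is exactly the access pattern relational engines are built around.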
Software used for analytics used to bring data from a data store (usually an RDBMS) into the analytics layer to perform analysis. But as data sizes grew, it became harder to move data from the data layer into the analytics layer for computation. RDBMS technologies evolved to address this data-transfer problem by providing built-in analytical functions, but this in turn led to the RDBMS being asked questions it wasn't equipped to answer. Josh Wills from Cloudera calls this a unit-of-analysis problem. Below are a few symptoms you will observe if you are facing it.
1. Count Distinct Queries. A count distinct query by definition has to process every record, including sorting and counting. There is not much an RDBMS query optimizer can do in this case; it’s just a matter of grinding through the data. And that becomes a difficult problem when the volume of data is huge.
Say you have a retail sales data warehouse set up in a traditional star schema and you want to find out how many people bought a TV today, across all your stores in the US, who had never purchased from you before. This is a very simple English question but very difficult to answer with a traditional RDBMS, because it must scan vast quantities of data to calculate the result. For example, the RDBMS must query the sales for each customer, filter down to first-time customers and then count them. Even in a medium-sized warehouse, both dimensions may have tens of thousands of entries, and the combination of these dimensions generates a huge intermediate result set that needs to be analyzed. Mixing one or more such distinct aggregates with non-distinct aggregates in the same select list, or mixing two or more distinct aggregates, causes further performance issues because it leads to spooling and re-reading of intermediate results.
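For contrast, here is a minimal sketch of how the same count-distinct question can be ground through in parallel with Hadoop MapReduce, assuming the sales records have already been denormalized into flat CSV lines of the form customer_id,product_category,store_id,date (the field layout and class names are illustrative, and the "never purchased before" filter is assumed to have been applied upstream). Because the shuffle groups records by customer ID, each distinct buyer reaches a reducer exactly once and can simply be counted.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DistinctTvBuyers {

    public static class SaleMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            // Emit the customer id once per qualifying sale; duplicates collapse in the shuffle.
            if (fields.length >= 2 && "TV".equals(fields[1])) {
                context.write(new Text(fields[0]), NullWritable.get());
            }
        }
    }

    public static class DistinctReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text customerId, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // Each distinct customer id reaches a reducer exactly once, so just count it.
            context.getCounter("sales", "distinct_tv_buyers").increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distinct TV buyers");
        job.setJarByClass(DistinctTvBuyers.class);
        job.setMapperClass(SaleMapper.class);
        job.setReducerClass(DistinctReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The work is the same brute-force scan, but it is spread across as many machines as you have data nodes instead of being funneled through a single query engine.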
2. Cursors. A cursor steps through a table row by row. Iterating with a cursor over every row of a table of any significant size, applying some CASE-statement logic to each one, is a very bad situation. Cursors are fine for iterating through small metadata tables, but RDBMS engines are not optimized for stepping through large datasets one entry at a time.
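Here is a minimal sketch of that row-at-a-time pattern from the client side using JDBC; a stored-procedure cursor has the same shape. The connection details, table and column names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CursorScanExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection and table; the point is the row-at-a-time access pattern.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/retail", "app", "secret");
             Statement stmt = conn.createStatement()) {
            stmt.setFetchSize(100); // hint the driver to fetch rows in small batches
            try (ResultSet rs = stmt.executeQuery("SELECT customer_id, total FROM sales")) {
                double highValueTotal = 0;
                while (rs.next()) { // one step of per-row work for every row in the table
                    double total = rs.getDouble("total");
                    if (total > 1000.0) {
                        highValueTotal += total;
                    }
                }
                System.out.println("High-value total: " + highValueTotal);
            }
        }
    }
}
```

This is tolerable for a small metadata table and disastrous for a billion-row fact table.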
3. Alter Table of Doom. Imagine a customer's big data warehouse containing a table that can only be called an “ALTER TABLE of DOOM”: a table so big, so important and with so many columns that altering it by adding a column, changing a column's data type or running bulk DML against it takes a very long time to complete. Such operations need to be planned and executed very carefully, as they lock the table for the entire duration of the statement. In addition, if the column you are adding has a NOT NULL clause, it is especially painful because the DBMS has to insert default values into all of the existing rows, which may overwhelm your transaction logs.
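As a hypothetical illustration, the statement below (issued here through JDBC; the table and column names are made up, and the exact locking and logging behavior varies by engine) is the kind of change that turns into an hours-long, log-flooding operation on a very large table.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AlterTableOfDoom {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/warehouse", "dba", "secret");
             Statement stmt = conn.createStatement()) {
            // On many engines this holds a lock on sales_fact for the duration and may
            // rewrite every existing row to back-fill the NOT NULL default, flooding the
            // transaction log.
            stmt.executeUpdate(
                "ALTER TABLE sales_fact ADD COLUMN loyalty_tier VARCHAR(20) NOT NULL DEFAULT 'none'");
        }
    }
}
```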
4. Data Merge and Mashup. Most retailers today have both an online and an in-store presence. Consider a scenario where you have customers' online product-search data (search logs) from the retailer's website for the last 15 days, their past in-store purchase history (in an RDBMS), their in-store charge-card transaction data and the daily commute pattern data you get from their cellphone provider. If you want to build an analytical model that combines these myriad sources to send custom discount offers valid at a specific store along each customer's daily commute path, you need to bring all of these sources of data together. It's difficult to deal with unstructured data in an RDBMS, let alone combine unstructured data with structured.
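One common way to bring such sources together on Hadoop is a reduce-side join keyed on customer ID: every source is mapped to (customer_id, tagged payload) pairs, and the reducer sees all of a customer's records from all sources at once. The sketch below shows the idea with hypothetical record layouts and only one of the mappers; in practice each source would get its own mapper (wired up with something like MultipleInputs), and the job driver is omitted for brevity.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CustomerMashup {

    // Assumes search-log lines of the form "customer_id,query,timestamp"; the other
    // sources (purchase history, charge-card feed, commute data) get similar mappers.
    public static class SearchLogMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",", 2);
            if (fields.length == 2) {
                context.write(new Text(fields[0]), new Text("search\t" + fields[1]));
            }
        }
    }

    public static class OfferReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text customerId, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            List<String> searches = new ArrayList<>();
            List<String> commutes = new ArrayList<>();
            for (Text record : records) {
                String[] parts = record.toString().split("\t", 2);
                if ("search".equals(parts[0]) && parts.length == 2) {
                    searches.add(parts[1]);
                } else if ("commute".equals(parts[0]) && parts.length == 2) {
                    commutes.add(parts[1]);
                }
            }
            // With the customer's full picture in one place, decide whether to emit an offer.
            if (!searches.isEmpty() && !commutes.isEmpty()) {
                context.write(customerId, new Text("send-offer"));
            }
        }
    }
}
```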
For entities with multiple dimensions we need a different approach to structuring data and thinking about analytics. A customer entity has many dimensions because people are complicated: they have an entire transaction history, a bunch of friends, and their tweets along with a plethora of other social-media activity. This kind of rich, repeated structure forms a hierarchy of data about the entity you want to understand and analyze. These entities also persist and evolve continuously: you have long-term relationships with your customers, so you have to keep adding data to these complex entities as they change. Hadoop enables us to do this. People often talk about Hadoop with a focus on unstructured data. Personally, I feel there is no such thing as unstructured data; truly unstructured data is just noise. People use the term to refer to data that is not tabular and does not fit naturally into a relational database schema. For this type of data, we need new ways of structuring it outside a traditional DB schema.
Apache Thrift (which originated at Facebook), Protocol Buffers from Google and Avro, created by Doug Cutting at Cloudera, are some of the efforts in this direction. The idea behind all of these formats is to serialize an entity object to disk and read it back from a whole range of programming languages. What’s great is that these formats are customizable. They help us avoid the “alter table of doom” scenario because it is easy to add new fields to them; we don’t have to rewrite all the old data the way we would with a traditional ALTER TABLE statement. They are compact on disk despite their flexibility, fast to serialize and deserialize, and widely supported.

You want to use Hadoop when you are faced with analyzing a multi-dimensional entity like a customer object and your data size is on the order of terabytes or petabytes. When analyzing a customer object with a Hadoop MapReduce program, you can have all of the data about a customer de-normalized and available at once: all their transactions, tweets and friends in one place. It’s an analyst's dream. Hadoop’s MapReduce uses sort/merge, which is much more efficient than a B-tree (the data structure used in relational databases) at rebuilding an entire dataset or updating the majority of one, especially when working with big data.
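As a minimal sketch of what such a de-normalized customer entity could look like on disk, here is the idea expressed with Avro's generic API (the schema, field names and file path are hypothetical; Thrift and Protocol Buffers support the same pattern through their own IDLs). The nested transaction list keeps the customer's history inside the record, and the later-added field carries a default so older files never need to be rewritten.

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class CustomerRecordExample {
    public static void main(String[] args) throws Exception {
        // A nested customer entity: the transaction history lives inside the record
        // itself, so an analysis job gets the whole customer in one read.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Customer\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"transactions\",\"type\":{\"type\":\"array\",\"items\":\"string\"}},"
            // A field added in a later version of the schema; its default lets readers
            // fill it in for older files without rewriting them.
            + "{\"name\":\"loyalty_tier\",\"type\":\"string\",\"default\":\"none\"}]}");

        GenericRecord customer = new GenericData.Record(schema);
        customer.put("id", 42L);
        customer.put("transactions", java.util.Arrays.asList("tv-2013-01-12", "cable-2013-01-12"));
        customer.put("loyalty_tier", "gold");

        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("customers.avro"));
            writer.append(customer);
        }
    }
}
```

A MapReduce job can then read these records and get each customer object in one shot, with no joins required.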
In my next blog I will be talking about the reasons that motivated the development of Hadoop, and how it is different from a traditional RDBMS.