Big Data is here in a big way. Increasingly, companies are wishing to utilize the massive amounts of data stored on web users today. For example, think how much data is stored in the Facebook Open Graph. Another example would be the structure of the web itself, which is important to companies like search engines.
One of the best signs that Big Data is entering the mainstream is the number of tools growing around the popular NoSQL data platform Hadoop. Vendors of Business Intelligence products and open source projects like Apache Hive are seeking to put a SQL or standard reporting interface on top of the underlying data cluster. Once we have the SQL interface it's easy to imagine using tools like Hibernate to put another layer of indirection on top of that data.
So, in the very near future we may have the average developer performing operations on data stored on large clusters of machinery using plain old objects to access the underlying store. We all know abstractions sometimes leak, however. What may separate the master programmer from the competent journeyman is the ability to understand what's going on underneath all this abstraction.
While the competent programmer will be able to handle the majority of cases with his tools, there are going to be cases where the abstraction leaks. It will take a more skilled programmer, who understands what's going on underneath, to find the root of the problem, or to decide he must code one layer closer to the metal.
One layer closer to the metal than a SQL interface we have Apache Pig. This is a language used against Hadoop to process big data. From basic relational algebra we know many operations can be broken into smaller pieces. For example, an inner join can also be represented as a select into a cross join. Perhaps a better analogy when dealing with Pig, is the actual execution plans used by a relational database.
Fairly mundane SQL queries against a SQL data store are be broken up into an execution plan containing small steps like: stream processing, table scans, ordering, hash match joins, merge joins, etc. When using Pig, you are essentially performing queries at this level of detail.
You may even specify how the joins are performed, with keywords like replicated, skewed, or merge, and control over the number of underlying reduce tasks. The programmer must understand the conditions in which each type is possible or desirable to get maximum performance or avoid failure due to memory constraints.
Similarly, any cleaning of the data, or operations performed on each tuple, must be done in the step-by-step fashion with FOREACH. Any filtering, flattening, grouping, etc. is each done in a separate step. Still, Pig is a layer above the actual underlying MapReduce operations.
So I recently had another read of the MapReduce chapter, by Jeffrey Dean and Sanjay Ghemawat, in the excellent book Beautiful Code. MapReduce was developed at Google, and it's solves the types of problems the search engine giant needs to solve in a very elegant way. For example, coming up with a count of inbound links to target URLs given a bunch of documents containing links is a problem that lends itself to MapReduce programming.
First the chapter deals with alternative means of performing a task like counting word frequencies in a large number of documents. We want to be able to do this in parallel because there are a large number of documents on the web. Through conventional techniques we see that quickly what we are trying to do, count words, is lost amongst all the code for splitting the work onto different threads, different machines, network access, and for safely accessing shared data.
What MapReduce gives us is a driver framework that handles graceful failure, locking, network access, etc. for us. This allows the actual task being done to be in the expressed in the way not much more complicated than a simple single threaded implementation. Basically, MapReduce partitions the initial work on large data between a number of Map process. These processes suck in the data, for example, a bunch of documents. They then write records for each occurrence of what we're looking for, i.e. <docid,keyword> and write to an intermediate file on the distributed filesystem.
Then the Reduce tasks are going to take these intermediate files access them in a safe way, sort by key, and pass any values to the reduce function. Some problems, like a distributed sort, only require the identity Reduce function, which basically only sorts. In other cases like grouping all URLs that link to a given docid, there will be some kind of aggregation this point.
MapReduce is what is really going on under the covers when you're working with Big Data. Also, it is a very good example of how to solve a complex problem in an elegant way. The MapReduce driver hides most of the complexity from the application programmer. The authors claim it is commonly used at Google even by programmers who had never worked on this kind of complex distributed system before.
Beautiful Code is a great book in general and I think in the future I will share some more insights I found therein. I would highly recommend it to programmers so that they can sharpen their own skills by learning from the masters. For those working in Big Data in particular, even if their tools handle this level for them, certainly I'd consider the MapReduce section a must read.
One of the best signs that Big Data is entering the mainstream is the number of tools growing around the popular NoSQL data platform Hadoop. Vendors of Business Intelligence products and open source projects like Apache Hive are seeking to put a SQL or standard reporting interface on top of the underlying data cluster. Once we have the SQL interface it's easy to imagine using tools like Hibernate to put another layer of indirection on top of that data.
So, in the very near future we may have the average developer performing operations on data stored on large clusters of machinery using plain old objects to access the underlying store. We all know abstractions sometimes leak, however. What may separate the master programmer from the competent journeyman is the ability to understand what's going on underneath all this abstraction.
While the competent programmer will be able to handle the majority of cases with his tools, there are going to be cases where the abstraction leaks. It will take a more skilled programmer, who understands what's going on underneath, to find the root of the problem, or to decide he must code one layer closer to the metal.
One layer closer to the metal than a SQL interface we have Apache Pig. This is a language used against Hadoop to process big data. From basic relational algebra we know many operations can be broken into smaller pieces. For example, an inner join can also be represented as a select into a cross join. Perhaps a better analogy when dealing with Pig, is the actual execution plans used by a relational database.
Fairly mundane SQL queries against a SQL data store are be broken up into an execution plan containing small steps like: stream processing, table scans, ordering, hash match joins, merge joins, etc. When using Pig, you are essentially performing queries at this level of detail.
You may even specify how the joins are performed, with keywords like replicated, skewed, or merge, and control over the number of underlying reduce tasks. The programmer must understand the conditions in which each type is possible or desirable to get maximum performance or avoid failure due to memory constraints.
Similarly, any cleaning of the data, or operations performed on each tuple, must be done in the step-by-step fashion with FOREACH. Any filtering, flattening, grouping, etc. is each done in a separate step. Still, Pig is a layer above the actual underlying MapReduce operations.
So I recently had another read of the MapReduce chapter, by Jeffrey Dean and Sanjay Ghemawat, in the excellent book Beautiful Code. MapReduce was developed at Google, and it's solves the types of problems the search engine giant needs to solve in a very elegant way. For example, coming up with a count of inbound links to target URLs given a bunch of documents containing links is a problem that lends itself to MapReduce programming.
First the chapter deals with alternative means of performing a task like counting word frequencies in a large number of documents. We want to be able to do this in parallel because there are a large number of documents on the web. Through conventional techniques we see that quickly what we are trying to do, count words, is lost amongst all the code for splitting the work onto different threads, different machines, network access, and for safely accessing shared data.
What MapReduce gives us is a driver framework that handles graceful failure, locking, network access, etc. for us. This allows the actual task being done to be in the expressed in the way not much more complicated than a simple single threaded implementation. Basically, MapReduce partitions the initial work on large data between a number of Map process. These processes suck in the data, for example, a bunch of documents. They then write records for each occurrence of what we're looking for, i.e. <docid,keyword> and write to an intermediate file on the distributed filesystem.
Then the Reduce tasks are going to take these intermediate files access them in a safe way, sort by key, and pass any values to the reduce function. Some problems, like a distributed sort, only require the identity Reduce function, which basically only sorts. In other cases like grouping all URLs that link to a given docid, there will be some kind of aggregation this point.
MapReduce is what is really going on under the covers when you're working with Big Data. Also, it is a very good example of how to solve a complex problem in an elegant way. The MapReduce driver hides most of the complexity from the application programmer. The authors claim it is commonly used at Google even by programmers who had never worked on this kind of complex distributed system before.
Beautiful Code is a great book in general and I think in the future I will share some more insights I found therein. I would highly recommend it to programmers so that they can sharpen their own skills by learning from the masters. For those working in Big Data in particular, even if their tools handle this level for them, certainly I'd consider the MapReduce section a must read.
Post a Comment