  • Essay / Pavlo Comparison

    The paper “A Comparison of Approaches to Large-Scale Data Analysis” by Pavlo et al. compares the MapReduce framework with parallel DBMSs for large-scale data analysis. It benchmarks open-source Hadoop, an implementation of MapReduce, against two parallel SQL databases, Vertica and a second system from a major relational vendor (DBMS-X), and concludes that the parallel databases clearly outperform Hadoop on the same hardware at 100 nodes. Averaged over 5 tasks on 100 nodes, Vertica was 2.3x faster than DBMS-X, which in turn was 3.2x faster than MapReduce. In general, the parallel SQL DBMSs were significantly faster and required less code to implement each task, but took longer to tune and to load the data. Finally, the paper discusses how the APIs of these two classes of systems are evolving towards each other and ends with a forward-looking note on integrating SQL with MapReduce.

    I found many flaws in this paper. It reads as if it was written by relational database experts who were not effective users of the MapReduce framework, and it turns out that two of the authors were involved in the creation of Vertica. The paper reports its results on 100 nodes and suggests that scaling much beyond that is rarely useful, which is not true. Google, Facebook, Yahoo! and other corporations efficiently run their MapReduce jobs on around 1,000 nodes. This is also evident from the article “PigLatin: A Not So Foreign Language” [4] presented by the Cloud Nine team. As the team presented, PigLatin is used effectively at Yahoo! and is built on Hadoop. They also stated that part of the motivation for building PigLatin was the cost and inflexibility of parallel databases, which even ...... genetic environment.

    Another important factor that I think is missing from Pavlo's paper is cost. The authors never mention cost in the paper.
    MapReduce is designed to run on inexpensive commodity hardware, while DBMSs cannot necessarily run well on such systems.

    In a nutshell, I feel that the authors failed to identify the problem area for their analysis. Their claims were too general and lacked sufficient evidence. It would have been much better if they had run their tests on a selected domain to determine where a DBMS or MapReduce is more effective.

    Works cited
    6. http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/
    7. Health article from the Phoenix team.
    8. Nimbus Team.
    9. http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html
    10. MapReduce Paper
    11. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads