On to the main gist of the seminar. A lot of the discussion was about the technical side of the systems that handle the data flowing into and out of Amazon and Google. Since these two companies are among the biggest internet companies today, the amount of data entering their systems at any one time is huge simply because of the number of users. Their engineers therefore had to come up with something to handle the problem of scale, a level of scale that had never been seen before, since the web used to be much smaller than it is now.
I will first touch on Google. Google came up with a system called Google MapReduce, a programming model for processing and generating large data sets. There are basically two stages of data processing in this system: MAP and REDUCE. Map is the stage where parallelization happens: the huge amount of input data is broken into smaller parts that are processed on different nodes, which increases efficiency and, most importantly, speed, and each map task emits intermediate key-value pairs. After that, REDUCE comes into play: all the intermediate values that share the same key are merged together to form the final output. The whole process is run by a master machine, which hands out the map tasks to the various slave machines (or nodes) and then assigns the reduce tasks to another set of machines. They have also improved it by ensuring no nodes sit idle at any one time, which speeds things up and ensures that resources are fully utilized.
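To make the two stages concrete, here is a minimal single-machine sketch of word counting in Python. The function names and the data are my own illustration, not Google's code; a real MapReduce job distributes these same steps across thousands of machines.

```python
from collections import defaultdict

def map_fn(document):
    # MAP: emit an intermediate (key, value) pair for every word seen.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # REDUCE: merge all values that share the same key into one final value.
    return (word, sum(counts))

documents = ["to be or not to be", "to see or not to see"]

# "Shuffle" step: group the intermediate pairs by key before reducing.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_fn(doc):
        grouped[word].append(count)

results = [reduce_fn(word, counts) for word, counts in grouped.items()]
print(results)  # e.g. [('to', 4), ('be', 2), ('or', 2), ('not', 2), ('see', 2)]
```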
The diagram above sums up the process of MapReduce. Since Google receives all sorts of data, MapReduce is something of a general computational system: it takes in many kinds of data and solves fairly general problems. The idea behind MapReduce is abstraction, essentially divide and conquer, and also to trade off bandwidth against delay.
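As a rough picture of the divide-and-conquer idea, the sketch below splits the input into chunks, counts words in each chunk on a separate worker process, and then merges the partial results. The chunking, the worker count, and the function names are all my own toy example, not how Google actually schedules tasks across its cluster.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def map_chunk(lines):
    # Each "worker node" counts words in its own slice of the input.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(partials):
    # Merge the per-chunk partial counts into the final answer.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the quick dog"]
    chunks = [lines[0::2], lines[1::2]]               # divide the input in two
    with ProcessPoolExecutor(max_workers=2) as pool:  # two parallel workers
        partials = list(pool.map(map_chunk, chunks))
    print(reduce_counts(partials))  # Counter({'the': 3, 'quick': 2, 'dog': 2, ...})
```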
On the other hand, Amazon came up with a different idea and called it Amazon Dynamo. Amazon is one of the top internet companies today, and there are a couple of reasons why. Since Amazon benefits from economies of scale, it can offer bigger discounts, and people also tend to shop online because it is so convenient. However, there are downsides. As the company grew, it needed to manage its inventory and storage better, and with more and more data being transferred in and out, it had to come up with a way to handle this large influx of data. In contrast to Google MapReduce, Dynamo is a highly specialised, highly optimised system built to solve very specific problems. This is because Amazon mainly handles goods being sold online; its scope is narrower than Google's, so it could devise a system that fits its needs more closely. Amazon realised that its business requirement was specific: data has to be highly available, both for the end users and for Amazon itself. Relational databases were used in the past, but as Amazon grew larger they became too slow, and their general-purpose features are more than Amazon needs, since Amazon mostly just needs a 'PUT' operation and a 'GET' operation. When Dynamo was put in place, it was optimised so well for Amazon's workload that the site became fast and user-friendly, which improved Amazon's efficiency.
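To show how small that put/get interface really is, here is a toy in-memory key-value store in Python. The class name, the hashing scheme, and the shopping-cart key are my own simplification and not Amazon's actual API; the real Dynamo also replicates each key across several servers and handles failures, which this sketch ignores.

```python
import hashlib

class TinyKeyValueStore:
    """Toy store that spreads keys across a fixed set of 'nodes' by hashing."""

    def __init__(self, num_nodes=3):
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        # Hash the key to decide which node owns it (a much simplified
        # stand-in for Dynamo-style partitioning).
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)

store = TinyKeyValueStore()
store.put("cart:alice", ["book", "kettle"])  # e.g. storing a shopping cart
print(store.get("cart:alice"))               # ['book', 'kettle']
```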
Both Google MapReduce and Amazon Dynamo solved the problem of scale. If it had been left unresolved, bigger problems would have arisen and the companies might even have hit a wall in their growth. These two companies have top-notch engineers who came up with such good systems, and that is why Google and Amazon are such big companies now.
Comment: "Map is a process whereby parallelization is done. This breaks the huge amount of data into smaller parts for processing on different nodes so as to increase efficiency and, most importantly, speed." I think the mapping process is not only about parallelization; it also simplifies the raw data down to a smaller scale that makes it easier to process.