Monday, November 2, 2009

MapReduce, from a developer's perspective.

After learning about Google's MapReduce, I found there was a lot more to think about beyond the concept itself. Have you ever wondered why it's used?

If the MapReduce concept were to be compared to something, it would be a database management system, only at a lower level. At times during development, there are huge datasets stored in the database that need to be processed. Say I need to process a really large dataset: I can choose to run everything on one computer, or run it in a distributed manner as MapReduce proposes.
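To make the idea a bit more concrete, here is a minimal sketch of the map, shuffle and reduce steps as a word count in plain Python. It is only an illustration and not Google's implementation: the function names (`map_chunk`, `shuffle`, `reduce_counts`, `word_count`) are my own, and everything runs on a single machine, with each chunk standing in for the piece of data a separate node would handle.

```python
from collections import defaultdict


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # On a real cluster each chunk would be mapped on a different node.
    # Here (an assumption for illustration) the chunks are simply
    # processed one after another on the same machine.
    mapped = [pair for chunk in chunks for pair in map_chunk(chunk)]
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    chunks = ["the quick brown fox", "the lazy dog", "The fox"]
    print(word_count(chunks))
```

The point of the split is that `map_chunk` never needs to see any other chunk, so in principle the chunks can be handed to as many machines as you have.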

The important thing about doing it the MapReduce way is that I NEED MACHINES! The concept is no good if I do not have machines or nodes to do the processing, be they virtual or physical. If I don't have the resources, I won't be able to exploit it. While I can propose that my company purchase 100 or 1,000 machines, they might not want to spend money on an array of computers just for me to do processing. Think about the space, the maintenance, the heat (which calls for a better cooling system), the network bandwidth taken up and the power drawn by these machines.

So, thinking in the other direction: instead of breaking your task down into smaller parts and giving them to a lot of computers, why not increase the processing power you have on one single machine? Buy a supercomputer? No.

Clustering through cloud computing.

Cloud computing basically adopts a utility model where you only pay for the computation time and power you use. Because the machines are virtual, it is easy to scale up and add processing power and resources. Think of it as getting many computers to group together and behave like one big supercomputer that does your calculations. Cloud computing saves companies money by reducing the need to purchase and maintain assets, and it makes life easier because it is easily scalable.

While cloud computing is a viable alternative, I honestly have no idea which one will be more economical in the long run. If you have information to process on a very regular basis, it might work out cheaper to own your own computer farm. If you only need to process your datasets a few times, it might be cheaper to buy cloud computing services. Ultimately, it's about choosing the right tool for the right job.
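For anyone who wants to play with the trade-off, a rough back-of-the-envelope comparison could look like the sketch below. Every number in it is made up purely for illustration (the hourly rate, the purchase price, the upkeep per node), and the cost model is a deliberate simplification of my own, so treat it as a way to plug in your own figures rather than as real pricing.

```python
def cloud_cost(months, jobs_per_month, hours_per_job, nodes, rate_per_node_hour):
    """Pay-per-use: you are billed only for the node-hours actually consumed."""
    return months * jobs_per_month * hours_per_job * nodes * rate_per_node_hour


def owned_cost(months, nodes, purchase_per_node, upkeep_per_node_month):
    """Own farm: pay up front, then pay upkeep (power, cooling, space) monthly."""
    return nodes * (purchase_per_node + months * upkeep_per_node_month)


def cheaper_option(months, jobs_per_month, hours_per_job, nodes,
                   rate_per_node_hour, purchase_per_node, upkeep_per_node_month):
    """Return "cloud" or "owned", whichever costs less over the period."""
    cloud = cloud_cost(months, jobs_per_month, hours_per_job, nodes,
                       rate_per_node_hour)
    owned = owned_cost(months, nodes, purchase_per_node, upkeep_per_node_month)
    return "cloud" if cloud <= owned else "owned"


if __name__ == "__main__":
    # Hypothetical figures, NOT real prices: 100 nodes over 3 years,
    # 10-hour jobs, 0.10 per node-hour in the cloud, 1000 per machine
    # to buy and 20 per machine per month to keep running.
    common = dict(months=36, hours_per_job=10, nodes=100,
                  rate_per_node_hour=0.10, purchase_per_node=1000,
                  upkeep_per_node_month=20)
    print("2 jobs a month: ", cheaper_option(jobs_per_month=2, **common))
    print("60 jobs a month:", cheaper_option(jobs_per_month=60, **common))
```

With these invented figures the occasional workload favours the cloud and the heavy, regular one favours owning the machines, which matches the intuition above, but different inputs can easily flip the answer.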
