Tech Tips: Differences Between Spark and Hadoop MapReduce

Sunday, June 12, 2016

-- Apache Hadoop logo and Spark logo [1, 2] --

Comparison Summary

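	Hadoop MapReduce	Spark
Speed	Fast	Reported to be 10-100x faster than MapReduce
Data	Stored on disk; disk I/O takes a lot of time, so latency is high	Stored in memory; latency is low
Real-time analysis	Designed for batch processing, so not well suited	Supports distributed processing of streaming data
Iterative algorithms	Poorly suited, because every iteration must read its input from disk and write its output to disk	Fast, because intermediate results are cached and multiple iterations run against the cache
Graph algorithms	No built-in mechanism for passing messages about adjacent nodes	Includes GraphX, a graph algorithm library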

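Speed
MapReduce did not make effective use of the memory available in a Hadoop cluster.
Spark uses RDDs (Resilient Distributed Datasets), which let it keep data in memory and write to disk only when necessary.
As a result, Spark is considerably faster than Hadoop MapReduce.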

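Data
Hadoop stores data on disk, whereas Spark keeps it in memory.
Spark uses a data storage model called the RDD (Resilient Distributed Dataset). RDDs provide fault tolerance in a way that minimizes network I/O. If part of an RDD's data is lost, it is rebuilt from its lineage (the history of operations applied to the data). This makes replication for the sake of fault tolerance unnecessary.
Hadoop, by contrast, relies on replication for fault tolerance.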
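The lineage idea above can be sketched in plain Python. This is a toy, single-process model and not the Spark API: the class `ToyRDD` and its methods `from_source`, `lose_partition`, and the `computations` counter are names invented for this illustration, and the source data is assumed to be durable.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """A toy, single-process stand-in for an RDD (not the Spark API)."""

    def __init__(self, num_partitions: int,
                 compute: Callable[[int], List[int]]):
        # compute(i) builds partition i from its lineage on demand.
        self.num_partitions = num_partitions
        self._compute = compute
        self._cache: List[Optional[List[int]]] = [None] * num_partitions
        self.computations = 0  # how many partitions were (re)built

    @classmethod
    def from_source(cls, partitions: List[List[int]]) -> "ToyRDD":
        # Assumption: the source data is durable (e.g. a file on disk).
        return cls(len(partitions), lambda i: list(partitions[i]))

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # Record the transformation instead of copying the data:
        # the child only remembers its parent and the function.
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in self.partition(i)])

    def partition(self, i: int) -> List[int]:
        # Serve from memory if present, otherwise replay the lineage.
        if self._cache[i] is None:
            self._cache[i] = self._compute(i)
            self.computations += 1
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        # Simulate a worker failure that drops one in-memory partition.
        self._cache[i] = None

    def collect(self) -> List[int]:
        return [x for i in range(self.num_partitions)
                for x in self.partition(i)]


source = ToyRDD.from_source([[1, 2], [3, 4], [5, 6]])
squared = source.map(lambda x: x * x)

before = squared.collect()
squared.lose_partition(1)   # only partition 1 is lost
after = squared.collect()   # only partition 1 is rebuilt
```

In this sketch the second `collect()` rebuilds just the lost partition by replaying `map` on its parent partition, so no second copy of the data had to be kept anywhere.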

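Real-time data analysis
When analyzing Twitter data, for example, you need to process events that arrive at a rate of millions per second. One of Spark's advantages is its support for distributed processing of data streams. With the Spark Streaming library, which ships with Spark, you can write a streaming job in the same way you would write a batch job.
MapReduce, by contrast, was designed for distributed batch processing, so it is poorly suited to real-time analysis.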
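One reason batch code carries over to streaming is that Spark Streaming treats a stream as a sequence of small batches. The following is a minimal plain-Python sketch of that idea, not Spark Streaming code: `count_hashtags` and `micro_batches` are hypothetical helpers, the tweets are made-up sample data, and batches are cut by event count here, whereas Spark Streaming cuts them by time interval.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_hashtags(tweets: Iterable[str]) -> Counter:
    """Logic shared by the batch job and the streaming job."""
    counts: Counter = Counter()
    for tweet in tweets:
        for word in tweet.split():
            if word.startswith("#"):
                counts[word.lower()] += 1
    return counts


def micro_batches(stream: Iterable[str],
                  batch_size: int) -> Iterator[List[str]]:
    """Cut an incoming stream into small batches (by count, not by time)."""
    batch: List[str] = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final, possibly smaller batch


tweets = [
    "learning #Spark today",
    "#spark streaming is neat",
    "old #Hadoop job still running",
    "#hadoop and #spark together",
]

# Batch job: process the whole dataset at once.
batch_result = count_hashtags(tweets)

# Streaming job: apply the same function to each micro-batch and
# merge the partial results into a running total.
running_total: Counter = Counter()
for batch in micro_batches(iter(tweets), batch_size=2):
    running_total.update(count_hashtags(batch))
```

Both paths call the same `count_hashtags` function and end with the same totals; only the way the input is fed in differs.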

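Iterative Algorithm
Many data analysis algorithms are iterative algorithms, meaning they repeat the same computation many times. Examples include k-means, LDA, and PageRank.
With Hadoop, the result of each iteration has to be written to disk and then read back from disk in the next iteration, which makes it hard to run iterative algorithms quickly.
Spark can keep the result of each iteration in memory, so these computations run fast. Spark also ships with MLlib, a library for machine learning workloads.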
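The difference can be illustrated with a small 1-D k-means written in plain Python. This is a sketch under simplifying assumptions, not Spark or MLlib code: `load_points` is a hypothetical stand-in for an expensive disk read, the data and starting centers are made up, and only the repeated reads are modeled (a real MapReduce job would also write each iteration's output to disk).

```python
from typing import Callable, List

reads = {"count": 0}  # counts how often the "disk" is read


def load_points() -> List[float]:
    """Stand-in for an expensive read from disk (hypothetical data)."""
    reads["count"] += 1
    return [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]


def kmeans_step(points: List[float], centers: List[float]) -> List[float]:
    """One 1-D k-means iteration: assign points, then average each cluster."""
    clusters: List[List[float]] = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    # Keep the old center if a cluster ends up empty.
    return [sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)]


def run(get_points: Callable[[], List[float]], iterations: int) -> List[float]:
    centers = [0.0, 5.0]
    for _ in range(iterations):
        centers = kmeans_step(get_points(), centers)
    return centers


# MapReduce-style: every iteration reads its input again.
reads["count"] = 0
disk_centers = run(load_points, iterations=5)
disk_reads = reads["count"]

# Spark-style: load once, keep the data in memory, iterate over the cache.
reads["count"] = 0
cached = load_points()
memory_centers = run(lambda: cached, iterations=5)
memory_reads = reads["count"]
```

Both variants arrive at the same centers, but the first one reads the input once per iteration while the second reads it once in total. With real data sizes, that repeated I/O is what dominates the running time.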

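Graph Algorithm
Many algorithms that operate on graph-structured data need information about adjacent nodes. PageRank, for example, needs the PageRank values of the nodes that link to the node being computed.
Hadoop MapReduce provides no built-in facility for passing messages about adjacent nodes. With Spark, the bundled GraphX library makes it possible to compute graph algorithms efficiently. Spark delivers messages using a combination of Netty and Akka (the Akka dependency was removed in Spark 2.0).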
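To make the role of neighbor messaging concrete, here is PageRank written as explicit message passing in plain Python. It is a minimal sketch and not GraphX code: the function `pagerank`, the three-page `graph`, the damping factor of 0.85, and the fixed 20 iterations are all assumptions chosen for illustration, and the graph is assumed to have no pages without outgoing links.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], iterations: int = 20,
             damping: float = 0.85) -> Dict[str, float]:
    """PageRank as message passing between neighbouring nodes."""
    nodes = list(links)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # 1. Send: each node sends rank / out_degree to every page it links to.
        inbox = {node: 0.0 for node in nodes}
        for node, targets in links.items():
            for target in targets:
                inbox[target] += ranks[node] / len(targets)
        # 2. Receive: each node combines its messages into a new rank.
        ranks = {node: (1 - damping) / n + damping * inbox[node]
                 for node in nodes}
    return ranks


# Hypothetical three-page web; every page has at least one outgoing link.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Every iteration is a round of "send to neighbors, then combine what arrived". In MapReduce each such round would be a separate job with its state written to disk in between, which is the overhead a graph library with built-in messaging avoids.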

References
[1] Apache Hadoop logo, Apache Software Foundation - https://svn.apache.org/repos/asf/hadoop/logos/out_rgb/, Apache License 2.0
[2] Spark Logo, Spark project team - Spark open source project - UC Berkeley, Apache License 2.0
[3] Apache Spark vs Hadoop MapReduce
[4] What is the difference between Apache Spark and Apache Hadoop (Map-Reduce) ? - Quora
