Adding Filter in Hadoop Mapper Class

WBOY

Jun 07, 2016, 04:30 PM

Here is my solution to the disk space shortage problem I described in the previous post. The core idea is to reduce the number of output records at the Mapper stage. The method I used is a filter, explained below, which decreases the Mapper's output records; that in turn significantly decreases the Mapper's spilled records and, fundamentally, the disk space used. After applying the filter, with a data set of 30,661 records (about 200MB) as input, the total Spilled Records count is 25,471,725, and the job takes only about 509MB of disk space!

Followed Filter

Now I'm going to show what kind of filter it is and how I implemented it. The filter is called the Followed Filter: it excludes users from the co-followed combination computation if their followed number (the number of users who follow them) does not reach a certain value, called the Followed Threshold.

The Followed Filter is used to reduce the co-followed combinations at the Mapper stage. Say we set the followed threshold to 100: users who do not have at least 100 fans (that is, who are not followed by at least 100 other users) will be ignored during the co-followed combination computing stage. (To pick an actual threshold, we need to analyze the statistics of users' followed numbers in our data set.)
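
To make the effect concrete, here is a minimal sketch in plain Java (no Hadoop classes). It assumes each input record is the list of users that one person follows and that the Mapper emits every unordered pair from that list; the post does not spell out the record layout, so the class name, method name, and pair format below are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FollowedFilterSketch {

    // Emit every unordered pair of followees that pass the filter.
    // `followees`: users followed by one person (assumed record layout).
    // `popularUsers`: users whose followed number meets the threshold.
    static List<String> coFollowedPairs(List<String> followees, Set<String> popularUsers) {
        List<String> kept = new ArrayList<>();
        for (String u : followees) {
            if (popularUsers.contains(u)) {
                kept.add(u); // user passes the followed filter
            }
        }
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < kept.size(); i++) {
            for (int j = i + 1; j < kept.size(); j++) {
                pairs.add(kept.get(i) + "," + kept.get(j));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> followees = Arrays.asList("a", "b", "c", "d");
        Set<String> everyone = new HashSet<>(followees);
        Set<String> popular = new HashSet<>(Arrays.asList("a", "c"));
        System.out.println(coFollowedPairs(followees, everyone).size()); // 6
        System.out.println(coFollowedPairs(followees, popular));         // [a,c]
    }
}
```

A list of n followees yields n(n-1)/2 pairs, so dropping unpopular users before pairing shrinks the Mapper output much faster than linearly.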

Reason

Choosing a followed filter is reasonable because the number of followers a user has is a metric of that user's popularity/fame.

How

In order to accomplish it, we need:

First, count each user's followed number in our data set, which needs a new MapReduce job;

Second, choose a followed threshold after analyzing the statistics of the followed-number data set obtained in the first step;

Third, use Hadoop's DistributedCache to ship the users who satisfy the filter to all Mappers;

Fourth, add the followed filter to the Mapper class, so that only users who satisfy the filter condition are passed into the co-followed combination computing phase;

Fifth, add a co-followed filter/threshold on the Reducer side if necessary.
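
The first three steps can be sketched in plain Java as follows. This is only an in-memory stand-in under stated assumptions: the counting loop replaces the separate MapReduce job, the returned set is what would be shipped to the Mappers through DistributedCache, and "satisfies the filter" is taken to mean a followed number greater than or equal to the threshold. The class name, method names, and sample data are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FollowedCountSketch {

    // Step 1: count how many times each user appears as a followee.
    // In the real job this is a separate MapReduce pass.
    static Map<String, Integer> countFollowed(List<List<String>> followeeLists) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> followees : followeeLists) {
            for (String u : followees) {
                counts.merge(u, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Steps 2-3: keep the users whose followed number reaches the threshold
    // (assumed to be an inclusive comparison). This set is what every
    // Mapper would load from the cache.
    static Set<String> usersPassing(Map<String, Integer> counts, int followedThreshold) {
        Set<String> passing = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= followedThreshold) {
                passing.add(e.getKey());
            }
        }
        return passing;
    }

    public static void main(String[] args) {
        // Each inner list: the users that one person follows (sample data).
        List<List<String>> data = Arrays.asList(
            Arrays.asList("a", "b", "c"),
            Arrays.asList("a", "c"),
            Arrays.asList("a", "d"));
        Map<String, Integer> counts = countFollowed(data);
        System.out.println(usersPassing(counts, 2)); // [a, c]
    }
}
```

Keeping only the set of passing users, rather than the full count table, keeps the cached file small, which matters because every Mapper loads it into memory.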

Outcomes

Here is the Hadoop Job Summary after applying the followed filter with a followed threshold of 1000, meaning that only users who are followed by at least 1000 other users take part in the co-followed combinations. Compared with the Job Summary in my previous post, almost all metrics show significant improvements:

Counter	Map	Reduce	Total
Bytes Written	0	1,798,185	1,798,185
Bytes Read	203,401,876	0	203,401,876
FILE_BYTES_READ	405,219,906	52,107,486	457,327,392
*HDFS_BYTES_READ*	*203,402,751*	0	203,402,751
*FILE_BYTES_WRITTEN*	457,707,759	52,161,704	*509,869,463*
HDFS_BYTES_WRITTEN	0	1,798,185	1,798,185
Reduce input groups	0	373,680	373,680
Map output materialized bytes	52,107,522	0	52,107,522
Combine output records	22,202,756	0	22,202,756
*Map input records*	*30,661*	0	30,661
Reduce shuffle bytes	0	52,107,522	52,107,522
Physical memory (bytes) snapshot	2,646,589,440	116,408,320	2,762,997,760
*Reduce output records*	0	*373,680*	373,680
*Spilled Records*	*22,866,351*	2,605,374	25,471,725
Map output bytes	2,115,139,050	0	2,115,139,050
Total committed heap usage (bytes)	2,813,853,696	84,738,048	2,898,591,744
CPU time spent (ms)	5,766,680	11,210	5,777,890
Virtual memory (bytes) snapshot	9,600,737,280	1,375,002,624	10,975,739,904
SPLIT_RAW_BYTES	875	0	875
*Map output records*	*117,507,725*	0	117,507,725
Combine input records	137,105,107	0	137,105,107
Reduce input records	0	2,605,374	2,605,374
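
As a quick sanity check on the table, the Total column for the two headline counters is simply the Map and Reduce columns added together. The small Java sketch below (class and method names are hypothetical) recomputes both totals and converts the bytes written into megabytes, using only figures from the table above.

```java
class JobSummaryCheck {

    // Total column = Map column + Reduce column.
    static long total(long map, long reduce) {
        return map + reduce;
    }

    // Decimal megabytes (1 MB = 1,000,000 bytes).
    static double toMegabytes(long bytes) {
        return bytes / 1_000_000.0;
    }

    // Binary mebibytes (1 MiB = 1,048,576 bytes).
    static double toMebibytes(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long spilled = total(22_866_351L, 2_605_374L);   // Spilled Records
        long written = total(457_707_759L, 52_161_704L); // FILE_BYTES_WRITTEN
        System.out.println(spilled); // 25471725
        System.out.println(written); // 509869463
        System.out.println(toMegabytes(written) + " MB");
        System.out.println(toMebibytes(written) + " MiB");
    }
}
```

The "about 509MB" figure quoted earlier corresponds to the decimal reading of FILE_BYTES_WRITTEN; in binary units the same total is a little over 486 MiB.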

P.S.

Frankly speaking, chances are I am going about Hadoop programming the wrong way, since I'm playing with pseudo-distributed Hadoop on my personal computer, which has 4 CPUs and 4GB of RAM. In a real Hadoop cluster, disk space might never be a problem, and all the tuning work I have done may turn out to be meaningless effort. Before the Followed Filter, I also did some Hadoop tuning, such as a custom Writable class, a RawComparator, the block size, and io.sort.mb.

---EOF---

Original article: Adding Filter in Hadoop Mapper Class. Thanks to the original author for sharing.
