MongoDB and Hadoop: A Step-by Step Tutorial Using-mysql教程-PHP中文网

首页

数据库

mysql教程

MongoDB and Hadoop: A Step-by Step Tutorial Using

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

Jun 07, 2016 pm 04:29 PM

andhadoopmongodb

The following is a guest post from Jeremy Karn. This article is excerpted from MongoDB + Hadoop: A Step-by-Step Tutorial. Jeremy is a cofounder at Mortar Data, a Hadoop-as-a-service provider, and creator of mortar, an open source framework

The following is a guest post from Jeremy Karn. This article is excerpted from ‘MongoDB + Hadoop: A Step-by-Step Tutorial’. Jeremy is a cofounder at Mortar Data, a Hadoop-as-a-service provider, and creator of mortar, an open source framework for data processing.

People who are worried about scalability often find themselves looking at two tools: MongoDB for storing large amounts of data easily and Hadoop for processing that data. But a common question is: “How do I combine these two to really get the most out of my data?”

Here’s a step-by-step tutorial that will get you up and running with MongoDB and Hadoop in a matter of minutes. And the best part about this tutorial is that at the end you’ll be ready to jump right into using your own MongoDB data with Hadoop.

For this tutorial you’ll be using Apache Pig, a high-level data flow language that compiles down into Hadoop MapReduce jobs. It was designed to be easy to learn and simple to write. If you’ve written SQL, Pig will feel familiar, it is like procedural SQL.

To run your Hadoop jobs, you’re going to use a free Mortar account. Mortar provides Hadoop as a service, which means you can run your jobs without worrying about how to set up and manage a multi-node Hadoop cluster.

To get started, we’ve already set up a small MongoDB instance on MongoLab, populated it with a random sampling of Twitter data from a single day (around 120,000 tweets), and created a read-only user for you.

We’ve also set up a public Github repo with a Mortar project that has three Pig scripts ready to run. Here’s what you need to do:

If you don’t already have a free Github account - create one.? You’ll need a github username in step 4.

Sign into (or create) your free Mortar account.
After you receive the confirmation email, log into Mortar at https://app.mortardata.com.
Install?the Mortar Development Framework:?
```
gem install mortar
```

Clone the example git project and register it as a mortar project:?

git clone git@github.com:mortardata/mongo-pig-examples.git

cd mongo-pig-examples

mortar register mongo-pig-examples

Script 1 - Characterize Collection

If you’re like most MongoDB users, you may not have a great sense of the different fields, data types, or values in your collection. We built characterize_collection.pig to deeply inspect your collection to extract that information.

From the base directory of the mongo-pig-examples project you just cloned take a look at pigscripts/characterize_collection.pig. It loads all the data in the collection as a map, sends the map to Python (udfs/python/mongo_util.py) to gather a bunch of metadata, calculates some basic information about the collection, and then it writes the results out to an S3 bucket.

To see this script in action let’s run it on a 4 node Hadoop cluster. In your terminal (from the base directory of your mongo-pig-examples project) run:

mortar run characterize_collection --clustersize 4

This job will take about 10 minutes to finish. You can monitor the job’s status on the command line or by going to https://app.mortardata.com/jobs?

Once the job has finished, you’ll receive an email with a link to your job results. Clicking on this link will bring you into the Mortar web app, where you can download the results from s3. The output is described at the top of the characterize_collection script but as an example you can scroll down the output and find:

…
user.is_translator	2	false	unicode	118806
user.is_translator	2	true	unicode	31
user.lang	26	en	unicode	114108
user.lang	26	es	unicode	3462
user.lang	26	fr	unicode	532
user.lang	26	pt	unicode	281
user.lang	26	ja	unicode	79
user.listed_count	398	0	int	73757
user.listed_count	398	1	int	18518

Looking at the values for user.lang - we see that there are 26 unique values for the field in our dataset. The most common was “en” with 114108 occurrences, the next most common was “es” with 3462 occurrences, and so on. To see the full results without running the job you can view the output file here.

Script 2 - MongoDB Schema Generator

It can be tricky to properly declare MongoDB’s highly nested schemas in Pig. Now, Pig is graceful—it can roll without a schema, or with inconsistent, or incorrect schemas. But it’s easier to read and write your Pig code if you have a schema because it allows you (and the Pig optimizer) to focus on just the relevant data.

So this next script automatically generates a Pig schema by examining your MongoDB collection. If you don’t need the whole schema, you can easily edit it to keep just the fields you want.

Running this script is similar to running the previous one. If you ran the Characterize Collection script in the past hour, the same cluster you used for that job should still be running. In that case, you can just run:

mortar run mongo_schema_generator

If you don’t have a cluster that’s still running, just run the job on a new 4 node cluster like this:

mortar run mongo_schema_generator --clustersize 4

Script 3 – Twitter Hourly Coffee Tweets

Using a Twitter coffee tweets script (pigscripts/hourly_coffee_tweets.pig), we’re going to demonstrate how we can use a small subset of the fields in our MongoDB collection. For our example, we’ll look at how often the word “coffee” is tweeted throughout the day. As with the Mongo Schema Generator script, you can run this job on an existing cluster or start up a new one.

Next Steps

If you already have a mongo instance/cluster based in US-East EC2, the first two example scripts should run on one of your collections with only minor modifications. You’ll just need to:

Update the MongoLoader connection strings in the pig scripts to connect to your MongoDB collections with one of your own users. If your mongo instance is on a non-standard port (any port other than 27017), just email us at support@mortardata.com to allow your Mortar account to access that port.
If you’d like your jobs to write to one of your own S3 buckets, you can update the AWS keys associated with your Mortar account by following these instructions to enable s3 access.
If you run out of free cluster hours with Mortar, you can upgrade your account to get additional free hours each month.
You can find more resources for learning Pig here
If you have any questions or feedback, please contact us at support@mortardata.com or ping us on in-app chat at app.mortardata.com

原文地址：MongoDB and Hadoop: A Step-by Step Tutorial Using , 感谢原作者分享。

声明

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系admin@php.cn

MySQL的位置：数据库和编程Apr 13, 2025 am 12:18 AM

MySQL在数据库和编程中的地位非常重要，它是一个开源的关系型数据库管理系统，广泛应用于各种应用场景。1）MySQL提供高效的数据存储、组织和检索功能，支持Web、移动和企业级系统。2）它使用客户端-服务器架构，支持多种存储引擎和索引优化。3）基本用法包括创建表和插入数据，高级用法涉及多表JOIN和复杂查询。4）常见问题如SQL语法错误和性能问题可以通过EXPLAIN命令和慢查询日志调试。5）性能优化方法包括合理使用索引、优化查询和使用缓存，最佳实践包括使用事务和PreparedStatemen

MySQL：从小型企业到大型企业Apr 13, 2025 am 12:17 AM

MySQL适合小型和大型企业。1)小型企业可使用MySQL进行基本数据管理，如存储客户信息。2)大型企业可利用MySQL处理海量数据和复杂业务逻辑，优化查询性能和事务处理。

幻影是什么读取的，InnoDB如何阻止它们（下一个键锁定）？Apr 13, 2025 am 12:16 AM

InnoDB通过Next-KeyLocking机制有效防止幻读。1）Next-KeyLocking结合行锁和间隙锁，锁定记录及其间隙，防止新记录插入。2）在实际应用中，通过优化查询和调整隔离级别，可以减少锁竞争，提高并发性能。

mysql：不是编程语言，而是...Apr 13, 2025 am 12:03 AM

MySQL不是一门编程语言，但其查询语言SQL具备编程语言的特性：1.SQL支持条件判断、循环和变量操作；2.通过存储过程、触发器和函数，用户可以在数据库中执行复杂逻辑操作。

MySQL：世界上最受欢迎的数据库的简介Apr 12, 2025 am 12:18 AM

MySQL是一种开源的关系型数据库管理系统，主要用于快速、可靠地存储和检索数据。其工作原理包括客户端请求、查询解析、执行查询和返回结果。使用示例包括创建表、插入和查询数据，以及高级功能如JOIN操作。常见错误涉及SQL语法、数据类型和权限问题，优化建议包括使用索引、优化查询和分表分区。

MySQL的重要性：数据存储和管理Apr 12, 2025 am 12:18 AM

MySQL是一个开源的关系型数据库管理系统，适用于数据存储、管理、查询和安全。1.它支持多种操作系统，广泛应用于Web应用等领域。2.通过客户端-服务器架构和不同存储引擎，MySQL高效处理数据。3.基本用法包括创建数据库和表，插入、查询和更新数据。4.高级用法涉及复杂查询和存储过程。5.常见错误可通过EXPLAIN语句调试。6.性能优化包括合理使用索引和优化查询语句。