This article walks through implementing a Java web crawler on Hadoop, with a worked example; it is shared here as a reference.
This crawler touches on big data. Building on the two earlier articles on implementing a web crawler in Java and implementing a web crawler with Heritrix, this time we complete the full pipeline: data collection, data upload, data analysis, reading the results, and data visualization.
You will need:
Cygwin: a UNIX-like environment that runs on Windows; search for it online, download it, and install it;
Hadoop: configure the Hadoop environment and set up the Hadoop Distributed File System (HDFS), so that the collected data can be uploaded and saved directly to HDFS and then analyzed with MapReduce;
Eclipse: for writing the code; you need to import the Hadoop jar packages to create a MapReduce project;
Jsoup: an HTML parsing jar package which, combined with regular expressions, makes it easy to extract the useful parts of a page's source code.
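As a quick illustration of how jsoup and regular expressions can work together, here is a minimal sketch; the URL and the link pattern are only assumptions for the example, not part of the project below:

import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupRegexDemo {
    public static void main(String[] args) throws Exception {
        // Fetch and parse a page with jsoup (example URL)
        Document doc = Jsoup.connect("http://news.sina.com.cn/").get();
        // Keep only links that look like static news pages (assumed pattern)
        Pattern newsLink = Pattern.compile(".*\\.s?html$");
        for (Element a : doc.select("a[href]")) {
            String href = a.attr("abs:href");
            if (newsLink.matcher(href).matches()) {
                System.out.println(a.text() + " -> " + href);
            }
        }
    }
}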
Directory:
1. Install and configure Cygwin
2. Configure the Hadoop environment
3. Eclipse development environment setup
4. Crawling network data (jsoup)
1. Install and configure Cygwin
Download the Cygwin installer from the official site at https://cygwin.com/install.html. After downloading and running it, you reach the installation interface. Install the expansion packages directly from a network mirror during installation; at a minimum, select the ssh and ssl support packages.
After installation, open the Cygwin console and run the ssh-host-config command to install SSH. Answer the prompts: no, yes, ntsec, no, no.
Note: under Windows 7 the answers need to be: yes, yes, ntsec, no, yes, then enter a password and confirm it.
After this step is completed, a Cygwin sshd service is registered in the Windows operating system and the service can be started.
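To start the service, a minimal sketch assuming ssh-host-config registered it under the default name sshd (run from a Windows command prompt or the Cygwin console):

net start sshd

If the name differs on your system, check the Windows services list for the CYGWIN sshd entry.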
2. Configure the Hadoop environment
Modify the hadoop-env.sh file and add a JAVA_HOME setting pointing to the JDK installation directory:
# The java implementation to use. Required.
export JAVA_HOME=/cygdrive/c/Java/jdk1.7.0_67
Note: Program Files is abbreviated as PROGRA~1 in such paths.
Modify hdfs-site.xml and set the number of replicas to 1 (because this is a pseudo-distributed configuration):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Note: one extra property is added here as well; it works around possible permission problems (see below).
HDFS: Hadoop Distributed File System
In HDFS you can dynamically create, read, update, and delete files and folders through commands, for example:
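A few of the standard hadoop fs shell commands, shown as a sketch (the paths are examples; run them from the Hadoop directory in Cygwin):

bin/hadoop fs -mkdir /input            # create a folder
bin/hadoop fs -put local.txt /input    # upload a local file
bin/hadoop fs -ls /input               # list a folder
bin/hadoop fs -cat /input/local.txt    # print a file
bin/hadoop fs -rm /input/local.txt     # delete a file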
Note that permission problems may occur; they can be avoided by configuring the following in hdfs-site.xml:
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
Modify mapred-site.xml to set the server and port where the JobTracker runs (since it currently runs on this machine, just write localhost; the port can be bound to any free port):
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
Configure core-site.xml with the server and port number of the HDFS file system (also on the current host):
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
After completing the above configuration, go to the Hadoop directory in Cygwin.
In the bin directory, format the HDFS file system (it must be formatted before first use) and then run the startup command; a sketch of the usual commands follows.
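A minimal sketch, assuming the Hadoop 1.x scripts live under bin/ as in this setup (adjust the paths to your installation):

bin/hadoop namenode -format    # format HDFS before the first start
bin/start-all.sh               # start NameNode, DataNode, JobTracker and TaskTracker

You can then run the JDK's jps command to check that the daemons are running.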
3. Eclipse development environment setup
The general configuration method is given in my earlier post, Big Data [2] HDFS deployment and file reading and writing (including Eclipse Hadoop configuration), but it needs a bit more work here. Copy the hadoop-eclipse-plugin.jar support package from Hadoop into Eclipse's plugin directory to add Hadoop support to Eclipse.
After starting Eclipse, switch to the MapReduce perspective.
In the Window menu, choose Show View > Other and search for Map/Reduce Locations.
In the Map/Reduce Locations view, create a Hadoop Location so that Eclipse can connect to Hadoop.
Note: the two ports here must be the ports you set when configuring Hadoop!
Once this is done, a Hadoop Location is created.
In the DFS Locations tree on the left you can also see the directories in HDFS,
and you can freely create folders under them to store and retrieve data.
Now you can create a MapReduce project, in the same way as creating an ordinary project.
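By way of illustration, here is a minimal sketch of a job such a project could contain: a word count over the news texts that the crawler in the next section saves to HDFS. It is a generic example, not part of the original article; the class names and HDFS paths are assumptions, and it uses the classic Hadoop 1.x mapreduce API.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewsWordCount {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split each line of the saved news text into words
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the occurrences of each word
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "news word count");
        job.setJarByClass(NewsWordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Paths are examples: read the crawled texts, write the counts to a new folder
        FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/input"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:9000/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}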
4. Crawling network data
Now we write a program that saves the useful information from the crawled news pages into HDFS.
At this point there are two ways to crawl:
one is to use the data fetched with the Heritrix tool;
the other is a web crawler written in Java code together with jsoup.
Saving the data from method one to HDFS:
Read the locally generated files directly and parse the HTML with jsoup; the jsoup jar package needs to be imported into the project.
package org.liky.sina.save;

// The jsoup library is used here; it makes it easy to extract the useful information from HTML
import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SinaNewsData {

    private static Configuration conf = new Configuration();
    private static FileSystem fs;
    private static Path path;
    private static int count = 0;

    public static void main(String[] args) {
        parseAllFile(new File(
                "E:/heritrix-1.12.1/jobs/sina_news_job_02-20170814013255352/mirror/"));
    }

    public static void parseAllFile(File file) {
        // Check whether this is a directory or a file
        if (file.isDirectory()) {
            // Directory: recurse into every entry
            File[] allFile = file.listFiles();
            if (allFile != null) {
                for (File f : allFile) {
                    parseAllFile(f);
                }
            }
        } else {
            // File: only parse HTML pages
            if (file.getName().endsWith(".html")
                    || file.getName().endsWith(".shtml")) {
                parseContent(file.getAbsolutePath());
            }
        }
    }

    public static void parseContent(String filePath) {
        try {
            // Read and parse the file with jsoup
            Document doc = Jsoup.parse(new File(filePath), "utf-8");
            // Read the title
            String title = doc.title();
            Elements descElem = doc.getElementsByAttributeValue("name", "description");
            Element descE = descElem.first();
            // Read the content from the description meta tag (may be missing)
            String content = (descE != null) ? descE.attr("content") : null;
            if (title != null && content != null) {
                // Save the data to HDFS via a Path
                path = new Path("hdfs://localhost:9000/input/"
                        + System.currentTimeMillis() + ".txt");
                fs = path.getFileSystem(conf);
                // Create the output stream
                FSDataOutputStream os = fs.create(path);
                // Write the title and content
                os.writeChars(title + "\r\n" + content);
                os.close();
                count++;
                System.out.println("Finished " + count + " files!");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
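To check that the files actually reached HDFS, you can read them back with the same FileSystem API. The following is a minimal sketch under the same assumptions as above; the HDFS URI and file name are examples:

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadBackCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // List everything the crawler wrote under /input (URI is an example)
        Path input = new Path("hdfs://localhost:9000/input/");
        FileSystem fs = input.getFileSystem(conf);
        for (FileStatus status : fs.listStatus(input)) {
            System.out.println(status.getPath());
        }
        // Dump one file to the console
        // Note: the crawler used writeChars, which writes UTF-16, so the raw dump may contain extra null bytes
        InputStream in = null;
        try {
            in = fs.open(new Path("hdfs://localhost:9000/input/example.txt"));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}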
The above is the detailed content of Introduction to the implementation method of Java web crawler in hadoop. For more information, please follow other related articles on the PHP Chinese website!
