search
HomeJavajavaTutorialIntroduction to the implementation method of Java web crawler in hadoop

The following editor will bring you an article on implementing java web crawler in hadoop (example explanation). The editor thinks it’s pretty good, so I’ll share it with you now and give it as a reference. Let’s follow the editor and take a look.

The implementation of this web crawler will be related to big data. Based on the first two articles about implementing web crawlers in Java and implementing web crawlers in heritrix, this time we need to complete a complete data collection, data upload, data analysis, data result reading, and data visualization.

You need to use

Cygwin: a UNIX-like simulation environment running on the windows platform, directly search and download it online, and install it;

Hadoop : Configure the Hadoop environment and implement a distributed file system (Hadoop Distributed File System), referred to as HDFS, which is used to directly upload and save the collected data to HDFS, and then analyze it with MapReduce;

Eclipse: Write code, You need to import the hadoop jar package to create a MapReduce project;

Jsoup: HTML parsing jar package, combined with regular expressions, can better parse web page source code;

----- >

Directory:

1. Configure Cygwin

##2. Configuration Hadoop Huang Jing

3. Eclipse development environment setup

4. Network data crawling (jsoup)

-------->

1. Install and configure Cygwin

Download the Cygwin installation file from the official website, address: https ://cygwin.com/install.html

After downloading and running, enter the installation interface.

Download the expansion package directly from the network image during installation. At least you need to select the ssh and ssl support package

After installation, enter the cygwin console interface,

Run ssh-host- config command, install SSH

Input: no, yes, ntsec, no, no

Note: under win7, it needs to be changed to yes, yes, ntsec, no, yes, enter the password and confirm this After step

is completed, a Cygwin sshd service will be configured in the windows operating system and the service can be started.

Then configure ssh password-free login

Rerun cygwin.

When you execute ssh localhost, you will be asked to log in with a password.

Use the ssh-keygen command to generate an ssh key and press Enter to end.

After generation, enter the .ssh directory and use the command: cp id_rsa.pub authorized_keys command to configure the key.

Then use exit to exit.

After re-entering the system, you can directly enter the system through ssh localhost without entering a password.

2. Configure the Hadoop environment

Modify the hadoop-env.sh file and add the JAVA_HOME location setting of the JDK installation directory.


# The java implementation to use. Required.

export JAVA_HOME=/cygdrive/c/Java/jdk1.7.0_67

Note in the picture:

Program Files is abbreviated as PROGRA~1

##Modify hdfs -site.xml, set the storage copy to 1 (because the configuration is pseudo-distributed)

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

Note: This picture has an extra property added , the content is to solve possible permission problems! ! ! HDFS: Hadoop Distributed File System

You can dynamically CRUD files or folders through commands in HDFS

Note that permissions may occur The problem needs to be avoided by configuring the following content in hdfs-site.xml:

<property>
<name>dfs.permissions</name>
<value>false</value>
</property>

Modify mapred-site.xml, Set the server and port number where JobTracker runs (since it is currently running on this machine, just write localhost directly, and the port can be bound to any idle port)

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>

Configure core -site.xml, configure the server and port number corresponding to the HDFS file system (also on the current host)

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

After configuring the above content, enter hadoop in Cygwin Directory

In the bin directory, format the HDFS file system (must be formatted before first use), and then enter the startup command:

3. Eclipse development environment setup

This is what I wrote The general configuration method is given in Blog Big Data [2] HDFS deployment and file reading and writing (including eclipse hadoop configuration). However, it needs to be improved at this time.

Copy the hadoop-eclipse-plugin.jar support package in hadoop to the plugin directory of eclipse to add Hadoop support to eclipse.

After starting Eclipse, switch to the MapReduce interface.

在windows工具选项选择showviews的others里面查找map/reduce locations。

在Map/Reduce Locations窗口中建立一个Hadoop Location,以便与Hadoop进行关联。

注意:此处的两个端口应为你配置hadoop的时候设置的端口!!!

完成后会建立好一个Hadoop Location

在左侧的DFS Location中,还可以看到HDFS中的各个目录

并且你可以在其目录下自由创建文件夹来存取数据。

下面你就可以创建mapreduce项目了,方法同正常创建一样。

4、网络数据爬取

现在我们通过编写一段程序,来将爬取的新闻内容的有效信息保存到HDFS中。

此时就有了两种网络爬虫的方法:

其一就是利用heritrix工具获取的数据;

其一就是java代码结合jsoup编写的网络爬虫。

方法一的信息保存到HDFS:

直接读取生成的本地文件,用jsoup解析html,此时需要将jsoup的jar包导入到项目中


package org.liky.sina.save;

//这里用到了JSoup开发包,该包可以很简单的提取到HTML中的有效信息
import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SinaNewsData {

  private static Configuration conf = new Configuration();

  private static FileSystem fs;

  private static Path path;

  private static int count = 0;

  public static void main(String[] args) {
    parseAllFile(new File(
        "E:/heritrix-1.12.1/jobs/sina_news_job_02-20170814013255352/mirror/"));
  }

  public static void parseAllFile(File file) {
    // 判断类型
    if (file.isDirectory()) {
      // 文件夹
      File[] allFile = file.listFiles();
      if (allFile != null) {
        for (File f : allFile) {
          parseAllFile(f);
        }
      }
    } else {
      // 文件
      if (file.getName().endsWith(".html")
          || file.getName().endsWith(".shtml")) {
        parseContent(file.getAbsolutePath());
      }
    }
  }

  public static void parseContent(String filePath) {
    try {
        //用jsoup的方法读取文件路径
      Document doc = Jsoup.parse(new File(filePath), "utf-8");
      //读取标题
      String title = doc.title();
      Elements descElem = doc.getElementsByAttributeValue("name",
          "description");
      Element descE = descElem.first();
      
      // 读取内容
      String content = descE.attr("content");

      if (title != null && content != null) {
        //通过Path来保存数据到HDFS中
        path = new Path("hdfs://localhost:9000/input/"
            + System.currentTimeMillis() + ".txt");

        fs = path.getFileSystem(conf);

        // 建立输出流对象
        FSDataOutputStream os = fs.create(path);
        // 使用os完成输出
        os.writeChars(title + "\r\n" + content);

        os.close();

        count++;

        System.out.println("已经完成" + count + " 个!");
      }

    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

The above is the detailed content of Introduction to the implementation method of Java web crawler in hadoop. For more information, please follow other related articles on the PHP Chinese website!

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Java Platform Independence: What does it mean for developers?Java Platform Independence: What does it mean for developers?May 08, 2025 am 12:27 AM

Java'splatformindependencemeansdeveloperscanwritecodeonceandrunitonanydevicewithoutrecompiling.ThisisachievedthroughtheJavaVirtualMachine(JVM),whichtranslatesbytecodeintomachine-specificinstructions,allowinguniversalcompatibilityacrossplatforms.Howev

How to set up JVM for first usage?How to set up JVM for first usage?May 08, 2025 am 12:21 AM

To set up the JVM, you need to follow the following steps: 1) Download and install the JDK, 2) Set environment variables, 3) Verify the installation, 4) Set the IDE, 5) Test the runner program. Setting up a JVM is not just about making it work, it also involves optimizing memory allocation, garbage collection, performance tuning, and error handling to ensure optimal operation.

How can I check Java platform independence for my product?How can I check Java platform independence for my product?May 08, 2025 am 12:12 AM

ToensureJavaplatformindependence,followthesesteps:1)CompileandrunyourapplicationonmultipleplatformsusingdifferentOSandJVMversions.2)UtilizeCI/CDpipelineslikeJenkinsorGitHubActionsforautomatedcross-platformtesting.3)Usecross-platformtestingframeworkss

Java Features for Modern Development: A Practical OverviewJava Features for Modern Development: A Practical OverviewMay 08, 2025 am 12:12 AM

Javastandsoutinmoderndevelopmentduetoitsrobustfeatureslikelambdaexpressions,streams,andenhancedconcurrencysupport.1)Lambdaexpressionssimplifyfunctionalprogramming,makingcodemoreconciseandreadable.2)Streamsenableefficientdataprocessingwithoperationsli

Mastering Java: Understanding Its Core Features and CapabilitiesMastering Java: Understanding Its Core Features and CapabilitiesMay 07, 2025 pm 06:49 PM

The core features of Java include platform independence, object-oriented design and a rich standard library. 1) Object-oriented design makes the code more flexible and maintainable through polymorphic features. 2) The garbage collection mechanism liberates the memory management burden of developers, but it needs to be optimized to avoid performance problems. 3) The standard library provides powerful tools from collections to networks, but data structures should be selected carefully to keep the code concise.

Can Java be run everywhere?Can Java be run everywhere?May 07, 2025 pm 06:41 PM

Yes,Javacanruneverywhereduetoits"WriteOnce,RunAnywhere"philosophy.1)Javacodeiscompiledintoplatform-independentbytecode.2)TheJavaVirtualMachine(JVM)interpretsorcompilesthisbytecodeintomachine-specificinstructionsatruntime,allowingthesameJava

What is the difference between JDK and JVM?What is the difference between JDK and JVM?May 07, 2025 pm 05:21 PM

JDKincludestoolsfordevelopingandcompilingJavacode,whileJVMrunsthecompiledbytecode.1)JDKcontainsJRE,compiler,andutilities.2)JVMmanagesbytecodeexecutionandsupports"writeonce,runanywhere."3)UseJDKfordevelopmentandJREforrunningapplications.

Java features: a quick guideJava features: a quick guideMay 07, 2025 pm 05:17 PM

Key features of Java include: 1) object-oriented design, 2) platform independence, 3) garbage collection mechanism, 4) rich libraries and frameworks, 5) concurrency support, 6) exception handling, 7) continuous evolution. These features of Java make it a powerful tool for developing efficient and maintainable software.

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Tools

mPDF

mPDF

mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),

EditPlus Chinese cracked version

EditPlus Chinese cracked version

Small size, syntax highlighting, does not support code prompt function

SecLists

SecLists

SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.

SublimeText3 English version

SublimeText3 English version

Recommended: Win version, supports code prompts!

PhpStorm Mac version

PhpStorm Mac version

The latest (2018.2.1) professional PHP integrated development tool