Example: Implementing a Web Crawler in Python with RabbitMQ


WBOY

Jun 06, 2016 am 11:29 AM

rabbitmq, Web Crawler

Write tasks.py

The code is as follows:

from celery import Celery
from tornado.httpclient import HTTPClient, HTTPError

app = Celery('tasks')
app.config_from_object('celeryconfig')


@app.task
def get_html(url):
    """Fetch url synchronously; return the response body, or None on an HTTP error."""
    http_client = HTTPClient()
    try:
        response = http_client.fetch(url, follow_redirects=True)
        return response.body
    except HTTPError as e:
        return None
    finally:
        # Release the client whether fetch() returned or raised.
        http_client.close()

Write celeryconfig.py

The code is as follows:

CELERY_IMPORTS = ('tasks',)
BROKER_URL = 'amqp://guest@localhost:5672//'
CELERY_RESULT_BACKEND = 'amqp://'
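
The uppercase names above are the older Celery setting style. As a hedged sketch, assuming Celery 4.0 or later (which introduced lowercase setting names) and a release where the `amqp://` result backend is no longer available, an equivalent celeryconfig.py might look like the following. The `rpc://` backend is a substitute chosen for this sketch, not something the original article uses.

```python
# celeryconfig.py with lowercase setting names (assumes Celery >= 4.0).
imports = ('tasks',)
broker_url = 'amqp://guest@localhost:5672//'
# Assumption: 'rpc://' stands in for the 'amqp://' result backend,
# which newer Celery releases no longer provide.
result_backend = 'rpc://'
```

With either version of the file, a worker is typically started from the same directory with `celery -A tasks worker --loglevel=info` before the spider is run.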

Write spider.py

The code is as follows:

from tasks import get_html
from queue import Queue
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin
import threading


class spider(object):
    def __init__(self):
        self.visited = {}
        self.queue = Queue()

    def process_html(self, html):
        # Hook for handling a fetched page; left empty in this example.
        pass
        #print(html)

    def _add_links_to_queue(self, url_base, html):
        soup = BeautifulSoup(html, 'html.parser')
        links = soup.find_all('a')
        for link in links:
            try:
                url = link['href']
            except KeyError:
                # Anchor without an href attribute; skip it.
                pass
            else:
                url_com = urlparse(url)
                if not url_com.netloc:
                    # Relative link: resolve it against the page URL.
                    self.queue.put(urljoin(url_base, url))
                else:
                    self.queue.put(url_com.geturl())

    def start(self, url):
        self.queue.put(url)
        for i in range(20):
            t = threading.Thread(target=self._worker)
            t.daemon = True
            t.start()
        # Block until every queued URL has been marked done.
        self.queue.join()

    def _worker(self):
        while 1:
            url = self.queue.get()
            try:
                if url in self.visited:
                    continue
                # Run the fetch on a Celery worker and wait for the result.
                result = get_html.delay(url)
                try:
                    html = result.get(timeout=5)
                except Exception as e:
                    print(url)
                    print(e)
                    html = None
                # get_html returns None on HTTP errors, so skip empty results.
                if html:
                    self.process_html(html)
                    self._add_links_to_queue(url, html)
                self.visited[url] = True
            finally:
                # Always balance queue.get() so queue.join() can return.
                self.queue.task_done()


s = spider()
s.start("http://www.bitsCN.com/")

Because of certain special cases that occur in HTML, the program still needs further work.
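
One way to handle some of those special cases is to normalize and filter each href before it is queued. The following is a minimal sketch, not part of the original program: the helper name `normalize_link` is hypothetical, and keeping only http/https links while dropping fragments is an assumption about which special cases matter here.

```python
from urllib.parse import urljoin, urlparse, urldefrag


def normalize_link(url_base, href):
    """Return an absolute http(s) URL for href, or None if it should be skipped.

    Hypothetical helper; it does not appear in the original spider.py.
    """
    if not href:
        return None
    # Resolve relative links against the page they were found on.
    absolute = urljoin(url_base, href.strip())
    # Drop the #fragment so the same page is not queued twice.
    absolute, _fragment = urldefrag(absolute)
    parts = urlparse(absolute)
    # Skip mailto:, javascript: and other non-web schemes.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    return absolute


if __name__ == "__main__":
    base = "http://example.com/a/index.html"
    for href in ["b.html", "/c", "#top", "mailto:me@example.com",
                 "javascript:void(0)", "https://other.example/x#y"]:
        print(href, "->", normalize_link(base, href))
```

In `_add_links_to_queue`, the result of such a helper could be checked for None before calling `self.queue.put`, so that unusable links never reach the workers.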
