search
HomeBackend DevelopmentPython TutorialPowerful web crawler system: pyspider
Powerful web crawler system: pyspiderMay 12, 2017 am 10:35 AM
pyspider

PySpider: A powerful web crawler system written by a Chinese with a powerful WebUI. Written in Python language, distributed architecture, supports multiple database backends, and powerful WebUI supports script editor, task monitor, project manager and result viewer.

1. Build environment:

System version: Linux centos-linux.shared 3.10.0-123.el7.x86_64 #1 SMP Mon Jun 30 12:09 :22 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Python version: Python 3.5.1

1.1. Build python3 environment:

After trying it, I chose the integrated environment Anaconda

1.1.1. Compile


# 下载依赖
yum install -y ncurses-devel openssl openssl-devel zlib-devel gcc make glibc-devel libffi-devel glibc-static glibc-utils sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-deve
# 下载python版本
wget https://www.python.org/ftp/python/3.5.1/Python-3.5.1.tgz
# 或者使用国内源
wget http://mirrors.sohu.com/python/3.5.1/Python-3.5.1.tgz
mv Python-3.5.1.tgz /usr/local/src;cd /usr/local/src
# 解压
tar -zxf Python-3.5.1.tgz;cd Python-3.5.1
# 编译安装
./configure --prefix=/usr/local/python3.5 --enable-shared
make && make install
# 建立软链接
ln -s /usr/local/python3.5/bin/python3 /usr/bin/python3
echo "/usr/local/python3.5/lib" > /etc/ld.so.conf.d/python3.5.conf
ldconfig
# 验证python3
python3
# Python 3.5.1 (default, Oct 9 2016, 11:44:24)
# [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux
# Type "help", "copyright", "credits" or "license" for more information.
# >>>
# pip
/usr/local/python3.5/bin/pip3 install --upgrade pip
ln -s /usr/local/python3.5/bin/pip /usr/bin/pip
# 本人在安装时出现问题 将pip重装
wget https://bootstrap.pypa.io/get-pip.py --no-check-certificate
python get-pip.py

1.1.2. The integrated environment anaconda


# 集成环境anaconda(推荐)
wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
# 直接安装即可
./Anaconda3-4.2.0-Linux-x86_64.sh
# 若出错,可能是解压失败
yum install bzip2

1.2. Install mariaDB


# 安装
yum -y install mariadb mariadb-server
# 启动
systemctl start mariadb
# 设置为开机启动
systemctl enable mariadb
# 配置密码 默认为空
mysql_secure_installation
# 登录
mysql -u root -p
# 创建一个用户 自己设定账户密码
CREATE USER 'user_name'@'localhost' IDENTIFIED BY 'user_pass';
GRANT ALL PRIVILEGES ON *.* TO 'user_name'@'localhost' WITH GRANT OPTION;
CREATE USER 'user_name'@'%' IDENTIFIED BY 'user_pass';
GRANT ALL PRIVILEGES ON *.* TO 'user_name'@'%' WITH GRANT OPTION;

1.3. Install pyspider

I use Anaconda


# 搭建虚拟环境sbird python版本3.*
conda create -n sbird python=3*
# 进入环境
source activate sbird
# 安装pyspider
pip install pyspider
# 报错 
# it does not exist. The exported locale is "en_US.UTF-8" but it is not supported
# 执行 可写入.bashrc
export LC_ALL=en_US.utf-8
export LANG=en_US.utf-8
#ImportError: pycurl: libcurl link-time version (7.29.0) is older than compile-time version (7.49.0)
conda install pycurl
# 退出
source deactivate sbird
# 若在虚拟机内 出现无法访问localhost:5000 可关闭防火墙
systemctl stop firewalld.service
#########直接运行源码==============
mkdir git;cd git
# 下载
git clone https://github.com/binux/pyspider.git
# 安装
/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py

Other methods


# 搭建虚拟环境
pip install virtualenv
mkdir python;cd python
# 创建虚拟环境pyenv3
virtualenv -p /usr/bin/python3 pyenv3
# 进入虚拟环境 激活环境
cd pyenv3/
source ./bin/activate
pip install pyspider
# 若pycurl报错 
yum install libcurl-devel
# 继续
pip install pyspider
# 关闭
deactivate

I recommend using anaconda to install

If An error occurred during the running of pyspider. Please refer to the anaconda installation section. At this point, visit localhost:5000 to see the page.

1.4.Install Supervisor


##

# 安装
yum install supervisor -y
# 若无法检索 则添加阿里的epel源
vim /etc/yum.repos.d/epel.repo
# 添加以下内容
[epel]
name=Extra Packages for Enterprise Linux 7 - $basearch
baseurl=http://mirrors.aliyun.com/epel/7/$basearch
http://mirrors.aliyuncs.com/epel/7/$basearch
failovermethod=priority
enabled=1
gpgcheck=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7

[epel-debuginfo]
name=Extra Packages for Enterprise Linux 7 - $basearch - Debug
baseurl=http://mirrors.aliyun.com/epel/7/$basearch/debug
http://mirrors.aliyuncs.com/epel/7/$basearch/debug
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=0

[epel-source]
name=Extra Packages for Enterprise Linux 7 - $basearch - Source
baseurl=http://mirrors.aliyun.com/epel/7/SRPMS
http://mirrors.aliyuncs.com/epel/7/SRPMS
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=0
# 安装
yum install supervisor -y
# 测试是否安装成功
echo_supervisord_conf

1.4.1.Supervisor usage


supervisord   #supervisor的服务器端部分 启动
supervisorctl  #启动supervisor的命令行窗口
# 假设创建进程pyspider01
vim /etc/supervisord.d/pyspider01.ini
# 写入以下内容
[program:pyspider01]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/pyspider01.log
# 重载
supervisorctl reload
# 启动
supervisorctl start pyspider01
# 也可这样启动
supervisord -c /etc/supervisord.conf
# 查看状态
supervisorctl status
# output 
pyspider01            RUNNING  pid 4026, uptime 0:02:40
# 关闭
supervisorctl shutdown

1.5. Installationredis


# 消息队列采用redis
mkdir download;cd download
wget http://download.redis.io/releases/redis-3.2.4.tar.gz
tar xzf redis-3.2.4.tar.gz
cd redis-3.2.4
make
# 或者直接yum安装
yum -y install redis
# 启动
systemctl start redis.service
# 重启
systemctl restart redis.service
# 停止
systemctl stop redis.service
# 查看状态
systemctl status redis.service
# 更改文件/etc/redis.conf
vim /etc/redis.conf
# 更改内容
daemonize no 改为 daemonize yes
bind 127.0.0.1 改为 bind 10.211.55.22(当前服务器ip)
# 重启redis
systemctl restart redis.service

1.6. About self-start


# Supervisor添加到自启动服务
systemctl enable supervisord.service
# redis添加到自启动服务
systemctl enable redis.service
# 关闭防火墙自启动
systemctl disable firewalld.service

At this point, the pyspider single server operating environment has been built and deployed. Start localhost:5000 to enter the web interface.

You can also write a script to run and check the running status in /pyspider/supervisor/pyspider01.log.

2. Distributed deployment

Name the server you just configured centos01. According to this configuration, deploy two centos02 and centos03 respectively.

As follows:

Server name ip description



centos01 10.211.55.22 redis,mariaDB, scheduler
centos02 10.211.55.23 fetcher, processor, result_worker,phantomjs
centos03 10.211.55.24 fetcher, processor,,result_worker,webui

2.1.centos01

Enter server centos01, After the first step, the basic environment has been set up. First edit the

configuration file /pyspider/config.json

##

{
 "taskdb": "mysql+taskdb://user_name:user_pass@10.211.55.22:3306/taskdb",
 "projectdb": "mysql+projectdb://user_name:user_pass@10.211.55.22:3306/projectdb",
 "resultdb": "mysql+resultdb://user_name:user_pass@10.211.55.22:3306/resultdb",
 "message_queue": "redis://10.211.55.22:6379/db",
 "logging-config": "/pyspider/logging.conf",
 "phantomjs-proxy":"10.211.55.23:25555",
 "webui": {
  "username": "",
  "password": "",
  "need-auth": false,
  "host":"10.211.55.24",
  "port":"5000",
  "scheduler-rpc":"http:// 10.211.55.22:5002",
  "fetcher-rpc":"http://10.211.55.23:5001"
 },
 "fetcher": {
  "xmlrpc":true,
  "xmlrpc-host": "0.0.0.0",
  "xmlrpc-port": "5001"
 },
 "scheduler": {
  "xmlrpc":true,
  "xmlrpc-host": "0.0.0.0",
  "xmlrpc-port": "5002"
 }
}

and try to run:

/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
# 报错
ImportError: No module named 'mysql'
# 下载 mysql-connector-python
cd ~/git/
git clone https://github.com/mysql/mysql-connector-python.git
# 安装
source activate sbird
cd mysql-connector-python
python setup.py install
# 安装redis
pip install redis
source deactivate
# 运行
/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
# 输出 ok
[I 161010 15:57:25 scheduler:644] scheduler starting...
[I 161010 15:57:25 scheduler:779] scheduler.xmlrpc listening on 0.0.0.0:5002
[I 161010 15:57:25 scheduler:583] in 5m: new:0,success:0,retry:0,failed:0

After successful operation, you can directly change /etc/supervisord.d/pyspider01.ini as follows:

[program:pyspider01]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/pyspider01.log
# 重载
supervisorctl reload
# 查看状态
supervisorctl status

centos01 deployment complete.

2.2.centos02

In centos02, you need to run result_worker, processor, phantomjs, and fetcher

to create files respectively:

/etc/supervisord.d/result_worker.ini

[program:result_worker]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json result_worker
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/result_worker.log
/etc/supervisord.d/processor.ini

[program:processor]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json processor
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/processor.log
/etc/supervisord.d/phantomjs.ini

[program:phantomjs]

command   = /pyspider/phantomjs --config=/pyspider/pjsconfig.json /pyspider/phantomjs_fetcher.js 25555
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/phantomjs.log
/etc/supervisord.d/fetcher.ini

[program:fetcher]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json fetcher
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/fetcher.log

Create pjsconfig.json in the pyspider directory

{
 /*--ignore-ssl-errors=true */
 "ignoreSslErrors": true,

 /*--ssl-protocol=true */
 "sslprotocol": "any",

 /* Same as: --output-encoding=utf8 */
 "outputEncoding": "utf8",

 /* persistent Cookies. */
 /*cookiesfile="e:/phontjscookies.txt",*/
 cookiesfile="pyspider/phontjscookies.txt",

 /* load image */
 autoLoadImages = false
}

Download phantomjs to the /pyspider/ folder and add git/pyspider/pyspider/ fetcher/phantomjs_fetcher.js is copied to phantomjs_fetcher.js

##
# 重载
supervisorctl reload
# 查看状态
supervisorctl status
# output
fetcher             RUNNING  pid 3446, uptime 0:00:07
phantomjs            RUNNING  pid 3448, uptime 0:00:07
processor            RUNNING  pid 3447, uptime 0:00:07
result_worker          RUNNING  pid 3445, uptime 0:00:07

centos02 is deployed.

2.3.centos03

The deployment of these three processes fetcher, processor, result_worker is the same as centos02. This server mainly adds webui on the basis of the previous ones.

Create file:

/etc/supervisord.d/webui.ini

[program:webui]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json webui
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/webui.log
# 重载
supervisorctl reload
# 查看状态
supervisorctl status
# output
fetcher             RUNNING  pid 2724, uptime 0:00:07
processor            RUNNING  pid 2725, uptime 0:00:07
result_worker          RUNNING  pid 2723, uptime 0:00:07
webui              RUNNING  pid 2726, uptime 0:00:07

3. Summary

[Related recommendations]

1.
Python Free video tutorial

2. Python learning manual

3. Python object-oriented video tutorial

The above is the detailed content of Powerful web crawler system: pyspider. For more information, please follow other related articles on the PHP Chinese website!

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Linux中如何构建4块虚拟盘来搭建分布式MinIO集群?Linux中如何构建4块虚拟盘来搭建分布式MinIO集群?Feb 10, 2024 pm 04:48 PM

由于最近刚开始负责对象存储相关系统的建设与稳定性运维,作为一个“对象存储”的一个新手,需要加强这块的学习。由于公司目前采用MinIO来搭建公司的对象存储体系,后续我会逐步将自己关于MinIO的学习经验分享出来,欢迎大家持续关注。本文主要是介绍如何在测试环境中搭建MinIO,这也是构建MinIO学习环境最基本的步骤。1、准备实验环境使用OracleVMVirtualBox虚拟机,安装一个最小版本的Linux,然后添加4块虚拟盘,用于充当MinIO的虚拟盘。实验环境如下所示:接下来和大家简单介绍一下

centos7怎么查看php安装目录?三种方法分享centos7怎么查看php安装目录?三种方法分享Mar 22, 2023 am 10:38 AM

如果你正在使用 CentOS 7 操作系统,需要查看 PHP 安装目录以便定位配置文件、扩展等相关信息,那么就需要了解一些相关命令和技巧。下面,我们将为您介绍一些方法来查看 CentOS 7 上的 PHP 安装目录。

CentOS7如何安装Nginx并配置自动启动CentOS7如何安装Nginx并配置自动启动May 14, 2023 pm 03:01 PM

1、官网下载安装包选择适合linux的版本,这里选择最新的版本,下载到本地后上传到服务器或者centos下直接wget命令下载。切换到/usr/local目录,下载软件包#cd/usr/local#wgethttp://nginx.org/download/nginx-1.11.5.tar.gz2、安装nginx先执行以下命令,安装nginx依赖库,如果缺少依赖库,可能会安装失败,具体可以参考文章后面的错误提示信息。#yuminstallgcc-c++#yuminstallpcre#yumins

怎么在CentOS7中使用Nginx和PHP7-FPM安装Nextcloud怎么在CentOS7中使用Nginx和PHP7-FPM安装NextcloudMay 24, 2023 pm 08:13 PM

先决条件64位的centos7服务器的root权限步骤1-在centos7中安装nginx和php7-fpm在开始安装nginx和php7-fpm之前,我们还学要先添加epel包的仓库源。使用如下命令:yum-yinstallepel-release现在开始从epel仓库来安装nginx:yum-yinstallnginx然后我们还需要为php7-fpm添加另外一个仓库。互联网中有很个远程仓库提供了php7系列包,我在这里使用的是webtatic。添加php7-fpmwebtatic仓库:rpm

如何在 CentOS 7 中安装并配置 Java 环境变量?如何在 CentOS 7 中安装并配置 Java 环境变量?Apr 22, 2023 pm 04:28 PM

安装环境:Centos764位Jdk1.864位Xshell免费版win10*64位一、先进来,你需要检查自己的openjdk是否卸载(或者判断是否存在,因为一般centos都会预装openjdk):在xshell或rpm-qa|grepjdk中输入rpm-qa|grepjavarpm-qa|grepjava第二,如果有一个对应的openjdk,并且显示了一个响应列表,那么就需要卸载它。在xshell中输入rpm-e-nodepstzdata-文件名(这个文件名是你查看的openjdk文件列表中

使用Gin框架实现分布式部署和管理功能使用Gin框架实现分布式部署和管理功能Jun 22, 2023 pm 11:39 PM

随着互联网的发展和应用,分布式系统也越来越受到人们的关注和重视。而在分布式系统中,如何实现快速部署和便捷管理则成为了一项必要的技术。本文将介绍如何使用Gin框架来实现分布式系统的部署和管理功能。一、分布式系统部署分布式系统的部署主要包括了代码部署、环境部署、配置管理和服务注册等几个方面。以下将逐一介绍这些方面。代码部署在分布式系统中,代码部署是一个重要的环节

Centos7改系统时区方法有哪些Centos7改系统时区方法有哪些Mar 03, 2023 am 10:47 AM

Centos7修改系统时区的两种方法:1、使用timedatectl命令,可设定和修改时区信息,语法“timedatectl set-timezone 时区标识”;2、修改用户目录下的“.bash_profile”文件,在文件末尾追加“TZ='时区标识'; export TZ”即可。

centos7使用rpm安装mysql5.7的方法centos7使用rpm安装mysql5.7的方法May 27, 2023 am 08:05 AM

1.下载4个rpm包mysql-community-client-5.7.26-1.el7.x86_64.rpmmysql-community-common-5.7.26-1.el7.x86_64.rpmmysql-community-libs-5.7.26-1.el7.x86_64.rpmmysql-community-server-5.7.26-1.el7.x86_64.rpm想要用迅雷进行下载得先找到对应的rpm下载路径首先浏览器打开mysql官网:在打开的界面,按键盘f12打开开发者工具

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. How to Fix Audio if You Can't Hear Anyone
3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Atom editor mac version download

Atom editor mac version download

The most popular open source editor

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

Dreamweaver Mac version

Dreamweaver Mac version

Visual web development tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

MinGW - Minimalist GNU for Windows

MinGW - Minimalist GNU for Windows

This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.