pyspider is a powerful web crawler system with a WebUI, written by a Chinese developer. It is written in Python, has a distributed architecture, and supports multiple database backends. Its WebUI provides a script editor, task monitor, project manager, and result viewer.
1. Build environment:
System version: Linux centos-linux.shared 3.10.0-123.el7.x86_64 #1 SMP Mon Jun 30 12:09:22 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Python version: Python 3.5.1
1.1. Build python3 environment:
After trying both approaches below, I chose the Anaconda distribution.
1.1.1. Compile
    # Install build dependencies
    yum install -y ncurses-devel openssl openssl-devel zlib-devel gcc make glibc-devel libffi-devel glibc-static glibc-utils sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel
    # Download the Python source
    wget https://www.python.org/ftp/python/3.5.1/Python-3.5.1.tgz
    # Or use a mirror in China
    wget http://mirrors.sohu.com/python/3.5.1/Python-3.5.1.tgz
    mv Python-3.5.1.tgz /usr/local/src; cd /usr/local/src
    # Unpack
    tar -zxf Python-3.5.1.tgz; cd Python-3.5.1
    # Configure, build, and install
    ./configure --prefix=/usr/local/python3.5 --enable-shared
    make && make install
    # Create a symlink and register the shared library path
    ln -s /usr/local/python3.5/bin/python3 /usr/bin/python3
    echo "/usr/local/python3.5/lib" > /etc/ld.so.conf.d/python3.5.conf
    ldconfig
    # Verify python3
    python3
    # Python 3.5.1 (default, Oct 9 2016, 11:44:24)
    # [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux
    # Type "help", "copyright", "credits" or "license" for more information.
    # >>>
    # pip
    /usr/local/python3.5/bin/pip3 install --upgrade pip
    ln -s /usr/local/python3.5/bin/pip /usr/bin/pip
    # I ran into problems during installation and reinstalled pip
    wget https://bootstrap.pypa.io/get-pip.py --no-check-certificate
    python get-pip.py
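A source build can skip extension modules whose headers were missing at compile time; for example, without openssl-devel the ssl module is typically not built. The following is a minimal sketch for checking this from the freshly built interpreter. The function name and the module list are illustrative choices, not part of the original steps.

```python
import importlib


def check_modules(names):
    """Try to import each module and report whether it is available."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            results[name] = False
    return results


if __name__ == "__main__":
    # Illustrative list: standard-library modules backed by the -devel
    # packages installed above.
    modules = ["ssl", "zlib", "sqlite3", "readline", "lzma"]
    for module, ok in check_modules(modules).items():
        print("%-10s %s" % (module, "ok" if ok else "MISSING"))
```

Saved under a file name of your choice and run with python3, it prints one line per module. A MISSING entry suggests installing the matching -devel package and rebuilding.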
1.1.2. Anaconda distribution
    # Anaconda distribution (recommended)
    wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
    # Run the installer directly
    ./Anaconda3-4.2.0-Linux-x86_64.sh
    # If it fails, extraction may have failed; install bzip2
    yum install bzip2
1.2. Install MariaDB
    # Install
    yum -y install mariadb mariadb-server
    # Start
    systemctl start mariadb
    # Enable at boot
    systemctl enable mariadb
    # Set the root password (empty by default)
    mysql_secure_installation
    # Log in
    mysql -u root -p
    # Create a user; choose your own user name and password
    CREATE USER 'user_name'@'localhost' IDENTIFIED BY 'user_pass';
    GRANT ALL PRIVILEGES ON *.* TO 'user_name'@'localhost' WITH GRANT OPTION;
    CREATE USER 'user_name'@'%' IDENTIFIED BY 'user_pass';
    GRANT ALL PRIVILEGES ON *.* TO 'user_name'@'%' WITH GRANT OPTION;
1.3. Install pyspider
I used Anaconda:
    # Create a virtual environment named sbird with Python 3.*
    conda create -n sbird python=3*
    # Enter the environment
    source activate sbird
    # Install pyspider
    pip install pyspider
    # Error:
    # it does not exist. The exported locale is "en_US.UTF-8" but it is not supported
    # Fix: run the following (can also be added to .bashrc)
    export LC_ALL=en_US.utf-8
    export LANG=en_US.utf-8
    # Error:
    # ImportError: pycurl: libcurl link-time version (7.29.0) is older than compile-time version (7.49.0)
    # Fix:
    conda install pycurl
    # Leave the environment
    source deactivate sbird
    # If localhost:5000 is unreachable inside a virtual machine, the firewall can be stopped
    systemctl stop firewalld.service
    # ===== Run directly from source =====
    mkdir git; cd git
    # Download
    git clone https://github.com/binux/pyspider.git
    # Run
    /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py
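Alternative method (virtualenv)
    # Set up a virtual environment
    pip install virtualenv
    mkdir python; cd python
    # Create the virtual environment pyenv3
    virtualenv -p /usr/bin/python3 pyenv3
    # Enter the directory and activate the environment
    cd pyenv3/
    source ./bin/activate
    pip install pyspider
    # If pycurl reports an error
    yum install libcurl-devel
    # Then retry
    pip install pyspider
    # Leave the environment
    deactivate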
I recommend installing with Anaconda.
If an error occurs while running pyspider, refer to the Anaconda installation section above. At this point, visiting localhost:5000 should show the pyspider page.
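When the page does not load, it helps to separate "pyspider is not listening" from "something blocks the connection". A minimal sketch of a TCP reachability check follows; the function name port_open is a hypothetical helper, and 5000 is the WebUI port mentioned above.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("webui reachable:", port_open("localhost", 5000))
```

If this reports True on the server itself but False from another machine, the firewall is a likely cause.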
1.4. Install Supervisor
    # Install
    yum install supervisor -y
    # If the package cannot be found, add the Aliyun EPEL repository
    vim /etc/yum.repos.d/epel.repo
    # Add the following content
    [epel]
    name=Extra Packages for Enterprise Linux 7 - $basearch
    baseurl=http://mirrors.aliyun.com/epel/7/$basearch
            http://mirrors.aliyuncs.com/epel/7/$basearch
    failovermethod=priority
    enabled=1
    gpgcheck=0
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7

    [epel-debuginfo]
    name=Extra Packages for Enterprise Linux 7 - $basearch - Debug
    baseurl=http://mirrors.aliyun.com/epel/7/$basearch/debug
            http://mirrors.aliyuncs.com/epel/7/$basearch/debug
    failovermethod=priority
    enabled=0
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
    gpgcheck=0

    [epel-source]
    name=Extra Packages for Enterprise Linux 7 - $basearch - Source
    baseurl=http://mirrors.aliyun.com/epel/7/SRPMS
            http://mirrors.aliyuncs.com/epel/7/SRPMS
    failovermethod=priority
    enabled=0
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
    gpgcheck=0

    # Install
    yum install supervisor -y
    # Check that the installation succeeded
    echo_supervisord_conf
1.4.1. Supervisor usage
    # Start supervisord, the server side of Supervisor
    supervisord
    # Open the Supervisor command-line client
    supervisorctl
    # Suppose we create a process named pyspider01
    vim /etc/supervisord.d/pyspider01.ini
    # Write the following content
    [program:pyspider01]
    command = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py
    directory = /root/git/pyspider
    user = root
    process_name = %(program_name)s
    autostart = true
    autorestart = true
    startsecs = 3
    redirect_stderr = true
    stdout_logfile_maxbytes = 500MB
    stdout_logfile_backups = 10
    stdout_logfile = /pyspider/supervisor/pyspider01.log
    # Reload
    supervisorctl reload
    # Start
    supervisorctl start pyspider01
    # Alternatively, start it this way
    supervisord -c /etc/supervisord.conf
    # Check status
    supervisorctl status
    # output
    pyspider01 RUNNING pid 4026, uptime 0:02:40
    # Shut down
    supervisorctl shutdown
1.5. Install Redis
    # Redis is used as the message queue
    mkdir download; cd download
    wget http://download.redis.io/releases/redis-3.2.4.tar.gz
    tar xzf redis-3.2.4.tar.gz
    cd redis-3.2.4
    make
    # Or install directly with yum
    yum -y install redis
    # Start
    systemctl start redis.service
    # Restart
    systemctl restart redis.service
    # Stop
    systemctl stop redis.service
    # Check status
    systemctl status redis.service
    # Edit /etc/redis.conf
    vim /etc/redis.conf
    # Changes:
    #   daemonize no    ->  daemonize yes
    #   bind 127.0.0.1  ->  bind 10.211.55.22   (this server's IP)
    # Restart redis
    systemctl restart redis.service
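After changing the bind address, it is worth confirming that Redis answers on the new IP before pointing pyspider at it. The sketch below sends an inline PING over a plain socket, so it needs no extra package. The function name redis_ping is hypothetical, and it assumes the default port 6379 and no password.

```python
import socket


def redis_ping(host, port=6379, timeout=3.0):
    """Send an inline PING and return True if the server answers +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            reply = conn.recv(64)
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
    return reply.startswith(b"+PONG")


if __name__ == "__main__":
    # 10.211.55.22 is the bind address configured above.
    print("redis answers:", redis_ping("10.211.55.22"))
```

A False result from another server, combined with True on the Redis host itself, usually points at the bind setting or the firewall.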
1.6. Start at boot
    # Enable Supervisor at boot
    systemctl enable supervisord.service
    # Enable redis at boot
    systemctl enable redis.service
    # Disable the firewall at boot
    systemctl disable firewalld.service
At this point, the single-server pyspider environment is built and deployed. Open localhost:5000 to enter the web interface. You can also write a script, run it, and check the running status in /pyspider/supervisor/pyspider01.log.
2. Distributed deployment
Name the server you just configured centos01. Following the same steps, set up two more servers, centos02 and centos03. The roles are as follows:

Server name   IP             Description
centos01      10.211.55.22   redis, mariaDB, scheduler
centos02      10.211.55.23   fetcher, processor, result_worker, phantomjs
centos03      10.211.55.24   fetcher, processor, result_worker, webui

2.1. centos01
Log in to centos01. After section 1, the basic environment is already set up. First edit the configuration file /pyspider/config.json:
    {
      "taskdb": "mysql+taskdb://user_name:user_pass@10.211.55.22:3306/taskdb",
      "projectdb": "mysql+projectdb://user_name:user_pass@10.211.55.22:3306/projectdb",
      "resultdb": "mysql+resultdb://user_name:user_pass@10.211.55.22:3306/resultdb",
      "message_queue": "redis://10.211.55.22:6379/db",
      "logging-config": "/pyspider/logging.conf",
      "phantomjs-proxy": "10.211.55.23:25555",
      "webui": {
        "username": "",
        "password": "",
        "need-auth": false,
        "host": "10.211.55.24",
        "port": "5000",
        "scheduler-rpc": "http://10.211.55.22:5002",
        "fetcher-rpc": "http://10.211.55.23:5001"
      },
      "fetcher": {
        "xmlrpc": true,
        "xmlrpc-host": "0.0.0.0",
        "xmlrpc-port": "5001"
      },
      "scheduler": {
        "xmlrpc": true,
        "xmlrpc-host": "0.0.0.0",
        "xmlrpc-port": "5002"
      }
    }
Then try to run the scheduler:
    /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
    # Error: ImportError: No module named 'mysql'
    # Download mysql-connector-python
    cd ~/git/
    git clone https://github.com/mysql/mysql-connector-python.git
    # Install it
    source activate sbird
    cd mysql-connector-python
    python setup.py install
    # Install the redis client
    pip install redis
    source deactivate
    # Run again
    /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
    # Output (ok)
    [I 161010 15:57:25 scheduler:644] scheduler starting...
    [I 161010 15:57:25 scheduler:779] scheduler.xmlrpc listening on 0.0.0.0:5002
    [I 161010 15:57:25 scheduler:583] in 5m: new:0,success:0,retry:0,failed:0
Once it runs successfully, change /etc/supervisord.d/pyspider01.ini as follows:
    [program:pyspider01]
    command = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
    directory = /root/git/pyspider
    user = root
    process_name = %(program_name)s
    autostart = true
    autorestart = true
    startsecs = 3
    redirect_stderr = true
    stdout_logfile_maxbytes = 500MB
    stdout_logfile_backups = 10
    stdout_logfile = /pyspider/supervisor/pyspider01.log
    # Reload
    supervisorctl reload
    # Check status
    supervisorctl status
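Every address in /pyspider/config.json follows from which server runs which component, so the file can be generated instead of edited by hand. The sketch below rebuilds the configuration shown earlier from the host assignments. The function build_config and its parameter names are hypothetical, and it assumes the same file is copied to every server.

```python
import json


def build_config(db_host, db_user, db_pass, redis_host,
                 scheduler_host, fetcher_host, webui_host, phantomjs_host):
    """Build the pyspider config dict from the host assignments."""
    def db_url(name):
        # Same URL shape as in the hand-written config above.
        return "mysql+%s://%s:%s@%s:3306/%s" % (name, db_user, db_pass, db_host, name)

    return {
        "taskdb": db_url("taskdb"),
        "projectdb": db_url("projectdb"),
        "resultdb": db_url("resultdb"),
        "message_queue": "redis://%s:6379/db" % redis_host,
        "logging-config": "/pyspider/logging.conf",
        "phantomjs-proxy": "%s:25555" % phantomjs_host,
        "webui": {
            "username": "",
            "password": "",
            "need-auth": False,
            "host": webui_host,
            "port": "5000",
            "scheduler-rpc": "http://%s:5002" % scheduler_host,
            "fetcher-rpc": "http://%s:5001" % fetcher_host,
        },
        "fetcher": {"xmlrpc": True, "xmlrpc-host": "0.0.0.0", "xmlrpc-port": "5001"},
        "scheduler": {"xmlrpc": True, "xmlrpc-host": "0.0.0.0", "xmlrpc-port": "5002"},
    }


if __name__ == "__main__":
    config = build_config(
        db_host="10.211.55.22", db_user="user_name", db_pass="user_pass",
        redis_host="10.211.55.22", scheduler_host="10.211.55.22",
        fetcher_host="10.211.55.23", webui_host="10.211.55.24",
        phantomjs_host="10.211.55.23",
    )
    print(json.dumps(config, indent=2))
```

Redirecting the output to /pyspider/config.json on each server keeps the three copies consistent when an IP changes.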
The centos01 deployment is complete.
2.2. centos02
On centos02, you need to run result_worker, processor, phantomjs, and fetcher. Create the following files, one per process:
/etc/supervisord.d/result_worker.ini

    [program:result_worker]
    command = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json result_worker
    directory = /root/git/pyspider
    user = root
    process_name = %(program_name)s
    autostart = true
    autorestart = true
    startsecs = 3
    redirect_stderr = true
    stdout_logfile_maxbytes = 500MB
    stdout_logfile_backups = 10
    stdout_logfile = /pyspider/supervisor/result_worker.log

/etc/supervisord.d/processor.ini

    [program:processor]
    command = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json processor
    directory = /root/git/pyspider
    user = root
    process_name = %(program_name)s
    autostart = true
    autorestart = true
    startsecs = 3
    redirect_stderr = true
    stdout_logfile_maxbytes = 500MB
    stdout_logfile_backups = 10
    stdout_logfile = /pyspider/supervisor/processor.log

/etc/supervisord.d/phantomjs.ini

    [program:phantomjs]
    command = /pyspider/phantomjs --config=/pyspider/pjsconfig.json /pyspider/phantomjs_fetcher.js 25555
    directory = /root/git/pyspider
    user = root
    process_name = %(program_name)s
    autostart = true
    autorestart = true
    startsecs = 3
    redirect_stderr = true
    stdout_logfile_maxbytes = 500MB
    stdout_logfile_backups = 10
    stdout_logfile = /pyspider/supervisor/phantomjs.log

/etc/supervisord.d/fetcher.ini

    [program:fetcher]
    command = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json fetcher
    directory = /root/git/pyspider
    user = root
    process_name = %(program_name)s
    autostart = true
    autorestart = true
    startsecs = 3
    redirect_stderr = true
    stdout_logfile_maxbytes = 500MB
    stdout_logfile_backups = 10
    stdout_logfile = /pyspider/supervisor/fetcher.log
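The four files differ only in the program name and the command, so they can be rendered from one template. The following is a minimal sketch; the names TEMPLATE, PYSPIDER, and render_program are hypothetical, and the paths are the ones used in the files above. It only prints the sections and leaves writing them under /etc/supervisord.d/ to you.

```python
# Shared section body; {name} and {command} are filled in per program.
# %(program_name)s is left untouched for Supervisor to expand.
TEMPLATE = """[program:{name}]
command = {command}
directory = /root/git/pyspider
user = root
process_name = %(program_name)s
autostart = true
autorestart = true
startsecs = 3
redirect_stderr = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile = /pyspider/supervisor/{name}.log
"""

PYSPIDER = ("/root/anaconda3/envs/sbird/bin/python "
            "/root/git/pyspider/run.py -c /pyspider/config.json")


def render_program(name, command=None):
    """Render one Supervisor section; by default run the pyspider component `name`."""
    if command is None:
        command = "%s %s" % (PYSPIDER, name)
    return TEMPLATE.format(name=name, command=command)


if __name__ == "__main__":
    for component in ("result_worker", "processor", "fetcher"):
        print(render_program(component))
    # phantomjs is not a pyspider subcommand here, so pass its command explicitly.
    print(render_program(
        "phantomjs",
        "/pyspider/phantomjs --config=/pyspider/pjsconfig.json "
        "/pyspider/phantomjs_fetcher.js 25555",
    ))
```

Plain string formatting is used instead of configparser so that the literal %(program_name)s placeholder is not mistaken for an interpolation key.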
Create pjsconfig.json in the /pyspider directory:
    {
      /* Same as: --ignore-ssl-errors=true */
      "ignoreSslErrors": true,
      /* Same as: --ssl-protocol=any */
      "sslProtocol": "any",
      /* Same as: --output-encoding=utf8 */
      "outputEncoding": "utf8",
      /* Persistent cookies */
      /* "cookiesFile": "e:/phontjscookies.txt", */
      "cookiesFile": "pyspider/phontjscookies.txt",
      /* Do not load images */
      "autoLoadImages": false
    }
Download phantomjs into the /pyspider/ folder, and copy git/pyspider/pyspider/fetcher/phantomjs_fetcher.js to /pyspider/phantomjs_fetcher.js.
    # Reload
    supervisorctl reload
    # Check status
    supervisorctl status
    # output
    fetcher RUNNING pid 3446, uptime 0:00:07
    phantomjs RUNNING pid 3448, uptime 0:00:07
    processor RUNNING pid 3447, uptime 0:00:07
    result_worker RUNNING pid 3445, uptime 0:00:07
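With several processes per server, reading the status output by eye gets error-prone. The sketch below parses text in the format shown above and lists any expected program that is not RUNNING. Both function names are hypothetical, and the sample text is the output from this section.

```python
def parse_status(text):
    """Map each program name to its state from `supervisorctl status` output."""
    states = {}
    for line in text.splitlines():
        parts = line.split()
        # Each status line starts with: <name> <STATE> ...
        if len(parts) >= 2:
            states[parts[0]] = parts[1]
    return states


def not_running(text, expected):
    """Return the expected programs that are missing or not in the RUNNING state."""
    states = parse_status(text)
    return [name for name in expected if states.get(name) != "RUNNING"]


if __name__ == "__main__":
    sample = (
        "fetcher RUNNING pid 3446, uptime 0:00:07\n"
        "phantomjs RUNNING pid 3448, uptime 0:00:07\n"
        "processor RUNNING pid 3447, uptime 0:00:07\n"
        "result_worker RUNNING pid 3445, uptime 0:00:07\n"
    )
    expected = ["fetcher", "phantomjs", "processor", "result_worker"]
    print("problems:", not_running(sample, expected))
```

In practice the text would come from the captured output of supervisorctl status on each server, with the expected list taken from that server's roles.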
centos02 is deployed.
2.3. centos03
The fetcher, processor, and result_worker processes are deployed the same way as on centos02. This server additionally runs webui.
Create the file /etc/supervisord.d/webui.ini:
    [program:webui]
    command = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json webui
    directory = /root/git/pyspider
    user = root
    process_name = %(program_name)s
    autostart = true
    autorestart = true
    startsecs = 3
    redirect_stderr = true
    stdout_logfile_maxbytes = 500MB
    stdout_logfile_backups = 10
    stdout_logfile = /pyspider/supervisor/webui.log
    # Reload
    supervisorctl reload
    # Check status
    supervisorctl status
    # output
    fetcher RUNNING pid 2724, uptime 0:00:07
    processor RUNNING pid 2725, uptime 0:00:07
    result_worker RUNNING pid 2723, uptime 0:00:07
    webui RUNNING pid 2726, uptime 0:00:07
