PySpider: a powerful web crawler system written in Python by a Chinese developer, with a powerful WebUI. It has a distributed architecture, supports multiple database backends, and its WebUI provides a script editor, task monitor, project manager, and result viewer.
1. Build environment:
System version: Linux centos-linux.shared 3.10.0-123.el7.x86_64 #1 SMP Mon Jun 30 12:09:22 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Python version: Python 3.5.1
1.1. Build the Python 3 environment
After trying a few options, I chose the Anaconda integrated environment.
1.1.1. Compile
# Install build dependencies
yum install -y ncurses-devel openssl openssl-devel zlib-devel gcc make glibc-devel libffi-devel glibc-static glibc-utils sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel
# Download the Python release
wget https://www.python.org/ftp/python/3.5.1/Python-3.5.1.tgz
# Or use a mirror inside China
wget http://mirrors.sohu.com/python/3.5.1/Python-3.5.1.tgz
mv Python-3.5.1.tgz /usr/local/src;cd /usr/local/src
# Unpack
tar -zxf Python-3.5.1.tgz;cd Python-3.5.1
# Configure, compile and install
./configure --prefix=/usr/local/python3.5 --enable-shared
make && make install
# Create a symlink and register the shared library
ln -s /usr/local/python3.5/bin/python3 /usr/bin/python3
echo "/usr/local/python3.5/lib" > /etc/ld.so.conf.d/python3.5.conf
ldconfig
# Verify python3
python3
# Python 3.5.1 (default, Oct 9 2016, 11:44:24)
# [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux
# Type "help", "copyright", "credits" or "license" for more information.
# >>>
# pip
/usr/local/python3.5/bin/pip3 install --upgrade pip
ln -s /usr/local/python3.5/bin/pip /usr/bin/pip
# I ran into problems at this step and reinstalled pip
wget https://bootstrap.pypa.io/get-pip.py --no-check-certificate
python get-pip.py
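If the build picked up the -devel packages listed above, the optional standard-library modules should import cleanly. A quick sanity check (my own addition, not part of the original walkthrough), run with python3:

# Check that the interpreter was built with SSL, SQLite and zlib support
import ssl, sqlite3, zlib
print(ssl.OPENSSL_VERSION)       # e.g. OpenSSL 1.0.1e-fips ...
print(sqlite3.sqlite_version)    # version of the SQLite library linked in
print(zlib.ZLIB_VERSION)

If any of these imports fails, install the matching -devel package and recompile.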
1.1.2. The Anaconda integrated environment
# Anaconda integrated environment (recommended)
wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
# Just run the installer
./Anaconda3-4.2.0-Linux-x86_64.sh
# If it fails, the archive probably could not be unpacked; install bzip2
yum install bzip2
1.2. Install MariaDB
# Install
yum -y install mariadb mariadb-server
# Start the service
systemctl start mariadb
# Enable it at boot
systemctl enable mariadb
# Set the root password (empty by default)
mysql_secure_installation
# Log in
mysql -u root -p
# Create a user; choose your own account name and password
CREATE USER 'user_name'@'localhost' IDENTIFIED BY 'user_pass';
GRANT ALL PRIVILEGES ON *.* TO 'user_name'@'localhost' WITH GRANT OPTION;
CREATE USER 'user_name'@'%' IDENTIFIED BY 'user_pass';
GRANT ALL PRIVILEGES ON *.* TO 'user_name'@'%' WITH GRANT OPTION;
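The distributed setup in section 2 points pyspider's taskdb, projectdb and resultdb at this MariaDB instance. Whether pyspider creates those databases on first start depends on the backend driver, so creating them up front does no harm. A minimal sketch, assuming mysql-connector-python (installed later, in section 2.1) and the user_name/user_pass account created above:

# Create the three databases that config.json will reference.
# Host, user and password are assumptions matching the rest of this article.
import mysql.connector

conn = mysql.connector.connect(host='10.211.55.22', user='user_name', password='user_pass')
cur = conn.cursor()
for db in ('taskdb', 'projectdb', 'resultdb'):
    cur.execute('CREATE DATABASE IF NOT EXISTS {}'.format(db))
conn.close()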
1.3. Install pyspider
I use Anaconda
# Create a virtual environment named sbird with Python 3.*
conda create -n sbird python=3*
# Activate the environment
source activate sbird
# Install pyspider
pip install pyspider
# Possible error:
# it does not exist. The exported locale is "en_US.UTF-8" but it is not supported
# Fix by exporting the locale (can be added to .bashrc):
export LC_ALL=en_US.utf-8
export LANG=en_US.utf-8
# ImportError: pycurl: libcurl link-time version (7.29.0) is older than compile-time version (7.49.0)
conda install pycurl
# Deactivate
source deactivate sbird
# If you cannot reach localhost:5000 from inside a VM, try stopping the firewall
systemctl stop firewalld.service
######### Run directly from source ==============
mkdir git;cd git
# Clone the repository
git clone https://github.com/binux/pyspider.git
# Run pyspider from source
/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py
Alternative: virtualenv
# Set up a virtual environment with virtualenv
pip install virtualenv
mkdir python;cd python
# Create a virtual environment named pyenv3
virtualenv -p /usr/bin/python3 pyenv3
# Enter and activate the environment
cd pyenv3/
source ./bin/activate
pip install pyspider
# If pycurl fails to build
yum install libcurl-devel
# Try again
pip install pyspider
# Deactivate
deactivate
I recommend installing with Anaconda. If an error occurs while running pyspider, refer to the Anaconda installation notes above. At this point, visit localhost:5000 and you should see the WebUI.
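To confirm the install works end to end, create a project in the WebUI and paste in a script. The sketch below follows the standard pyspider quickstart pattern; the seed URL and the returned fields are placeholders to adapt:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # Seed URL; replace with the site you actually want to crawl
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Follow every absolute link found on the page
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # Whatever this returns is written to resultdb
        return {'url': response.url, 'title': response.doc('title').text()}

Click "Run" in the script editor to single-step the callbacks, then switch the project status to RUNNING on the dashboard.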
1.4. Install Supervisor
# Install
yum install supervisor -y
# If the package cannot be found, add the Aliyun EPEL repository
vim /etc/yum.repos.d/epel.repo
# with the following content:
[epel]
name=Extra Packages for Enterprise Linux 7 - $basearch
baseurl=http://mirrors.aliyun.com/epel/7/$basearch
        http://mirrors.aliyuncs.com/epel/7/$basearch
failovermethod=priority
enabled=1
gpgcheck=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7

[epel-debuginfo]
name=Extra Packages for Enterprise Linux 7 - $basearch - Debug
baseurl=http://mirrors.aliyun.com/epel/7/$basearch/debug
        http://mirrors.aliyuncs.com/epel/7/$basearch/debug
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=0

[epel-source]
name=Extra Packages for Enterprise Linux 7 - $basearch - Source
baseurl=http://mirrors.aliyun.com/epel/7/SRPMS
        http://mirrors.aliyuncs.com/epel/7/SRPMS
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=0
# Install again
yum install supervisor -y
# Check that the installation succeeded
echo_supervisord_conf
1.4.1. Supervisor usage
supervisord    # the supervisor daemon; start this first
supervisorctl  # the supervisor command-line client
# Suppose we create a process called pyspider01
vim /etc/supervisord.d/pyspider01.ini
# with the following content:
[program:pyspider01]
command = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py
directory = /root/git/pyspider
user = root
process_name = %(program_name)s
autostart = true
autorestart = true
startsecs = 3
redirect_stderr = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile = /pyspider/supervisor/pyspider01.log
# Reload
supervisorctl reload
# Start the program
supervisorctl start pyspider01
# The daemon can also be started explicitly with a config file
supervisord -c /etc/supervisord.conf
# Check status
supervisorctl status
# output
pyspider01                       RUNNING   pid 4026, uptime 0:02:40
# Shut down
supervisorctl shutdown
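If you later want to check programmatically that a supervised process stays up (for example from a cron job), a minimal sketch that simply shells out to supervisorctl, using the pyspider01 program name defined above:

# Report whether the pyspider01 program managed by supervisord is RUNNING.
import subprocess

result = subprocess.run(['supervisorctl', 'status', 'pyspider01'], stdout=subprocess.PIPE)
output = result.stdout.decode()
print(output.strip())
if 'RUNNING' not in output:
    raise SystemExit('pyspider01 is not running')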
1.5. Install Redis
Redis is used as the message queue.
# Build from source
mkdir download;cd download
wget http://download.redis.io/releases/redis-3.2.4.tar.gz
tar xzf redis-3.2.4.tar.gz
cd redis-3.2.4
make
# Or install with yum instead
yum -y install redis
# Start
systemctl start redis.service
# Restart
systemctl restart redis.service
# Stop
systemctl stop redis.service
# Check status
systemctl status redis.service
# Edit /etc/redis.conf
vim /etc/redis.conf
# and change:
#   daemonize no     ->  daemonize yes
#   bind 127.0.0.1   ->  bind 10.211.55.22   (this server's IP)
# Restart redis
systemctl restart redis.service
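Once redis is bound to the server's IP, the other nodes must be able to reach it on port 6379. A quick connectivity check, assuming the redis Python package (which section 2.1 installs into the sbird environment anyway):

# Ping the redis instance that will serve as pyspider's message queue.
import redis

r = redis.StrictRedis(host='10.211.55.22', port=6379, db=0)
print(r.ping())   # True if the connection works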
1.6. Starting services at boot
# Add Supervisor to the services started at boot
systemctl enable supervisord.service
# Add redis to the services started at boot
systemctl enable redis.service
# Keep the firewall from starting at boot
systemctl disable firewalld.service
At this point, the single-server pyspider environment has been built and deployed. Open localhost:5000 to enter the web interface. You can also write a script, run it, and check the run logs in /pyspider/supervisor/pyspider01.log.
2. Distributed deployment
Name the server just configured centos01, and set up two more servers, centos02 and centos03, the same way. The roles are as follows:

Server name    IP             Components
centos01       10.211.55.22   redis, MariaDB, scheduler
centos02       10.211.55.23   fetcher, processor, result_worker, phantomjs
centos03       10.211.55.24   fetcher, processor, result_worker, webui

2.1. centos01
Log in to centos01. After step 1, the basic environment is already in place. First edit the configuration file /pyspider/config.json:
{ "taskdb": "mysql+taskdb://user_name:user_pass@10.211.55.22:3306/taskdb", "projectdb": "mysql+projectdb://user_name:user_pass@10.211.55.22:3306/projectdb", "resultdb": "mysql+resultdb://user_name:user_pass@10.211.55.22:3306/resultdb", "message_queue": "redis://10.211.55.22:6379/db", "logging-config": "/pyspider/logging.conf", "phantomjs-proxy":"10.211.55.23:25555", "webui": { "username": "", "password": "", "need-auth": false, "host":"10.211.55.24", "port":"5000", "scheduler-rpc":"http:// 10.211.55.22:5002", "fetcher-rpc":"http://10.211.55.23:5001" }, "fetcher": { "xmlrpc":true, "xmlrpc-host": "0.0.0.0", "xmlrpc-port": "5001" }, "scheduler": { "xmlrpc":true, "xmlrpc-host": "0.0.0.0", "xmlrpc-port": "5002" } }
Then try running the scheduler:
/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
# Error:
# ImportError: No module named 'mysql'
# Download mysql-connector-python
cd ~/git/
git clone https://github.com/mysql/mysql-connector-python.git
# Install it into the sbird environment
source activate sbird
cd mysql-connector-python
python setup.py install
# Install the redis client as well
pip install redis
source deactivate
# Run again
/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
# Output looks good:
[I 161010 15:57:25 scheduler:644] scheduler starting...
[I 161010 15:57:25 scheduler:779] scheduler.xmlrpc listening on 0.0.0.0:5002
[I 161010 15:57:25 scheduler:583] in 5m: new:0,success:0,retry:0,failed:0
Once it runs successfully, change /etc/supervisord.d/pyspider01.ini as follows:
[program:pyspider01]
command = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
directory = /root/git/pyspider
user = root
process_name = %(program_name)s
autostart = true
autorestart = true
startsecs = 3
redirect_stderr = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile = /pyspider/supervisor/pyspider01.log
# Reload
supervisorctl reload
# Check status
supervisorctl status
centos01 deployment is complete.
2.2. centos02
On centos02 you need to run result_worker, processor, phantomjs, and fetcher. Create the following files:
/etc/supervisord.d/result_worker.ini

[program:result_worker]
command = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json result_worker
directory = /root/git/pyspider
user = root
process_name = %(program_name)s
autostart = true
autorestart = true
startsecs = 3
redirect_stderr = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile = /pyspider/supervisor/result_worker.log

/etc/supervisord.d/processor.ini

[program:processor]
command = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json processor
directory = /root/git/pyspider
user = root
process_name = %(program_name)s
autostart = true
autorestart = true
startsecs = 3
redirect_stderr = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile = /pyspider/supervisor/processor.log

/etc/supervisord.d/phantomjs.ini

[program:phantomjs]
command = /pyspider/phantomjs --config=/pyspider/pjsconfig.json /pyspider/phantomjs_fetcher.js 25555
directory = /root/git/pyspider
user = root
process_name = %(program_name)s
autostart = true
autorestart = true
startsecs = 3
redirect_stderr = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile = /pyspider/supervisor/phantomjs.log

/etc/supervisord.d/fetcher.ini

[program:fetcher]
command = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json fetcher
directory = /root/git/pyspider
user = root
process_name = %(program_name)s
autostart = true
autorestart = true
startsecs = 3
redirect_stderr = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile = /pyspider/supervisor/fetcher.log
Create pjsconfig.json in the /pyspider directory:
{
  /* --ignore-ssl-errors=true */
  "ignoreSslErrors": true,
  /* --ssl-protocol=any */
  "sslProtocol": "any",
  /* --output-encoding=utf8 */
  "outputEncoding": "utf8",
  /* persistent cookies */
  "cookiesFile": "pyspider/phontjscookies.txt",
  /* do not load images */
  "loadImages": false
}
Download the phantomjs binary into the /pyspider/ folder, and copy git/pyspider/pyspider/fetcher/phantomjs_fetcher.js to /pyspider/phantomjs_fetcher.js.
# Reload
supervisorctl reload
# Check status
supervisorctl status
# output
fetcher                          RUNNING   pid 3446, uptime 0:00:07
phantomjs                        RUNNING   pid 3448, uptime 0:00:07
processor                        RUNNING   pid 3447, uptime 0:00:07
result_worker                    RUNNING   pid 3445, uptime 0:00:07
centos02 is deployed.
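Before moving on, it is worth checking from centos01 or centos03 that the services centos02 exposes are actually reachable. A minimal sketch that just opens TCP connections to the fetcher XML-RPC port and the phantomjs proxy port defined in config.json:

# Check that centos02's fetcher (5001) and phantomjs proxy (25555) accept connections.
import socket

for host, port in [('10.211.55.23', 5001), ('10.211.55.23', 25555)]:
    try:
        socket.create_connection((host, port), timeout=3).close()
        print('{}:{} reachable'.format(host, port))
    except OSError as exc:
        print('{}:{} NOT reachable: {}'.format(host, port, exc))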
2.3. centos03
The fetcher, processor, and result_worker processes are deployed exactly as on centos02; this server additionally runs the webui.
Create the file /etc/supervisord.d/webui.ini:

[program:webui]
command = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json webui
directory = /root/git/pyspider
user = root
process_name = %(program_name)s
autostart = true
autorestart = true
startsecs = 3
redirect_stderr = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile = /pyspider/supervisor/webui.log

# Reload
supervisorctl reload
# Check status
supervisorctl status
# output
fetcher                          RUNNING   pid 2724, uptime 0:00:07
processor                        RUNNING   pid 2725, uptime 0:00:07
result_worker                    RUNNING   pid 2723, uptime 0:00:07
webui                            RUNNING   pid 2726, uptime 0:00:07
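As a final check from any machine that can reach centos03, the WebUI should answer on port 5000. A small sketch using only the standard library:

# Confirm the pyspider WebUI on centos03 responds over HTTP.
from urllib.request import urlopen

resp = urlopen('http://10.211.55.24:5000/', timeout=5)
print(resp.getcode())   # expect 200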
3. Summary
At this point pyspider runs both as a single-server setup and as a distributed deployment: centos01 provides redis, MariaDB and the scheduler; centos02 runs a fetcher, processor, result_worker and phantomjs; centos03 runs a fetcher, processor, result_worker and the webui. All processes are managed by Supervisor, and the WebUI is available at http://10.211.55.24:5000.