Detailed explanation of using anyproxy to improve the efficiency of public account article collection

This article covers advanced usage of anyproxy and analyzes how to make collecting articles from WeChat public accounts more efficient. Readers who need it can use it as a reference.

The main factors that affect collection efficiency are the following:

1. Poor network environment;

2. The WeChat client crashes on the phone or in the emulator;

3. Some other network transmission errors;

I pay close attention to the operating cost of the collection system, which includes hardware investment, computing power, and the manual effort it takes up. Whenever collection is interrupted, the manual effort inevitably goes up, so the system has to run more stably. To that end I made some advanced modifications to anyproxy and used other tools to improve operating efficiency. The specific solutions follow:

1. Code upgrade

1) WeChat browser white screen

Solution: Modify the file requestHandler.js, which is in the same directory as rule_default.js (on macOS: /usr/local/lib/node_modules/anyproxy/lib/; on Windows, reader cnbattle reports in the comments: C:\Users\Administrator\AppData\Roaming\npm\node_modules\anyproxy\lib).

Find the function that begins with proxyReq.on("error",function(e){ in the code and modify its content:

//userRes.end();// comment out this line
userRes.end('<script>setTimeout(function(){window.location.reload();},2000);</script>');// insert this line

This way, when an error occurs, the proxy returns a piece of JavaScript that reloads the current page after 2 seconds, so the collection program can continue.
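The effect of this change can be tried without anyproxy. The following is a minimal sketch, assuming a stand-in response object: `buildReloadPage`, `fakeRes`, and `onProxyError` are hypothetical names for illustration and are not part of anyproxy.

```javascript
// Build the tiny page that the modified error handler sends back.
// Its only job is to reload the current page after delayMs milliseconds.
function buildReloadPage(delayMs) {
  return "<script>setTimeout(function(){window.location.reload();}," +
    delayMs + ");</script>";
}

// Stand-in for the response object that requestHandler.js calls userRes.
// It only records what was passed to end().
const fakeRes = {
  body: "",
  end(chunk) {
    this.body = chunk || "";
  },
};

// What the modified error branch does: answer with the reload page
// instead of ending the response with an empty body.
function onProxyError(userRes) {
  userRes.end(buildReloadPage(2000));
}

onProxyError(fakeRes);
console.log(fakeRes.body);
```

With an empty body the WeChat browser shows a white screen and stops; with this body it retries the same page by itself.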

2) Replace all images to reduce the burden on the browser

First you need to make a very small picture. I made a 1x1 pixel transparent PNG and put it in a folder of my choosing. Then modify the code of the file rule_default.js:

Add the following code next to the other var declarations at the beginning of the file:

var fs = require("fs"),
 img = fs.readFileSync("/Library/WebServer/Documents/space.png");// replace this absolute path with your own

Further down, find the function shouldUseLocalResponse: function(req,reqBody){ and insert this code inside it:

if(/mmbiz\.qpic\.cn/i.test(req.url)){
 req.replaceLocalFile = true;
 return true;
}else{
 return false;
}

Then find the function dealLocalResponse: function(req, reqBody,callback){ and insert this code inside it:

if(req.replaceLocalFile){
 callback(200, {"content-type":"image/png"},img );
}

These three pieces of code replace every image in public account pages with the local image. This reduces network transfer and the memory the browser uses, which effectively improves operating efficiency.
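To see how the two rule hooks cooperate, they can be exercised outside anyproxy. This is a sketch under stated assumptions: `img` is a stand-in buffer rather than a real PNG read from disk, `simulate` is a hypothetical helper that calls the hooks in the order the text describes, and the URLs are made-up examples.

```javascript
// Stand-in for the bytes of the 1x1 PNG; the real rule loads them
// once with fs.readFileSync when the rule file is loaded.
const img = Buffer.from("stand-in image bytes");

const rule = {
  // Decide whether the request should be answered locally.
  shouldUseLocalResponse: function (req, reqBody) {
    if (/mmbiz\.qpic\.cn/i.test(req.url)) {
      req.replaceLocalFile = true; // flag read by dealLocalResponse
      return true;
    }
    return false;
  },

  // Produce the local response for requests flagged above.
  dealLocalResponse: function (req, reqBody, callback) {
    if (req.replaceLocalFile) {
      callback(200, { "content-type": "image/png" }, img);
    }
  },
};

// Hypothetical driver: run one URL through both hooks and report
// what would be sent back, or null if the request is left alone.
function simulate(url) {
  const req = { url: url };
  let result = null;
  if (rule.shouldUseLocalResponse(req, "")) {
    rule.dealLocalResponse(req, "", function (status, headers, body) {
      result = { status: status, headers: headers, bytes: body.length };
    });
  }
  return result;
}

console.log(simulate("http://mmbiz.qpic.cn/mmbiz/example/0"));
console.log(simulate("http://mp.weixin.qq.com/s?__biz=example"));
```

Only requests whose URL matches the image host are answered locally; everything else returns null here, which corresponds to being passed through to the network.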

3) Prevent phones or emulators from accessing useless, error-causing URLs

Also in rule_default.js, find the function replaceRequestOption: function(req,option){ and insert this code inside it:

var newOption = option;
// The regex can be replaced with characteristic strings of the sites you do not want accessed.
// btrace is a Tencent Video domain that in practice crashes the browser very easily, so it is included here.
// Add more patterns by separating them with |.
if(/google|btrace/i.test(newOption.headers.host)){
 newOption.hostname = "127.0.0.1";// this IP can be replaced with another one
 newOption.port  = "80";
}
return newOption;

This modification was mentioned in an earlier article; here it is in more detail. It has many uses: different phones and emulators may access useless addresses that slow the device down, and this code can block that access.
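The redirect logic can be checked in isolation before editing rule_default.js. The sketch below assumes option objects shaped like the ones in the snippet above (`hostname`, `port`, `headers.host`); the host names passed in, including `btrace.example.com`, are made up for illustration.

```javascript
// Hosts matching this pattern are sent to the local machine instead.
const BLOCKED_HOSTS = /google|btrace/i;

function replaceRequestOption(req, option) {
  const newOption = option; // same object as option, as in the original snippet
  if (BLOCKED_HOSTS.test(newOption.headers.host)) {
    newOption.hostname = "127.0.0.1";
    newOption.port = "80";
  }
  return newOption;
}

// A request to a blocked host is rewritten to point at 127.0.0.1.
const blocked = replaceRequestOption({}, {
  hostname: "btrace.example.com",
  port: "80",
  headers: { host: "btrace.example.com" },
});

// A request to any other host is returned unchanged.
const allowed = replaceRequestOption({}, {
  hostname: "mp.weixin.qq.com",
  port: "80",
  headers: { host: "mp.weixin.qq.com" },
});

console.log(blocked.hostname, allowed.hostname);
```

Because the pattern is matched against the whole Host header, a short string such as `google` also blocks every subdomain that contains it.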

2. Use pm2 to manage anyproxy process

pm2 is a process manager for Node applications with built-in load balancing.

PM2 is a good fit when you want standalone code to use all available CPUs and want the process kept alive at all times with 0-second reloads. It is well suited to IaaS setups, but should not be used for PaaS solutions (PaaS support is to be developed later).

Main features:

Built-in load balancing (using the Node cluster module)

Runs in the background

0-second stop and reload; as I understand it, this means no downtime is needed for maintenance and upgrades.

Startup scripts for Ubuntu and CentOS

Stops unstable processes (avoids infinite restart loops)

Console monitoring

Provides an HTTP API

Remote control and real-time interface API (a Node.js module that allows interaction with the PM2 process manager)

Tested on Node.js v0.11, v0.10, and v0.8; compatible with CoffeeScript; runs on Linux and macOS.

First install pm2

sudo npm install -g pm2

Run anyproxy in the pm2 environment

sudo pm2 start anyproxy -x -- -i

Now anyproxy is running in the pm2 environment

There are several pm2 commands that can help manage and monitor anyproxy

# view the run log (--lines 10 is optional)
sudo pm2 logs anyproxy [--lines 10]
# stop anyproxy and remove it from pm2
sudo pm2 delete anyproxy
# restart anyproxy
sudo pm2 restart anyproxy
# monitor memory usage
sudo pm2 monit
# show running status
sudo pm2 list

Special tip: After pm2 is running, the terminal window can be closed.

The most important purpose of using pm2 to manage the anyproxy process is: after anyproxy exits the program due to an error, pm2 can automatically restart anyproxy.
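The same start command can also be written down as a pm2 process file, so the settings live in one place. This is a sketch and not from the original setup: the file name ecosystem.config.js follows pm2's convention, and whether pm2 resolves a bare command name in `script` the same way the command line does is an assumption; if it does not, put the full path to the anyproxy executable there.

```javascript
// ecosystem.config.js
const config = {
  apps: [
    {
      name: "anyproxy",    // name shown by `pm2 list`
      script: "anyproxy",  // assumption: found via PATH; otherwise use the full path
      args: "-i",          // the argument passed after `--` on the command line
      exec_mode: "fork",   // run as a plain process, not in cluster mode
      autorestart: true,   // restart anyproxy when it exits
    },
  ],
};

module.exports = config;
```

It would be started with `sudo pm2 start ecosystem.config.js` in place of the longer command line.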

3. Cancel the sudo password and enable pm2 to start automatically after booting

The following method is for macOS; Windows should have an equivalent. If you know a similar method, you can send me a private message.

1) First cancel the sudo password

Run the command:

sudo visudo

Find the code:

%admin   ALL = (ALL) ALL

Change to:

%admin   ALL = (ALL) NOPASSWD: ALL

This removes the sudo password prompt, so pm2 can then be set to start automatically at boot.

2) Set up auto-starting at boot

Enter the command in the terminal:

cd
touch autoexec.sh
vim autoexec.sh

vim opens the file. Press the letter i on the keyboard to enter insert mode, then paste the code:

#!/bin/sh 
sudo pm2 start anyproxy -x -- -i
sudo pm2 monit

After editing, press Esc, then type :wq to save and exit the editor.

Then run the command:

chmod 755 autoexec.sh

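This creates an executable file.

Then open "System Preferences" on the Mac, go to "Users & Groups", select the current user on the left and "Login Items" on the right. Click the + button, navigate to the current user's home directory (shortcut: Shift+Command+H), select the autoexec.sh file, and add it to the login items. It will then start automatically at boot.

With the settings above, the anyproxy system becomes more stable than before; most anyproxy errors are in fact caused by instability in the emulator or phone. In actual testing, anyproxy can now run for a long time without crashing. The WeChat client, however, still crashes after about 6 hours of running; at one page every 2 seconds, that is about 10,000 pages collected in total. If read counts are not collected, that can be the history message pages of 10,000 public accounts.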
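The page count follows directly from the run time and the paging interval. A small sketch of the arithmetic, with `pagesPerRun` as a hypothetical helper name; it gives an upper bound that assumes no page takes longer than the interval:

```javascript
// Number of pages that fit into one run at a fixed paging interval.
function pagesPerRun(hours, secondsPerPage) {
  return Math.floor((hours * 3600) / secondsPerPage);
}

// 6 hours at one page every 2 seconds.
console.log(pagesPerRun(6, 2)); // 10800
```

The result of 10,800 is consistent with the roughly 10,000 pages observed once slow pages and reloads are taken into account.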

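When the WeChat client crashes, it exits the WeChat browser and stays on the public account's profile page. So to push automation further, you can also write an automation script with TouchSprite (触动精灵) that periodically exits the WeChat browser and then taps into the history message page again. That should make long-running automated collection possible.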
