Detailed explanation on how to collect historical message pages of WeChat public accounts
This article explains how to obtain the history message page, which is the entry point for collecting articles from a WeChat public account. Readers who need it may find it a useful reference.
Collecting WeChat articles works like collecting website content: you start from a list page. For WeChat articles, the list page is the "view history messages" page of the official account. Many WeChat collectors found online use Sogou search instead. That method is much simpler, but the content it returns is incomplete. Therefore, we still collect from the most standard and comprehensive source, the public account's history message page.
Due to WeChat's restrictions, the link we can copy from the client is incomplete and cannot be opened in a browser to see the content. Therefore, we use anyproxy, as introduced in the previous article, to obtain the complete link address of a public account's history message page:
http://mp.weixin.qq.com/mp/getmasssendmsg?__biz=MjM5NDAwMTA2MA==&uin=NzM4MTk1ODgx&key=bf9387c4d02682e186a298a18276d8e0555e3ab51d81ca46de339e6082eb767343bef610edd80c9e1bfda66c2b62751511f7cc091a33a029709e94f0d1604e11220fc099a27b2e2d29db75cc0849d4bf&devicetype=android-17&version=26031c34&lang=zh_CN&nettype=WIFI&ascene=3&pass_ticket=Iox5ZdpRhrSxGYEeopVJwTBP7kZj51GYyEL24AT5Zyx+BoEMdPDBtOun1F/9ENSz&wx_header=1
As mentioned in the previous article, the __biz parameter is the ID of the official account, and uin is the ID of the user. So far, a user's uin appears to be the same across all official accounts. The other two important parameters, key and pass_ticket, are additional parameters appended by the WeChat client.
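As a rough illustration of the parameters just described, the following Python sketch splits a history-page link into its query parameters. The function name `history_link_params` and the shortened link are hypothetical, made up for this example. The query string is split by hand rather than URL-decoded, so that characters such as `+` and `/` in pass_ticket stay exactly as they appear in the link.

```python
from urllib.parse import urlsplit


def history_link_params(url):
    """Return the raw (undecoded) query parameters of a link as a dict."""
    query = urlsplit(url).query
    params = {}
    for pair in query.split("&"):
        if not pair:
            continue
        # Split on the first "=" only, so base64 padding such as "==" survives.
        name, _, value = pair.partition("=")
        params[name] = value
    return params


# Hypothetical, shortened link; a real key and pass_ticket are much longer.
link = (
    "http://mp.weixin.qq.com/mp/getmasssendmsg"
    "?__biz=MjM5NDAwMTA2MA==&uin=NzM4MTk1ODgx"
    "&key=EXAMPLEKEY&pass_ticket=EXAMPLE+TICKET/VALUE"
)
params = history_link_params(link)
print(params["__biz"], params["uin"])
```

Keeping the values raw makes it easy to check which of the four parameters (__biz, uin, key, pass_ticket) are present before trying to use the link.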
So before this address expires, we can open it in a browser and view the page source to get the article list of historical messages. To analyze the content automatically, we can also write a program that accepts a link address whose key and pass_ticket have not yet expired, and then obtains the article list, for example with a PHP program.
Recently, a friend told me that his collection target is a single public account, which makes the batch collection method from the previous article unnecessary. So let's look at how to get the article list from the history message page. By parsing the article list, we can get the link address of every article of this official account and then collect the content.
If the certificate is configured correctly, the anyproxy web interface can display https content. The address of the web interface is http://localhost:8002, where localhost can be replaced with your own IP address or domain name. In the list, find the record whose path starts with getmasssendmsg. After clicking it, the details of this record are displayed on the right side:
The part in the red box is the complete link address. After the domain name of the WeChat public platform is prepended to it, it can be opened in a browser.
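The prepending step can be sketched in Python as follows. The domain is the one that appears in the full link shown earlier. The captured path is a hypothetical, shortened stand-in for what anyproxy displays, and `build_history_url` is a name made up for this example.

```python
from urllib.parse import urljoin

# Domain of the WeChat public platform, as seen in the full link earlier.
WECHAT_MP_BASE = "http://mp.weixin.qq.com"


def build_history_url(captured_path):
    """Prepend the public platform domain to a path captured by the proxy."""
    return urljoin(WECHAT_MP_BASE, captured_path)


# Hypothetical, shortened path; the real one carries key and pass_ticket too.
captured = "/mp/getmasssendmsg?__biz=MjM5NDAwMTA2MA==&uin=NzM4MTk1ODgx"
full_url = build_history_url(captured)
print(full_url)
```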
Then scroll down to the end of the HTML content, where a JSON variable holds the list of historical articles:
We copy the value of the msgList variable and inspect it with a JSON formatting tool. The JSON has the following structure:
{
  "list": [
    {
      "app_msg_ext_info": {
        "author": "",
        "content": "",
        "content_url": "http://mp.weixin.qq.com/s?__biz=MzA5MzEzNDg3MQ==&mid=2652767427&idx=1&sn=37da0d7208283bf90e9a4a536e0af0ea&chksm=8b882dbbbcffa4ad2f0b8a141cc988d16bace564274018e68e5c53ee6f354f8ad56c9b98bade&scene=4#wechat_redirect",
        "copyright_stat": 100,
        "cover": "http://mmbiz.qpic.cn/mmbiz/MofBAcBsJ6X0xGrQ2XK5yQjzwb2eswxkRNBTgLtcqGziaFqwibzvtZAHCDkMeJU1fGZHpjoeibanPJ8rziaq68Akkg/0?wx_fmt=jpeg",
        "digest": "擦亮双眼,远离谣言。",
        "fileid": 505283695,
        "is_multi": 1,
        "multi_app_msg_item_list": [
          {
            "author": "",
            "content": "",
            "content_url": "http://mp.weixin.qq.com/s?__biz=MzA5MzEzNDg3MQ==&mid=2652767427&idx=2&sn=449ef1a874a37fed2429e14f724b56ef&chksm=8b882dbbbcffa4ade48a7932cda4263687e34fca8ea3a5a6233d2589d448b9f6130d3890ce93&scene=4#wechat_redirect",
            "copyright_stat": 100,
            "cover": "http://mmbiz.qpic.cn/mmbiz_png/MofBAcBsJ6XyaIn0qEDSSicBUBZbMYHYrhibia89ZnksCsUiaia2TLI1fyqjclibGa1hw3icP6oXeSpaWMjiabaghHl7yw/0?wx_fmt=png",
            "digest": "12月28日,广州亚运城综合体育馆,内附购票入口~",
            "fileid": 0,
            "source_url": "http://wechat.show.wepiao.com/detail/ff764b0731b7465db03b56b998e1f2b8?detailReferrer=1&from=groupmessage&isappinstalled=0",
            "title": "2017微信公开课Pro版即将召开"
          },
          ... // remaining items omitted
        ],
        "source_url": "",
        "subtype": 9,
        "title": "谣言热榜 | 十一月朋友圈十大谣言"
      },
      "comm_msg_info": {
        "content": "",
        "datetime": 1480933315,
        "fakeid": "3093134871",
        "id": 1000000010,
        "status": 2,
        "type": 49 // type 49 means an article (image-and-text) message
      }
    },
    ... // remaining items omitted
  ]
}
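A brief analysis of this JSON (only the important fields are introduced here; the others are omitted):
"list": [ // outermost key; appears only once and contains everything else
  { // each object at this level is one single-article or multi-article message, in other words one day's mass send
    "app_msg_ext_info": { // extended information of the article message
      "content_url": "link address of the article",
      "cover": "cover image",
      "digest": "summary",
      "is_multi": "whether this is a multi-article message; the value is 1 or 0",
      "multi_app_msg_item_list": [ // holds the articles from the second one onward; empty if is_multi=0
        {
          "content_url": "link address of the article",
          "cover": "cover image",
          "digest": "summary",
          "source_url": "address of the Read-the-original link",
          "title": "title of the sub-article"
        },
        ... // remaining items omitted
      ],
      "source_url": "address of the Read-the-original link",
      "title": "title of the headline article"
    },
    "comm_msg_info": { // basic information of the article message
      "datetime": "publication time, as a Unix timestamp",
      "type": 49 // type 49 means an article (image-and-text) message
    }
  },
  ... // remaining items omitted
]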
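The structure described above can be walked with a short loop. The following is a minimal Python sketch, not the article's own program (the article parses the JSON with PHP). The function name `iter_articles` and the shortened sample data are hypothetical; the field names are the ones from the JSON above.

```python
from datetime import datetime, timezone


def iter_articles(msg_list):
    """Yield (published, title, url) for each article in a decoded msgList."""
    for message in msg_list.get("list", []):
        info = message.get("comm_msg_info", {})
        if info.get("type") != 49:  # only type 49 is an article message
            continue
        timestamp = info.get("datetime")
        published = (
            datetime.fromtimestamp(timestamp, tz=timezone.utc)
            if timestamp is not None
            else None
        )
        ext = message.get("app_msg_ext_info", {})
        # The headline article sits directly in app_msg_ext_info.
        yield published, ext.get("title", ""), ext.get("content_url", "")
        # Articles from the second one onward sit in multi_app_msg_item_list.
        if ext.get("is_multi") == 1:
            for item in ext.get("multi_app_msg_item_list", []):
                yield published, item.get("title", ""), item.get("content_url", "")


# Hypothetical sample with shortened titles and links.
sample = {
    "list": [
        {
            "app_msg_ext_info": {
                "title": "Headline",
                "content_url": "http://mp.weixin.qq.com/s?idx=1",
                "is_multi": 1,
                "multi_app_msg_item_list": [
                    {"title": "Second", "content_url": "http://mp.weixin.qq.com/s?idx=2"}
                ],
            },
            "comm_msg_info": {"datetime": 1480933315, "type": 49},
        },
        {"comm_msg_info": {"datetime": 1480933316, "type": 1}},
    ]
}
articles = list(iter_articles(sample))
for published, title, url in articles:
    print(published.isoformat(), title, url)
```

Messages whose type is not 49 are skipped here, on the assumption that only article messages are of interest for collection.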
One more thing to mention: to obtain older historical messages, you need to scroll the page down on your phone or emulator. When you reach the bottom, WeChat automatically loads the next page. The link address of the next page, like that of the history message page, also starts with getmasssendmsg, but its content is only JSON, not HTML, so you can parse the JSON directly.
At this point, you can use the method introduced in the previous article: have anyproxy match the value of the msgList variable with a regular expression and submit it to the server asynchronously, then use PHP's json_decode on the server to parse the JSON into an array. Looping through the array then gives the title and link address of each article.
If you only need to collect the content of a single public account, you can use anyproxy to obtain the complete link address with key and pass_ticket after each day's mass message has been sent. Then write a program yourself and submit the address to it manually. Use a language such as PHP to match msgList with a regular expression and then parse the JSON. This way, there is no need to modify the anyproxy rules, and no need to build a collection queue or a redirect page.
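The match-then-decode step can be sketched as follows. This is a Python illustration in which `json.loads` plays the role that json_decode plays in PHP. It assumes the page embeds the value as `msgList = '...';` inside a script, with the quotes HTML-escaped as `&quot;`; that layout is an assumption and should be checked against the actual page source. The name `extract_msg_list` and the sample page are hypothetical.

```python
import html
import json
import re

# Assumed layout: msgList = '<HTML-escaped JSON>';  (verify on the real page)
MSGLIST_RE = re.compile(r"msgList\s*=\s*'(.*?)'\s*;", re.S)


def extract_msg_list(page_html):
    """Return the decoded msgList value from a page, or None if not found."""
    match = MSGLIST_RE.search(page_html)
    if match is None:
        return None
    raw = html.unescape(match.group(1))  # turn &quot; back into "
    return json.loads(raw)


# Hypothetical, minimal page fragment.
sample_page = (
    "<script>var msgList = '{&quot;list&quot;:[{&quot;comm_msg_info&quot;:"
    "{&quot;type&quot;:49}}]}';</script>"
)
decoded = extract_msg_list(sample_page)
print(decoded["list"][0]["comm_msg_info"]["type"])
```

For the next-page responses, which the text says contain only JSON, the regular expression would be skipped and the response body decoded directly.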