Example of data capture from Sina News details page

PHP中文网 (original), 2017-06-21 15:23:29, 1364 views

The previous article, "Python Crawler: Capturing Sina News Data", explained in detail how to crawl the data on a Sina News details page. The code there, however, is hard to extend: every time a new details page is scraped, it has to be written out again. So here we organize it into functions that can be called directly.

Six pieces of data are captured from the details page: news title, number of comments, time, source, body text, and editor in charge.
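As a rough picture of what one scraped article looks like, the six fields can be treated as a single record. The sketch below is only illustrative: the type name NewsDetail and the sample values are assumptions, while the key names match those used in getNewsDetail later in this article.

```python
from datetime import datetime
from typing import TypedDict


class NewsDetail(TypedDict):
    """Hypothetical record type; the keys mirror the six captured fields."""
    title: str       # news title
    comments: int    # number of comments
    dt: datetime     # publication time
    source: str      # news source
    article: str     # body text
    editor: str      # editor in charge


# Made-up placeholder values, only to show the shape of one record.
example: NewsDetail = {
    'title': 'Sample title',
    'comments': 0,
    'dt': datetime(2017, 5, 14, 7, 22),
    'source': 'Sample source',
    'article': 'Sample text',
    'editor': 'Sample editor',
}
print(sorted(example))
```

A plain dict, as used in the rest of the article, works just as well; the type only documents which keys to expect.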

First, we wrap the comment-count logic in a function:

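    import requests
    import json
    import re
    # Comment link from the previous article; {} is filled with the news ID.
    comments_url = '{}&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20'

    def getCommentsCount(newsURL):
        # Pull the news ID out of a link such as .../doc-ifyfeius7904403.shtml
        ID = re.search(r'doc-i(.+)\.shtml', newsURL)
        newsID = ID.group(1)
        commentsURL = requests.get(comments_url.format(newsID))
        # The response is JavaScript ("var data={...}"); drop the prefix, then parse the JSON.
        commentsTotal = json.loads(re.sub(r'^\s*var data=', '', commentsURL.text))
        return commentsTotal['result']['count']['total']

    news = ''  # put the link of a Sina News details page here
    print(getCommentsCount(news))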

Line 5 defines comments_url. As the previous article showed, the comment link contains the news ID, and the comment data of different news items is reached by changing that ID. We therefore turn the link into a format string, with braces {} standing in for the news ID.

Next, we define the function getCommentsCount, which fetches the number of comments. It uses a regular expression to extract the news ID from the news link, requests the formatted comment link and stores the response in the variable commentsURL, and then parses the returned JS data as JSON into commentsTotal, from which the final comment count is read.

After that, we only need to supply a new news link and call getCommentsCount directly to get its number of comments.
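The three steps just described (filling the link template, extracting the news ID, and decoding the JS response) can be tried without network access. The following is a minimal sketch under stated assumptions: the template URL, the helper names extract_news_id and parse_comment_payload, and the sample response body are all made up for illustration; only the regular expression, the "var data=" prefix, and the result/count/total path come from the code above.

```python
import json
import re

# Hypothetical template; the real comment link comes from the previous article.
template = 'https://example.com/comments?newsid={}&page=1'


def extract_news_id(news_url):
    """Return the part of the link between 'doc-i' and '.shtml'."""
    match = re.search(r'doc-i(.+)\.shtml', news_url)
    if match is None:
        raise ValueError('no news ID found in: ' + news_url)
    return match.group(1)


def parse_comment_payload(text):
    """Drop the leading 'var data=' and parse the rest as JSON."""
    return json.loads(re.sub(r'^\s*var data=', '', text))


news = 'http://news.sina.com.cn/c/nd/2017-05-14/doc-ifyfeius7904403.shtml'
news_id = extract_news_id(news)
print(news_id)                    # fyfeius7904403
print(template.format(news_id))   # the template with the ID filled in

# Made-up response body in the shape the article's code expects.
sample = 'var data={"result": {"count": {"total": 618}}}'
print(parse_comment_payload(sample)['result']['count']['total'])  # 618
```

Raising an error when no ID is found makes a wrong link fail with a readable message instead of an AttributeError on None.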

Finally, we put all 6 pieces of data to be captured into one function, getNewsDetail, as follows:

    from bs4 import BeautifulSoup
    import requests
    from datetime import datetime
    import json
    import re

    comments_url = '{}&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20'

    def getCommentsCount(newsURL):
        ID = re.search(r'doc-i(.+)\.shtml', newsURL)
        newsID = ID.group(1)
        commentsURL = requests.get(comments_url.format(newsID))
        commentsTotal = json.loads(re.sub(r'^\s*var data=', '', commentsURL.text))
        return commentsTotal['result']['count']['total']

    # news = 'http://news.sina.com.cn/c/nd/2017-05-14/doc-ifyfeius7904403.shtml'
    # print(getCommentsCount(news))

    def getNewsDetail(news_url):
        result = {}
        web_data = requests.get(news_url)
        web_data.encoding = 'utf-8'
        soup = BeautifulSoup(web_data.text, 'lxml')
        result['title'] = soup.select('#artibodyTitle')[0].text
        result['comments'] = getCommentsCount(news_url)
        # The first child of .time-source is the timestamp text, e.g. 2017年05月14日07:22
        time = soup.select('.time-source')[0].contents[0].strip()
        result['dt'] = datetime.strptime(time, '%Y年%m月%d日%H:%M')
        result['source'] = soup.select('.time-source span span a')[0].text
        # Join every paragraph except the last <p>, which is not part of the body text
        result['article'] = ' '.join([p.text.strip() for p in soup.select('#artibody p')[:-1]])
        # Remove the "责任编辑" label as a prefix (lstrip would treat it as a set of characters)
        result['editor'] = re.sub(r'^\s*责任编辑[::]\s*', '', soup.select('.article-editor')[0].text).strip()
        return result

    print(getNewsDetail(''))  # put the link of a Sina News details page here

The function getNewsDetail collects the 6 pieces of data to be captured and stores them in result:

  • result['title'] holds the news title;

  • result['comments'] holds the number of comments, obtained by calling the function getCommentsCount defined at the beginning;

  • result['dt'] holds the time;

  • result['source'] holds the source;

  • result['article'] holds the body text;

  • result['editor'] holds the editor in charge.
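Two of these fields depend on plain string handling that can be checked in isolation: turning the timestamp text into a datetime and removing the "责任编辑" label in front of the editor's name. The sketch below assumes hypothetical helper names (parse_news_time, strip_label) and a made-up editor string; the date format is the one used in getNewsDetail. The label is written with an ASCII colon as in the listing, so a page that uses a full-width colon would need that character instead.

```python
from datetime import datetime


def parse_news_time(text):
    """Parse a timestamp such as 2017年05月14日07:22."""
    return datetime.strptime(text.strip(), '%Y年%m月%d日%H:%M')


def strip_label(text, label='责任编辑:'):
    """Remove label only when it is a real prefix of text."""
    text = text.strip()
    return text[len(label):].strip() if text.startswith(label) else text


print(parse_news_time('2017年05月14日07:22'))  # 2017-05-14 07:22:00

# Made-up editor line; the name starts with a character that is also in the label.
raw = '责任编辑:任小明 '
print(strip_label(raw))          # 任小明
print(raw.lstrip('责任编辑:'))    # 小明 (first character of the name is lost)
```

The last line shows why a prefix check is the safer choice: str.lstrip with an argument removes any leading characters found in that argument, not the argument as a whole.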

Then enter the news link you want to obtain data from and call this function.
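Since the point of wrapping the code in a function is reuse, one natural next step is to call it over a list of links. This is a minimal sketch, not part of the original code: collect_details and fake_get_detail are hypothetical names, and the stand-in function replaces getNewsDetail so the example runs without network access.

```python
def collect_details(news_urls, get_detail):
    """Call get_detail on each link; keep going if one page fails."""
    results, failed = [], []
    for url in news_urls:
        try:
            results.append(get_detail(url))
        except Exception as exc:  # e.g. a network error or a selector that matched nothing
            failed.append((url, repr(exc)))
    return results, failed


# Stand-in for getNewsDetail, used here only to keep the sketch offline.
def fake_get_detail(url):
    if not url.endswith('.shtml'):
        raise ValueError('not a details page')
    return {'title': 'Sample title', 'url': url}


urls = [
    'http://news.sina.com.cn/c/nd/2017-05-14/doc-ifyfeius7904403.shtml',
    'http://example.com/not-news',
]
results, failed = collect_details(urls, fake_get_detail)
print(len(results), len(failed))  # 1 1
```

In real use, getNewsDetail would be passed in place of fake_get_detail, and the failed list shows which links need a second look.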

Part of the results:

    {'title': 'The "instructor" who started teaching Wing Chun at the High School Affiliated to Zhejiang University is a third-generation disciple of Ip Man', 'comments': 618, 'dt': datetime.datetime(2017, 5, 14, 7, 22), 'source': 'China News Network', 'article': 'Original title: Zhejiang University Affiliated High School to start teaching Wing Chun "teacher" "This is Ip Man... Source: Qianjiang Evening News', 'editor': 'Zhang Di'}
