Example: Scraping Data from a Sina News Detail Page

PHP中文网 | Original | 2017-06-21 15:23:29 | 1481 views

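The previous article, "Python Crawler: Scraping Sina News Data", explained in detail how to crawl the relevant data from a Sina news detail page. The code there, however, was not organized for reuse: every time a new detail page had to be crawled, it needed to be rewritten. It is better to organize the code into functions that are written once and can then be called directly.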

Six pieces of data are captured from the detail page: news title, comment count, time, source, body text, and responsible editor.

First, wrap the comment-count logic in a function:

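    import requests
    import json
    import re
    # The part of the comments URL before the {} placeholder is missing from the source text.
    comments_url = '{}&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20'

    def getCommentsCount(newsURL):
        # Extract the news ID from a link such as .../doc-ifyfeius7904403.shtml
        ID = re.search(r'doc-i(.+)\.shtml', newsURL)
        newsID = ID.group(1)
        commentsURL = requests.get(comments_url.format(newsID))
        # The response is JavaScript of the form "var data={...}". Remove that exact
        # prefix (str.removeprefix, Python 3.9+) before parsing the JSON;
        # str.strip('var data=') would strip a set of characters, not the prefix.
        commentsTotal = json.loads(commentsURL.text.strip().removeprefix('var data='))
        return commentsTotal['result']['count']['total']

    news = ''  # news detail page link (left empty in the source)
    print(getCommentsCount(news))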

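Line 5 defines comments_url. As the previous article showed, the comment link contains the news ID, and the comment count of each news item is looked up through that ID. Because the ID changes from one news item to the next, the link is written as a format template in which the news ID is replaced by braces {}.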
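As a minimal sketch of these two steps, the snippet below extracts the news ID with a regular expression and fills it into a format template. The template `https://example.com/comments?newsid={}` is a hypothetical placeholder, not the real Sina comments endpoint; the news link is the one that appears (commented out) in the article's full listing; `extract_news_id` and `build_comments_url` are illustrative names.

```python
import re

# Hypothetical template for illustration only; not the real comments endpoint.
EXAMPLE_TEMPLATE = 'https://example.com/comments?newsid={}'

def extract_news_id(news_url):
    """Return the text between 'doc-i' and '.shtml', or None if the link has no such part."""
    match = re.search(r'doc-i(.+)\.shtml', news_url)
    return match.group(1) if match else None

def build_comments_url(news_url, template=EXAMPLE_TEMPLATE):
    """Fill the extracted news ID into the {} placeholder of the template."""
    news_id = extract_news_id(news_url)
    if news_id is None:
        raise ValueError('no news ID found in ' + news_url)
    return template.format(news_id)

news = 'http://news.sina.com.cn/c/nd/2017-05-14/doc-ifyfeius7904403.shtml'
print(extract_news_id(news))
print(build_comments_url(news))
```

Returning None for a link without an ID, instead of letting `.group(1)` fail on a missing match, makes it easier to skip links that are not news detail pages.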

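The getCommentsCount function fetches the comment count. A regular expression finds the news ID in the link, the response for the resulting comments link is stored in the commentsURL variable, and the JS-wrapped JSON is decoded into commentsTotal, from which the final comment count is read.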
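The decoding step can be tried offline. The sketch below assumes the response body has the form `var data={...}`, as the `'var data='` argument in the article's code implies. The sample payload is made up for illustration (only the `result`, `count`, `total` path comes from the article), and `parse_comment_total` is a hypothetical helper name.

```python
import json

PREFIX = 'var data='

def parse_comment_total(js_text):
    """Drop the JavaScript assignment prefix, then read result.count.total from the JSON."""
    body = js_text.strip()
    if body.startswith(PREFIX):
        body = body[len(PREFIX):]
    data = json.loads(body)
    return data['result']['count']['total']

# Made-up sample response; a real one would carry many more fields.
sample = 'var data={"result": {"count": {"total": 618}}}'
print(parse_comment_total(sample))
```

The prefix is sliced off here rather than passed to `str.strip`, because `strip` removes any of the given characters from both ends of the string instead of the exact prefix.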

After that, for any new news link, simply call getCommentsCount directly to get its comment count.

Finally, organize the six pieces of data to capture into a getNewsDetail function, as follows:

    from bs4 import BeautifulSoup
    import requests
    from datetime import datetime
    import json
    import re

    # The part of the comments URL before the {} placeholder is missing from the source text.
    comments_url = '{}&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20'

    def getCommentsCount(newsURL):
        ID = re.search(r'doc-i(.+)\.shtml', newsURL)
        newsID = ID.group(1)
        commentsURL = requests.get(comments_url.format(newsID))
        # Remove the exact "var data=" prefix (Python 3.9+) before parsing the JSON.
        commentsTotal = json.loads(commentsURL.text.strip().removeprefix('var data='))
        return commentsTotal['result']['count']['total']

    # news = 'http://news.sina.com.cn/c/nd/2017-05-14/doc-ifyfeius7904403.shtml'
    # print(getCommentsCount(news))

    def getNewsDetail(news_url):
        result = {}
        web_data = requests.get(news_url)
        web_data.encoding = 'utf-8'
        soup = BeautifulSoup(web_data.text, 'lxml')
        result['title'] = soup.select('#artibodyTitle')[0].text
        result['comments'] = getCommentsCount(news_url)
        # The first child of .time-source is the timestamp text, e.g. 2017年05月14日07:22
        time_str = soup.select('.time-source')[0].contents[0].strip()
        result['dt'] = datetime.strptime(time_str, '%Y年%m月%d日%H:%M')
        result['source'] = soup.select('.time-source span span a')[0].text
        # Join every body paragraph except the last one, as in the original code
        result['article'] = ' '.join([p.text.strip() for p in soup.select('#artibody p')[:-1]])
        # Remove the exact "责任编辑：" label; lstrip() would strip a set of characters instead
        result['editor'] = soup.select('.article-editor')[0].text.strip().removeprefix('责任编辑：')
        return result

    print(getNewsDetail(''))  # pass a news detail page link (left empty in the source)

The getNewsDetail function fetches the six pieces of data and stores them in result.

result['title'] gets the news title.
result['comments'] gets the comment count by directly calling the getCommentsCount function defined at the start.
result['dt'] gets the time, and result['source'] gets the source.
result['article'] gets the body text.
result['editor'] gets the responsible editor.
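Two of these fields depend on string handling that is easy to check in isolation. The sketch below parses a timestamp in the format used for result['dt'] and removes the 责任编辑： label from an editor line. The sample strings are illustrative, and `parse_news_time` and `strip_editor_label` are hypothetical helper names.

```python
from datetime import datetime

EDITOR_LABEL = '责任编辑：'

def parse_news_time(text):
    """Parse a timestamp such as 2017年05月14日07:22 into a datetime."""
    return datetime.strptime(text.strip(), '%Y年%m月%d日%H:%M')

def strip_editor_label(text):
    """Remove the exact label prefix, leaving the editor's name untouched."""
    text = text.strip()
    if text.startswith(EDITOR_LABEL):
        return text[len(EDITOR_LABEL):].strip()
    return text

print(parse_news_time('2017年05月14日07:22'))
print(strip_editor_label('责任编辑：Zhang Di'))

# lstrip() treats its argument as a set of characters, so a name made up
# of characters from the label (an invented example here) is removed entirely.
print(repr('责任编辑：任辑'.lstrip(EDITOR_LABEL)))
print(strip_editor_label('责任编辑：任辑'))
```

This is why a prefix check is a safer choice than `lstrip('责任编辑：')` when the name after the label is not known in advance.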

Then pass in the link of the news item whose data you want and call this function.

Part of the result:

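{'title': 'The Wing Chun "teacher" at the High School Attached to Zhejiang University is a third-generation disciple of Ip Man', 'comments': 618, 'dt': datetime.datetime(2017, 5, 14, 7, 22), 'source': 'China News Network', 'article': 'Original: High School Attached to Zhejiang University starts teaching Wing Chun; the "instructor" is Ip Man... Source: Qianjiang Evening News', 'editor': 'Zhang Di'}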
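If the result is to be saved or passed on as JSON, note that the datetime value under dt is not JSON-serializable by default. A minimal sketch, assuming a result dictionary shaped like the one shown here (values abbreviated and illustrative), converts it with `isoformat()`; `to_json` is a hypothetical helper name.

```python
import json
from datetime import datetime

def to_json(result):
    """Serialize a result dict, writing datetime values as ISO 8601 strings."""
    def encode(value):
        if isinstance(value, datetime):
            return value.isoformat()
        raise TypeError('cannot serialize ' + type(value).__name__)
    # ensure_ascii=False keeps Chinese text readable in the output
    return json.dumps(result, default=encode, ensure_ascii=False)

# Abbreviated, illustrative result dictionary.
sample = {
    'title': '...',
    'comments': 618,
    'dt': datetime(2017, 5, 14, 7, 22),
    'source': 'China News Network',
    'editor': 'Zhang Di',
}
print(to_json(sample))
```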
