Building a News Aggregation Project in Python

php中世界最好的语言 | Original | 2018-04-09 13:44:30 | 2626 views

This time we will build a news aggregation project in Python. What should you watch out for when building one?

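Let's start with the code and then analyze it piece by piece. The listing targets Python 2: it uses print statements and imports urlopen from urllib.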

from nntplib import NNTP
from time import strftime,time,localtime
from email import message_from_string
from urllib import urlopen
import textwrap
import re
day = 24*60*60
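def wrap(string,max=70):
    '''
    Wrap a string to a maximum line width.
    '''
    return '\n'.join(textwrap.wrap(string,max)) + '\n'
class NewsAgent:
    '''
    Distributes news items from news sources to news destinations.
    '''
    def __init__(self):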
        self.sources = []
        self.destinations = []
    def addSource(self,source):
        self.sources.append(source)
    def addDestination(self,dest):
        self.destinations.append(dest)
    def distribute(self):
        items = []
        for source in self.sources:
            items.extend(source.getItems())
        for dest in self.destinations:
            dest.receiveItems(items)
class NewsItem:
    '''
    A simple news item consisting of a title and a body text.
    '''
    def __init__(self,title,body):
        self.title = title
        self.body = body
class NNTPSource:
    '''
    A news source that retrieves news items from an NNTP group.
    '''
    def __init__(self,servername,group,window):
        self.servername = servername
        self.group = group
        self.window = window
    def getItems(self):
        start = localtime(time() - self.window*day)
        date = strftime('%y%m%d',start)
        hour = strftime('%H%M%S',start)
        server = NNTP(self.servername)
        ids = server.newnews(self.group,date,hour)[1]
        for id in ids:
            lines = server.article(id)[3]
            message = message_from_string('\n'.join(lines))
            title = message['subject']
            body = message.get_payload()
            if message.is_multipart():
                body = body[0]
            yield NewsItem(title,body)
        server.quit()
class SimpleWebSource:
    '''
    A news source that extracts news items from a web page
    using regular expressions.
    '''
    def __init__(self,url,titlePattern,bodyPattern):
        self.url = url
        self.titlePattern = re.compile(titlePattern)
        self.bodyPattern = re.compile(bodyPattern)
    def getItems(self):
        text = urlopen(self.url).read()
        titles = self.titlePattern.findall(text)
        bodies = self.bodyPattern.findall(text)
        for title,body in zip(titles,bodies):
            yield NewsItem(title,wrap(body))
class PlainDestination:
    '''
    A news destination that prints all its news items as plain text.
    '''
    def receiveItems(self,items):
        for item in items:
            print item.title
            print '-'*len(item.title)
            print item.body
class HTMLDestination:
    '''
    A news destination that writes all its news items to an HTML file.
    '''
    def __init__(self,filename):
        self.filename = filename
    def receiveItems(self,items):
        out = open(self.filename,'w')
        print >> out,'''
        <html>
        <head>
         <title>Today's News</title>
        </head>
        <body>
        <h1>Today's News</h1>
        '''
        print >> out, '<ul>'
        id = 0
        for item in items:
            id += 1
            print >> out, '<li><a href="#%i">%s</a></li>' % (id,item.title)
        print >> out, '</ul>'
        id = 0
        for item in items:
            id += 1
            print >> out, '<h2><a name="%i">%s</a></h2>' % (id,item.title)
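            print >> out, '<pre>%s</pre>' % item.body
        print >> out, '''
        </body>
        </html>
        '''
        out.close()
def runDefaultSetup():
    '''
    A default setup of sources and destinations.
    '''
    agent = NewsAgent()
    # A SimpleWebSource that retrieves news from the BBC news site.
    # The patterns assume each title is wrapped in <b>...</b> and each body
    # follows </a><br />; adjust them to the page's actual markup.
    bbc_url = 'http://news.bbc.co.uk/text_only.stm'
    bbc_title = r'(?s)a href="[^"]*">\s*<b>\s*(.*?)\s*</b>'
    bbc_body = r'(?s)</a>\s*<br />\s*(.*?)\s*<'
    bbc = SimpleWebSource(bbc_url, bbc_title, bbc_body)
    agent.addSource(bbc)
    # An NNTPSource that retrieves the last day's posts from a newsgroup
    clpa_server = 'news2.neva.ru'
    clpa_group = 'alt.sex.telephone'
    clpa_window = 1
    clpa = NNTPSource(clpa_server,clpa_group,clpa_window)
    agent.addSource(clpa)
    # Add a plain-text destination and an HTML destination
    agent.addDestination(PlainDestination())
    agent.addDestination(HTMLDestination('news.html'))
    # Distribute the news items
    agent.distribute()
if __name__ == '__main__': runDefaultSetup()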

First, let's look at the program as a whole. The core is NewsAgent: it stores the news sources and the destinations, then calls the source classes (NNTPSource and SimpleWebSource) and the classes that write the news out (PlainDestination and HTMLDestination). NNTPSource is dedicated to fetching items from a news server, while SimpleWebSource fetches data from a URL. The roles of PlainDestination and HTMLDestination are straightforward: the former prints the retrieved content to the terminal, and the latter writes it to an HTML file.
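The way SimpleWebSource pairs titles with bodies can be tried without any network access. The following is a minimal Python 3 sketch, not the article's code: the sample HTML, the two patterns, and the helper name `extract_items` are assumptions made up for illustration.

```python
import re
import textwrap

# Hypothetical page fragment; real pages will need different patterns.
SAMPLE_HTML = """
<a href="/story/1"><b>First headline</b></a><br />
Body of the first story.<p>
<a href="/story/2"><b>Second headline</b></a><br />
Body of the second story.<p>
"""

# Assumed patterns: a bold link text is a title, the text after <br /> is a body.
TITLE_PATTERN = re.compile(r'(?s)a href="[^"]*">\s*<b>\s*(.*?)\s*</b>')
BODY_PATTERN = re.compile(r'(?s)</a>\s*<br />\s*(.*?)\s*<')


def extract_items(text):
    """Return (title, wrapped_body) pairs, matched up by position."""
    titles = TITLE_PATTERN.findall(text)
    bodies = BODY_PATTERN.findall(text)
    # zip() stops at the shorter list, so a missed match silently drops items.
    return [(title, "\n".join(textwrap.wrap(body, 70)))
            for title, body in zip(titles, bodies)]


items = extract_items(SAMPLE_HTML)
for title, body in items:
    print(title)
    print("-" * len(title))
    print(body)
```

Because titles and bodies are found by two independent searches and paired by position, this approach is fragile: if one pattern misses a match, every later title lines up with the wrong body.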

With that analysis in hand, look at the main program: it simply adds the sources and the output destinations to a NewsAgent and then calls distribute().
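The wiring done by the main program can be sketched in Python 3 with stand-in classes. `StaticSource` and `CollectingDestination` are hypothetical names invented here so the flow can run without a news server or a web page; `NewsItem` and `NewsAgent` mirror the listing.

```python
class NewsItem:
    """A title plus a body, as in the listing."""
    def __init__(self, title, body):
        self.title = title
        self.body = body


class NewsAgent:
    """Collects items from every source and hands them to every destination."""
    def __init__(self):
        self.sources = []
        self.destinations = []

    def addSource(self, source):
        self.sources.append(source)

    def addDestination(self, dest):
        self.destinations.append(dest)

    def distribute(self):
        items = []
        for source in self.sources:
            # getItems() may be a generator; extend() drains it into the list.
            items.extend(source.getItems())
        for dest in self.destinations:
            dest.receiveItems(items)


class StaticSource:
    """Hypothetical source that yields fixed items instead of fetching them."""
    def __init__(self, pairs):
        self.pairs = pairs

    def getItems(self):
        for title, body in self.pairs:
            yield NewsItem(title, body)


class CollectingDestination:
    """Hypothetical destination that stores titles so they can be inspected."""
    def __init__(self):
        self.titles = []

    def receiveItems(self, items):
        self.titles.extend(item.title for item in items)


agent = NewsAgent()
agent.addSource(StaticSource([("Alpha", "first body")]))
agent.addSource(StaticSource([("Beta", "second body")]))
collector = CollectingDestination()
agent.addDestination(collector)
agent.distribute()
print(collector.titles)
```

Items arrive at each destination in the order the sources were added, since distribute() gathers everything into one list before handing it out.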

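It is a very simple program, but it is layered: NewsAgent knows nothing about NNTP, HTTP, or HTML. Any object with a getItems() method can serve as a source, and any object with a receiveItems() method can serve as a destination.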
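One consequence of this separation is that the output layer can be replaced on its own. As a hedged Python 3 sketch, the destination below renders the page into a string and escapes titles and bodies, which the listing does not do. The names `render_html` and `StringHTMLDestination` are assumptions introduced for this example, and `NewsItem` is a namedtuple stand-in for the class in the listing.

```python
from collections import namedtuple
from html import escape

# Stand-in for the listing's NewsItem class.
NewsItem = namedtuple("NewsItem", ["title", "body"])


def render_html(items):
    """Build the page as a string: an index of links, then one section per item."""
    parts = ["<html>", "<head><title>Today's News</title></head>", "<body>",
             "<h1>Today's News</h1>", "<ul>"]
    for number, item in enumerate(items, start=1):
        parts.append('<li><a href="#%i">%s</a></li>' % (number, escape(item.title)))
    parts.append("</ul>")
    for number, item in enumerate(items, start=1):
        parts.append('<h2><a name="%i">%s</a></h2>' % (number, escape(item.title)))
        parts.append("<pre>%s</pre>" % escape(item.body))
    parts.extend(["</body>", "</html>"])
    return "\n".join(parts)


class StringHTMLDestination:
    """Hypothetical destination that keeps the rendered page in memory."""
    def __init__(self):
        self.page = ""

    def receiveItems(self, items):
        self.page = render_html(items)


dest = StringHTMLDestination()
dest.receiveItems([NewsItem("R&D update", "x < y")])
print(dest.page)
```

Nothing in NewsAgent has to change to use such a destination; it only needs to be registered with addDestination().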

After working through this example, you should have a good feel for the approach.