基于scrapy实现的简单蜘蛛采集程序-파이썬 튜토리얼-php.cn

집

백엔드 개발

파이썬 튜토리얼

基于scrapy实现的简单蜘蛛采集程序

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

Jun 10, 2016 pm 03:14 PM

scrapy거미

本文实例讲述了基于scrapy实现的简单蜘蛛采集程序。分享给大家供大家参考。具体如下：

# Standard Python library imports
# 3rd party imports
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
# My imports
from poetry_analysis.items import PoetryAnalysisItem
HTML_FILE_NAME = r'.+\.html'
class PoetryParser(object):
  """
  Provides common parsing method for poems formatted this one specific way.
  """
  date_pattern = r'(\d{2} \w{3,9} \d{4})'
 
  def parse_poem(self, response):
    hxs = HtmlXPathSelector(response)
    item = PoetryAnalysisItem()
    # All poetry text is in pre tags
    text = hxs.select('//pre/text()').extract()
    item['text'] = ''.join(text)
    item['url'] = response.url
    # head/title contains title - a poem by author
    title_text = hxs.select('//head/title/text()').extract()[0]
    item['title'], item['author'] = title_text.split(' - ')
    item['author'] = item['author'].replace('a poem by', '')
    for key in ['title', 'author']:
      item[key] = item[key].strip()
    item['date'] = hxs.select("//p[@class='small']/text()").re(date_pattern)
    return item
class PoetrySpider(CrawlSpider, PoetryParser):
  name = 'example.com_poetry'
  allowed_domains = ['www.example.com']
  root_path = 'someuser/poetry/'
  start_urls = ['http://www.example.com/someuser/poetry/recent/',
         'http://www.example.com/someuser/poetry/less_recent/']
  rules = [Rule(SgmlLinkExtractor(allow=[start_urls[0] + HTML_FILE_NAME]),
                  callback='parse_poem'),
       Rule(SgmlLinkExtractor(allow=[start_urls[1] + HTML_FILE_NAME]),
                  callback='parse_poem')]

希望本文所述对大家的Python程序设计有所帮助。

성명

본 글의 내용은 네티즌들의 자발적인 기여로 작성되었으며, 저작권은 원저작자에게 있습니다. 본 사이트는 이에 상응하는 법적 책임을 지지 않습니다. 표절이나 침해가 의심되는 콘텐츠를 발견한 경우 admin@php.cn으로 문의하세요.

관련 기사

Python의 하이브리드 접근법 : 컴파일 및 해석 결합May 08, 2025 am 12:16 AM

PythonuseSahybrideactroach, combingingcompytobytecodeandingretation.1) codeiscompiledToplatform-IndependentBecode.2) bytecodeistredbythepythonvirtonmachine, enterancingefficiency andportability.

Python 's 'for'와 'whind'루프의 차이점을 배우십시오May 08, 2025 am 12:11 AM

"for"and "while"loopsare : 1) "에 대한"loopsareIdealforitertatingOverSorkNowniterations, whide2) "weekepindiTeRations.Un

Python Concatenate는 중복과 함께 목록입니다May 08, 2025 am 12:09 AM

Python에서는 다양한 방법을 통해 목록을 연결하고 중복 요소를 관리 할 수 있습니다. 1) 연산자를 사용하거나 ()을 사용하여 모든 중복 요소를 유지합니다. 2) 세트로 변환 한 다음 모든 중복 요소를 제거하기 위해 목록으로 돌아가지 만 원래 순서는 손실됩니다. 3) 루프 또는 목록 이해를 사용하여 세트를 결합하여 중복 요소를 제거하고 원래 순서를 유지하십시오.

파이썬 목록 연결 성능 : 속도 비교May 08, 2025 am 12:09 AM

fastestestestedforListCancatenationInpythondSpendsonListsize : 1) Forsmalllist, OperatoriseFficient.2) ForlargerLists, list.extend () OrlistComprehensionIsfaster, withextend () morememory-efficientBymodingListsin-splace.

Python 목록에 요소를 어떻게 삽입합니까?May 08, 2025 am 12:07 AM

toInsertElmentsIntoapyThonList, useAppend () toaddtotheend, insert () foraspecificposition, andextend () andextend () formultipleElements.1) useappend () foraddingsingleitemstotheend.2) useinsert () toaddatespecificindex, 그러나)

Python은 후드 아래에 동적 배열 또는 링크 된 목록이 있습니까?May 07, 2025 am 12:16 AM

pythonlistsareimplementedesdynamicarrays, notlinkedlists.1) thearestoredIntIguousUousUousUousUousUousUousUousUousUousInSeripendExeDaccess, LeadingSpyTHOCESS, ImpactingEperformance

파이썬 목록에서 요소를 어떻게 제거합니까?May 07, 2025 am 12:15 AM

PythonoffersfourmainmethodstoremoveElementsfromalist : 1) 제거 (값) 제거 (값) removesthefirstoccurrencefavalue, 2) pop (index) 제거 elementatAspecifiedIndex, 3) delstatemeveselementsByindexorSlice, 4) RemovesAllestemsfromTheChmetho

스크립트를 실행하려고 할 때 '허가 거부'오류가 발생하면 무엇을 확인해야합니까?May 07, 2025 am 12:12 AM

Toresolvea "permissionDenied"오류가 발생할 때 오류가 발생합니다.

See all articles

핫 AI 도구

Undresser.AI Undress

사실적인 누드 사진을 만들기 위한 AI 기반 앱

AI Clothes Remover

사진에서 옷을 제거하는 온라인 AI 도구입니다.

Undress AI Tool

무료로 이미지를 벗다

Clothoff.io

AI 옷 제거제

Video Face Swap

완전히 무료인 AI 얼굴 교환 도구를 사용하여 모든 비디오의 얼굴을 쉽게 바꾸세요!

뜨거운 도구

SecList

SecLists는 최고의 보안 테스터의 동반자입니다. 보안 평가 시 자주 사용되는 다양한 유형의 목록을 한 곳에 모아 놓은 것입니다. SecLists는 보안 테스터에게 필요할 수 있는 모든 목록을 편리하게 제공하여 보안 테스트를 더욱 효율적이고 생산적으로 만드는 데 도움이 됩니다. 목록 유형에는 사용자 이름, 비밀번호, URL, 퍼징 페이로드, 민감한 데이터 패턴, 웹 셸 등이 포함됩니다. 테스터는 이 저장소를 새로운 테스트 시스템으로 간단히 가져올 수 있으며 필요한 모든 유형의 목록에 액세스할 수 있습니다.