
How to use the Scrapy framework to crawl Jingdong (JD.com) data in a loop and import it into MySQL

零到壹度 | Original | 2018-03-30 10:20:23 | 1943 views

This article shows how to use the Scrapy framework to crawl JD.com (Jingdong) listing pages in a loop and then import the results into MySQL. I hope it is a useful reference.

JD.com has an anti-crawling mechanism, so I set a User-Agent header to make the requests look like they come from a browser.
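As a minimal sketch of what "pretending to be a browser" looks like at the HTTP level, the snippet below wraps the User-Agent (and an optional proxy) in a small helper, using only the standard library. The names `build_browser_request` and `fetch` are hypothetical, and routing https through the same proxy is an assumption; the header string and the listing URL are the ones used in this article. Building the request sends nothing over the network.

```python
import urllib.error
import urllib.request

# User-Agent string taken from the spider code in this article.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
)


def build_browser_request(url, proxy_addr=None):
    """Return (opener, request) without sending anything."""
    handlers = []
    if proxy_addr:
        # Assumption: use the same proxy for http and https URLs.
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_addr, "https": proxy_addr})
        )
    opener = urllib.request.build_opener(*handlers)
    request = urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})
    return opener, request


def fetch(url, proxy_addr=None, timeout=10):
    """Send the request and return the decoded body, or None on failure."""
    opener, request = build_browser_request(url, proxy_addr)
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.read().decode("utf-8", "ignore")
    except urllib.error.URLError as exc:
        # HTTPError is a subclass of URLError and additionally carries .code.
        print("request failed:", getattr(exc, "code", None), exc.reason)
        return None


opener, request = build_browser_request(
    "https://list.jd.com/list.html?cat=9987,653,655&page=1"
)
# urllib stores header names via str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))
```

Unlike `urllib.request.install_opener`, this sketch keeps the opener local to the helper, so other code that calls `urlopen` is not affected.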

The data crawled is the mobile phone listing on JD Mall. URL: https://list.jd.com/list.html?cat=9987,653,655&page=1

That is about 9,000 records; products that do not appear in the listing are not counted.
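The "loop" over the listing amounts to varying the `page` query parameter. A small sketch, assuming the listing still runs to page 165 (the upper bound mirrors `range(2, 166)` in the spider code in this article); `listing_urls` is a hypothetical helper name:

```python
BASE_URL = "https://list.jd.com/list.html?cat=9987,653,655&page={page}"


def listing_urls(first_page=1, last_page=165):
    """Return the listing URLs for first_page..last_page, inclusive."""
    return [BASE_URL.format(page=n) for n in range(first_page, last_page + 1)]


urls = listing_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```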

Problems encountered:

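1. It is best to wrap the User-Agent handling in a method (use_proxy). I first wrote that code directly inside parse and got a "not enough values to unpack" error, without knowing which statement caused it. After printing after each statement I found that the problem was at urlopen(), but repeated attempts and searching online did not reveal what was wrong. Moving the code into its own method fixed it. Looking back, the cause was probably that the parse method is meant to handle the response.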

2. Before importing the data into MySQL, I first tried writing it to a file, and found that the contents kept being overwritten. At first I thought fh.close() was in the wrong place, then I realized that in fh = open("D:/pythonlianxi/result/4.txt", "w") the 'w' had to be changed to 'a'.

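3. When importing into the database, the main problem was character encoding. First open mysql and run show variables like '%char%'; to see the database's character-set encoding, then use the matching one. Mine was utf8, so gbk did not work. Also, don't forget charset='utf8' when connecting to MySQL.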
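The overwriting described in problem 2 can be reproduced in a few lines. This is a sketch that assumes the file is reopened for every batch of records; it writes to a temporary file rather than the path used in the article, and the record strings are made up for illustration:

```python
import os
import tempfile

records = ["phone A", "phone B", "phone C"]  # made-up sample records
path = os.path.join(tempfile.mkdtemp(), "result.txt")

# Mode "w" truncates the file on every open, so only the last record survives.
for record in records:
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(record + "\n")
with open(path, encoding="utf-8") as fh:
    after_w = fh.read().splitlines()

os.remove(path)

# Mode "a" appends to whatever is already there, so every record is kept.
for record in records:
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(record + "\n")
with open(path, encoding="utf-8") as fh:
    after_a = fh.read().splitlines()

print(after_w)  # ['phone C']
print(after_a)  # ['phone A', 'phone B', 'phone C']
```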

The specific code is as follows.

    conn = pymysql.connect(host="127.0.0.1", user="root", passwd="root", db="jingdong", charset="utf8")
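The character-set check from problem 3 can also be run from Python before settling on a `charset` value. A minimal sketch: `server_charsets` is a hypothetical helper that accepts any open DB-API connection whose cursor returns rows as tuples (pymysql's default cursor does), and the commented usage reuses the connection parameters from this article.

```python
def server_charsets(conn):
    """Return the server's character-set variables as a {name: value} dict."""
    cursor = conn.cursor()
    # No parameters are passed, so the literal % signs need no escaping.
    cursor.execute("SHOW VARIABLES LIKE '%char%'")
    rows = cursor.fetchall()
    cursor.close()
    return {name: value for name, value in rows}


# Usage against a real server (requires the third-party pymysql package):
#   import pymysql
#   conn = pymysql.connect(host="127.0.0.1", user="root", passwd="root",
#                          db="jingdong", charset="utf8")
#   print(server_charsets(conn)["character_set_database"])
#   conn.close()
```

If the reported database character set differs from the `charset` passed to `connect`, non-ASCII product titles are the first place the mismatch is likely to show up.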

    import scrapy
    from scrapy.http import Request
    from jingdong.items import JingdongItem
    import re
    import urllib.error
    import urllib.request
    import pymysql


    class JdSpider(scrapy.Spider):
        name = 'jd'
        allowed_domains = ['jd.com']
        # start_urls = ['http://jd.com/']
        # Browser-like User-Agent, because JD.com has an anti-crawling mechanism.
        header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"}
        # fh = open("D:/pythonlianxi/result/4.txt", "w")

        def start_requests(self):
            return [Request("https://list.jd.com/list.html?cat=9987,653,655&page=1", callback=self.parse, headers=self.header, meta={"cookiejar": 1})]

        def use_proxy(self, proxy_addr, url):
            """Fetch url through the proxy; returns the decoded body, or None on error."""
            try:
                req = urllib.request.Request(url)
                req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36")
                proxy = urllib.request.ProxyHandler({"http": proxy_addr})
                opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
                urllib.request.install_opener(opener)
                data = urllib.request.urlopen(req).read().decode("utf-8", "ignore")
                return data
            except urllib.error.URLError as e:
                if hasattr(e, "code"):
                    print(e.code)
                if hasattr(e, "reason"):
                    print(e.reason)
            except Exception as e:
                print(str(e))

        def parse(self, response):
            item = JingdongItem()
            proxy_addr = "61.135.217.7:80"
            try:
                item["title"] = response.xpath("//p[@class='p-name']/a[@target='_blank']/em/text()").extract()
                item["pricesku"] = response.xpath("//li[@class='gl-item']/p/@data-sku").extract()

                # Queue the remaining listing pages; Scrapy's duplicate filter
                # drops the repeats when later pages yield the same URLs again.
                for j in range(2, 166):
                    url = "https://list.jd.com/list.html?cat=9987,653,655&page=" + str(j)
                    print(j)
                    # Send the same User-Agent on follow-up pages as on page 1.
                    yield Request(url, callback=self.parse, headers=self.header)

                pricepat = '"p":"(.*?)"'
                personpat = '"CommentCountStr":"(.*?)",'
                print("2k")
                # fh = open("D:/pythonlianxi/result/5.txt", "a")
                conn = pymysql.connect(host="127.0.0.1", user="root", passwd="root", db="jingdong", charset="utf8")

                # Look up the price and comment count for every SKU on this page.
                for i in range(0, len(item["pricesku"])):
                    priceurl = "https://p.3.cn/prices/mgets?&ext=11000000&pin=&type=1&area=1_72_4137_0&skuIds=" + item["pricesku"][i]
                    personurl = "https://club.jd.com/comment/productCommentSummaries.action?referenceIds=" + item["pricesku"][i]
                    pricedata = self.use_proxy(proxy_addr, priceurl)
                    price = re.compile(pricepat).findall(pricedata)
                    persondata = self.use_proxy(proxy_addr, personurl)
                    person = re.compile(personpat).findall(persondata)

                    title = item["title"][i]
                    print(title)
                    price1 = float(price[0])
                    # print(price1)
                    person1 = person[0]
                    # fh.write(title + "\n" + price[0] + "\n" + person1 + "\n")
                    cursor = conn.cursor()
                    sql = "insert into jd(title,price,person) values(%s,%s,%s);"
                    params = (title, price1, person1)
                    print("4")
                    cursor.execute(sql, params)
                    conn.commit()

                # fh.close()
                conn.close()
                # parse is a generator, so the item is yielded; a returned value
                # would be discarded.
                yield item
            except Exception as e:
                print(str(e))
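The two regular expressions in the spider can be tried in isolation. In the sketch below the payload strings are made up purely to match the patterns and are not real JD.com responses; `first_match` is a hypothetical helper that returns None instead of raising when a fetch failed or nothing matched, which the indexing `price[0]` in the spider would not tolerate.

```python
import re

# Patterns copied from the spider code in this article.
PRICE_PAT = re.compile(r'"p":"(.*?)"')
COUNT_PAT = re.compile(r'"CommentCountStr":"(.*?)",')

# Made-up payloads, shaped only so that the patterns have something to match.
sample_price = '[{"id":"J_100000","p":"1999.00","m":"2999.00"}]'
sample_comments = '{"CommentsCount":[{"SkuId":100000,"CommentCountStr":"120000+","GoodRate":0.98}]}'


def first_match(pattern, text):
    """Return the first captured group, or None if text is empty or has no match."""
    match = pattern.search(text or "")
    return match.group(1) if match else None


price = first_match(PRICE_PAT, sample_price)
count = first_match(COUNT_PAT, sample_comments)
print(float(price), count)            # 1999.0 120000+
print(first_match(PRICE_PAT, None))   # None
```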

