
How to write example crawler code in Python

coldplay.xixi (Original)
2020-08-11 13:58:52

Python crawler example, in brief: first fetch the page with urllib, carrying browser (User-Agent) headers and using urlencode to generate POST data where needed; then install pymysql and store the scraped data in MySQL.


Python crawler code example:

1. urllib and BeautifulSoup

Fetch a page

from urllib import request
req = request.urlopen("http://www.baidu.com")
print(req.read().decode("utf-8"))

Simulate a real browser: carry a User-Agent header

(The purpose is to keep the server from identifying the request as a crawler. Without this header, the request may fail with an error.)

req = request.Request(url)  # url is the target address
req.add_header("User-Agent", "Mozilla/5.0")  # key is "User-Agent", value is the browser version string
resp = request.urlopen(req)
print(resp.read().decode("utf-8"))
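Equivalently, the headers can be passed to the Request constructor directly; a minimal sketch, where the User-Agent string is just an illustrative value:

```python
from urllib import request

# An illustrative User-Agent string; real crawlers often copy one from a browser
ua = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"

# headers= sets the request headers at construction time,
# equivalent to calling add_header() afterwards
req = request.Request("http://www.baidu.com", headers={"User-Agent": ua})

# urllib normalizes header names to "Xxxx-xxxx" capitalization
print(req.get_header("User-agent"))
```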


Sending a POST request

Import parse from the urllib library:

from urllib import parse

Use urlencode to generate post data

postData = parse.urlencode([
    ("key1", "val1"),
    ("key2", "val2"),
    ("keyn", "valn")
])
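urlencode joins the pairs with & and percent-encodes the values; a quick check with made-up keys:

```python
from urllib import parse

# Pairs are encoded in order; spaces become "+" under the default quoting
postData = parse.urlencode([
    ("q", "python crawler"),
    ("page", "1")
])
print(postData)  # → q=python+crawler&page=1
```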

Send the POST request

resp = request.urlopen(req, data=postData.encode("utf-8"))  # send a POST request carrying postData
resp.status  # the HTTP status code
resp.reason  # the HTTP reason phrase (e.g. "OK")
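Putting the pieces together without touching the network: attaching a data body to a Request switches its method to POST. A sketch, where httpbin.org is only a placeholder endpoint:

```python
from urllib import parse, request

postData = parse.urlencode([("name", "test"), ("lang", "python")])

# A Request with a data body is sent as POST; without data it would be GET
req = request.Request("http://httpbin.org/post",
                      data=postData.encode("utf-8"))
print(req.get_method())  # → POST
# resp = request.urlopen(req)  # would actually send the request
```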

Complete code example (crawling links from the Wikipedia main page)

#-*- coding:utf-8 -*-
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import re
import ssl
# Fetch Wikipedia entry information
ssl._create_default_https_context = ssl._create_unverified_context  # globally disable certificate verification
# Request the URL and decode the result as utf-8
req = urlopen("https://en.wikipedia.org/wiki/Main_Page").read().decode("utf-8")
# Parse with BeautifulSoup
soup = bs(req, "html.parser")
# print(soup)
# Get all <a> tags whose href attribute starts with "/wiki/Special"
urllist = soup.findAll("a", href=re.compile("^/wiki/Special"))
for url in urllist:
    # Skip links ending in .jpg or .JPG
    if not re.search(r"\.(jpg|JPG)$", url["href"]):
        # get_text() returns all content under the tag, including child tags;
        # string returns a single string, or None if the tag has child tags
        print(url.get_text() + "----->" + url["href"])
        # print(url)
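The same findAll pattern can be exercised offline on an inline HTML snippet; the links below are made up for illustration:

```python
import re
from bs4 import BeautifulSoup as bs

# A tiny hand-written document standing in for the downloaded page
html = """
<a href="/wiki/Special:Random">Random article</a>
<a href="/wiki/Special:Photo.jpg">Photo</a>
<a href="/wiki/Python">Python</a>
"""
soup = bs(html, "html.parser")
links = []
# Same filter as above: Special pages, excluding image links
for url in soup.findAll("a", href=re.compile("^/wiki/Special")):
    if not re.search(r"\.(jpg|JPG)$", url["href"]):
        links.append(url.get_text() + "----->" + url["href"])
print(links)  # → ['Random article----->/wiki/Special:Random']
```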

2. Store data in MySQL

Install pymysql

Install via pip:

$ pip install pymysql

or install from source:

$ python setup.py install

Usage

# Import the package
import pymysql.cursors
# Get a database connection
connection = pymysql.connect(host="localhost",
                             user="root",
                             password="123456",
                             db="wikiurl",
                             charset="utf8mb4")
try:
    # Get a cursor
    with connection.cursor() as cursor:
        # Build the SQL statement
        sql = "insert into `tableName`(`urlname`,`urlhref`) values(%s,%s)"
        # Execute the SQL statement
        cursor.execute(sql, (url.get_text(), "https://en.wikipedia.org" + url["href"]))
        # Commit
        connection.commit()
finally:
    # Close the connection
    connection.close()
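The single-row insert above can also be batched: build the (urlname, urlhref) parameter tuples from the scraped links first, then hand them all to cursor.executemany in one call. A sketch of the batching step alone, with made-up link pairs (table and column names follow the article):

```python
# Scraped (text, href) pairs as they would come from the loop above (made-up data)
scraped = [
    ("Random article", "/wiki/Special:Random"),
    ("Recent changes", "/wiki/Special:RecentChanges"),
]

base = "https://en.wikipedia.org"
# Build parameter tuples matching the %s placeholders in the insert statement
rows = [(text, base + href) for text, href in scraped]
print(rows)

sql = "insert into `tableName`(`urlname`,`urlhref`) values(%s,%s)"
# With a live connection this would be:
# with connection.cursor() as cursor:
#     cursor.executemany(sql, rows)
#     connection.commit()
```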

3. Precautions for crawlers

The Robots protocol (also known as the crawler protocol), formally the "Robots Exclusion Protocol", is how a website tells search engines and crawlers which pages may be crawled and which may not. It is generally placed at the site root, e.g. https://en.wikipedia.org/robots.txt

Disallow: access not allowed
Allow: access allowed
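The standard library can evaluate these rules directly via urllib.robotparser; a sketch with a hand-written rule set (not Wikipedia's actual robots.txt):

```python
from urllib.robotparser import RobotFileParser

# Parse an in-memory robots.txt instead of fetching one over the network
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
])
print(rp.can_fetch("*", "https://example.com/private/page"))  # → False
print(rp.can_fetch("*", "https://example.com/public/page"))   # → True
```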


The above is the detailed content of How to write example crawler code in Python. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn