search
HomeBackend DevelopmentPython TutorialDetailed explanation of the method of crawling 51cto data in Python and storing it in MySQL

Detailed explanation of the method of crawling 51cto data in Python and storing it in MySQL

[Related learning recommendations: python tutorial

Experimental environment

1. Install Python 3.7

2. Install requests, bs4, pymysql module

Experimental steps 1. Installation environment and module

Please refer to https://www. jb51.net/article/194104.htm

2.Write code

# 51cto 博客页面数据插入mysql数据库
# 导入模块
import re
import bs4
import pymysql
import requests

# 连接数据库账号密码
db = pymysql.connect(host='172.171.13.229',
           user='root', passwd='abc123',
           db='test', port=3306,
           charset='utf8')
# 获取游标
cursor = db.cursor()

def open_url(url):
  # 连接模拟网页访问
  headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
           'Chrome/57.0.2987.98 Safari/537.36'}
  res = requests.get(url, headers=headers)
  return res

# 爬取网页内容
def find_text(res):
  soup = bs4.BeautifulSoup(res.text, 'html.parser')

  # 博客名
  titles = []
  targets = soup.find_all("a", class_="tit")
  for each in targets:
    each = each.text.strip()
    if "置顶" in each:
      each = each.split(' ')[0]
    titles.append(each)

  # 阅读量
  reads = []
  read1 = soup.find_all("p", class_="read fl on")
  read2 = soup.find_all("p", class_="read fl")
  for each in read1:
    reads.append(each.text)
  for each in read2:
    reads.append(each.text)

  # 评论数
  comment = []
  targets = soup.find_all("p", class_='comment fl')
  for each in targets:
    comment.append(each.text)

  # 收藏
  collects = []
  targets = soup.find_all("p", class_='collect fl')
  for each in targets:
    collects.append(each.text)

   # 发布时间
  dates=[]
  targets = soup.find_all("a", class_='time fl')
  for each in targets:
    each = each.text.split(':')[1]
    dates.append(each)

  # 插入sql 语句
  sql = """insert into blog (blog_title,read_number,comment_number, collect, dates)
  values( '%s', '%s', '%s', '%s', '%s');"""
  # 替换页面 \xa0
  for titles, reads, comment, collects, dates in zip(titles, reads, comment, collects, dates):
    reads = re.sub('\s', '', reads)
    comment = re.sub('\s', '', comment)
    collects = re.sub('\s', '', collects)
    cursor.execute(sql % (titles, reads, comment, collects,dates))
    db.commit()
    pass

# 统计总页数
def find_depth(res):
  soup = bs4.BeautifulSoup(res.text, 'html.parser')
  depth = soup.find('li', class_='next').previous_sibling.previous_sibling.text
  return int(depth)

# 主函数
def main():
  host = "https://blog.51cto.com/13760351"
  res = open_url(host) # 打开首页链接
  depth = find_depth(res) # 获取总页数

  # 爬取其他页面信息
  for i in range(1, depth + 1):
    url = host + '/p' + str(i) # 完整链接
    res = open_url(url) # 打开其他链接
    find_text(res) # 爬取数据

  # 关闭游标
  cursor.close()
  # 关闭数据库连接
  db.close()

if __name__ == '__main__':
  main()

3..MySQL creates the corresponding table

CREATE TABLE `blog` (
 `row_id` int(11) NOT NULL AUTO_INCREMENT COMMENT '主键',
 `blog_title` varchar(52) DEFAULT NULL COMMENT '博客标题',
 `read_number` varchar(26) DEFAULT NULL COMMENT '阅读数量',
 `comment_number` varchar(16) DEFAULT NULL COMMENT '评论数量',
 `collect` varchar(16) DEFAULT NULL COMMENT '收藏数量',
 `dates` varchar(16) DEFAULT NULL COMMENT '发布日期',
 PRIMARY KEY (`row_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

## 4. Run the code and check the effect:

Improved version:

Improved content:


1. A certain item in the database Some fields can only retain numbers


2. By default, the content crawled is string, which stores some fields of the database. It is best to change them to integers to facilitate subsequent database operations


1. The code is as follows:

import re
import bs4
import pymysql
import requests

# 连接数据库
db = pymysql.connect(host='172.171.13.229',
           user='root', passwd='abc123',
           db='test', port=3306,
           charset='utf8')
# 获取游标
cursor = db.cursor()

def open_url(url):
  # 连接模拟网页访问
  headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
           'Chrome/57.0.2987.98 Safari/537.36'}
  res = requests.get(url, headers=headers)
  return res

# 爬取网页内容
def find_text(res):
  soup = bs4.BeautifulSoup(res.text, 'html.parser')

  # 博客标题
  titles = []
  targets = soup.find_all("a", class_="tit")
  for each in targets:
    each = each.text.strip()
    if "置顶" in each:
      each = each.split(' ')[0]
    titles.append(each)

  # 阅读量
  reads = []
  read1 = soup.find_all("p", class_="read fl on")
  read2 = soup.find_all("p", class_="read fl")
  for each in read1:
    reads.append(each.text)
  for each in read2:
    reads.append(each.text)

  # 评论数
  comment = []
  targets = soup.find_all("p", class_='comment fl')
  for each in targets:
    comment.append(each.text)

  # 收藏
  collects = []
  targets = soup.find_all("p", class_='collect fl')
  for each in targets:
    collects.append(each.text)

  # 发布时间
  dates=[]
  targets = soup.find_all("a", class_='time fl')
  for each in targets:
    each = each.text.split(':')[1]
    dates.append(each)

  # 插入sql 语句
  sql = """insert into blogs (blog_title,read_number,comment_number, collect, dates)
  values( '%s', '%s', '%s', '%s', '%s');"""
  # 替换页面 \xa0
  for titles, reads, comment, collects, dates in zip(titles, reads, comment, collects, dates):
    reads = re.sub('\s', '', reads)
    reads=int(re.sub('\D', "", reads)) #匹配数字,转换为整型
    comment = re.sub('\s', '', comment)
    comment = int(re.sub('\D', "", comment)) #匹配数字,转换为整型
    collects = re.sub('\s', '', collects)
    collects = int(re.sub('\D', "", collects)) #匹配数字,转换为整型
    dates = re.sub('\s', '', dates)
    cursor.execute(sql % (titles, reads, comment, collects,dates))
    db.commit()
    pass

# 统计总页数
def find_depth(res):
  soup = bs4.BeautifulSoup(res.text, 'html.parser')
  depth = soup.find('li', class_='next').previous_sibling.previous_sibling.text
  return int(depth)

# 主函数
def main():
  host = "https://blog.51cto.com/13760351"
  res = open_url(host) # 打开首页链接
  depth = find_depth(res) # 获取总页数

  # 爬取其他页面信息
  for i in range(1, depth + 1):
    url = host + '/p' + str(i) # 完整链接
    res = open_url(url) # 打开其他链接
    find_text(res) # 爬取数据

  # 关闭游标
  cursor.close()
  # 关闭数据库连接
  db.close()

#主程序入口
if __name__ == '__main__':
  main()

2. Create the corresponding table

CREATE TABLE `blogs` (
 `row_id` int(11) NOT NULL AUTO_INCREMENT COMMENT '主键',
 `blog_title` varchar(52) DEFAULT NULL COMMENT '博客标题',
 `read_number` int(26) DEFAULT NULL COMMENT '阅读数量',
 `comment_number` int(16) DEFAULT NULL COMMENT '评论数量',
 `collect` int(16) DEFAULT NULL COMMENT '收藏数量',
 `dates` varchar(16) DEFAULT NULL COMMENT '发布日期',
 PRIMARY KEY (`row_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

3. Run the code and verify

Upgraded version

In order It allows novices to use this program, and can package this project into an exe format file, so that others can run the code using a computer, which is very convenient!

1. Improve the code:

#末尾修改为:
if __name__ == '__main__':
  main()
  print("\n\t\t所有数据已成功存放数据库!!! \n")
  time.sleep(5)

2. Install the packaging module pyinstaller (cmd installation)

pip install pyinstaller -i https://pypi.tuna. tsinghua.edu.cn/simple/

3. Python code packaging

1. Switch to the path where the code needs to be packaged


2. Run pyinstaller -F test03.py in the cmd window (test03 is the project name)


4. Check the exe package

dist will appear after packaging Directory, the package will be in this directory


5. Run the exe package and check the effect

Check database

Related learning recommendations:

mysql tutorial

The above is the detailed content of Detailed explanation of the method of crawling 51cto data in Python and storing it in MySQL. For more information, please follow other related articles on the PHP Chinese website!

Statement
This article is reproduced at:jb51. If there is any infringement, please contact admin@php.cn delete
The Main Purpose of Python: Flexibility and Ease of UseThe Main Purpose of Python: Flexibility and Ease of UseApr 17, 2025 am 12:14 AM

Python's flexibility is reflected in multi-paradigm support and dynamic type systems, while ease of use comes from a simple syntax and rich standard library. 1. Flexibility: Supports object-oriented, functional and procedural programming, and dynamic type systems improve development efficiency. 2. Ease of use: The grammar is close to natural language, the standard library covers a wide range of functions, and simplifies the development process.

Python: The Power of Versatile ProgrammingPython: The Power of Versatile ProgrammingApr 17, 2025 am 12:09 AM

Python is highly favored for its simplicity and power, suitable for all needs from beginners to advanced developers. Its versatility is reflected in: 1) Easy to learn and use, simple syntax; 2) Rich libraries and frameworks, such as NumPy, Pandas, etc.; 3) Cross-platform support, which can be run on a variety of operating systems; 4) Suitable for scripting and automation tasks to improve work efficiency.

Learning Python in 2 Hours a Day: A Practical GuideLearning Python in 2 Hours a Day: A Practical GuideApr 17, 2025 am 12:05 AM

Yes, learn Python in two hours a day. 1. Develop a reasonable study plan, 2. Select the right learning resources, 3. Consolidate the knowledge learned through practice. These steps can help you master Python in a short time.

Python vs. C  : Pros and Cons for DevelopersPython vs. C : Pros and Cons for DevelopersApr 17, 2025 am 12:04 AM

Python is suitable for rapid development and data processing, while C is suitable for high performance and underlying control. 1) Python is easy to use, with concise syntax, and is suitable for data science and web development. 2) C has high performance and accurate control, and is often used in gaming and system programming.

Python: Time Commitment and Learning PacePython: Time Commitment and Learning PaceApr 17, 2025 am 12:03 AM

The time required to learn Python varies from person to person, mainly influenced by previous programming experience, learning motivation, learning resources and methods, and learning rhythm. Set realistic learning goals and learn best through practical projects.

Python: Automation, Scripting, and Task ManagementPython: Automation, Scripting, and Task ManagementApr 16, 2025 am 12:14 AM

Python excels in automation, scripting, and task management. 1) Automation: File backup is realized through standard libraries such as os and shutil. 2) Script writing: Use the psutil library to monitor system resources. 3) Task management: Use the schedule library to schedule tasks. Python's ease of use and rich library support makes it the preferred tool in these areas.

Python and Time: Making the Most of Your Study TimePython and Time: Making the Most of Your Study TimeApr 14, 2025 am 12:02 AM

To maximize the efficiency of learning Python in a limited time, you can use Python's datetime, time, and schedule modules. 1. The datetime module is used to record and plan learning time. 2. The time module helps to set study and rest time. 3. The schedule module automatically arranges weekly learning tasks.

Python: Games, GUIs, and MorePython: Games, GUIs, and MoreApr 13, 2025 am 12:14 AM

Python excels in gaming and GUI development. 1) Game development uses Pygame, providing drawing, audio and other functions, which are suitable for creating 2D games. 2) GUI development can choose Tkinter or PyQt. Tkinter is simple and easy to use, PyQt has rich functions and is suitable for professional development.

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
1 months agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
1 months agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. How to Fix Audio if You Can't Hear Anyone
1 months agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Chat Commands and How to Use Them
1 months agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

PhpStorm Mac version

PhpStorm Mac version

The latest (2018.2.1) professional PHP integrated development tool

Safe Exam Browser

Safe Exam Browser

Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.

SublimeText3 Linux new version

SublimeText3 Linux new version

SublimeText3 Linux latest version

MinGW - Minimalist GNU for Windows

MinGW - Minimalist GNU for Windows

This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.