search
HomeBackend DevelopmentPython TutorialPython crawler simulates logging into the Academic Affairs Office and saves data locally

Just started to get in touchPython, I saw many people playing crawlers I also want to play. After searching around, I found that the first thing many people do with web crawlers is to simulate login. To make it more difficult, they simulate login and then obtain data. However, there are very few Python 3.x demos on the Internet that can simulate login. For reference, and I don’t know much about Html, so writing this first Python crawler was extremely difficult, but the final result was satisfactory. Let’s sort out the learning process this time.

Tools

  • System: win7 64-bit system

  • Browser:Chrome

  • Python version: Python 3.5 64-bit

  • IDE: JetBrains PyCharm (seems like many people use this)

I aimed at our Academic Affairs Office. The purpose of this crawler is to obtain scores from the Academic Affairs Office and enter the scores into ExcelForm to save them. The address of our school’s Academic Affairs Office is: http:/ /jwc.ecjtu.jx.cn/, usually every time we get the results, we need to enter the Academic Affairs Office first, then click on the resultsquery, enter the public account password to enter, and finally enter the relevant information to obtain the score form, here Login does not require a verification code, which saved me a lot of effort, so we first enter the score query system login interface, first see how to simulate the login process, and press F12 in the Chrome browser to open the developer panel:

Python crawler simulates logging into the Academic Affairs Office and saves data locally

Developer Panel

Here the password for the inquiry system of our school’s Academic Affairs Office is the public jwc, which is the pinyin abbreviation. We enter the user name and password Click to log in. At this time, pay attention to POST request:

Python crawler simulates logging into the Academic Affairs Office and saves data locally

##Pay attention to post request

What was found, it seems Chrome did not retain the form information submitted by Post and directly jumped to another interface and then displayed the data of the other interface. Here we need to do it ourselves. Note that the small red dot in the upper left corner of the developer panel indicates that it is grabbing at this time. Get the data. If you click it, it will turn gray, and you can save the captured package in disguise. I clicked this little red dot before the new interface was refreshed after clicking login, and I got the Post form as I wished. Data:

Python crawler simulates logging into the Academic Affairs Office and saves data locally

Get post form data

In this way, the form data passed by the browser to the server when logging in is obtained. Take a look at this form. What are they:

Python crawler simulates logging into the Academic Affairs Office and saves data locally

View form data

Here we see that we need to pass three parameters, namely: user, pass, Submit, It is easy to understand the literal meaning of these words. With this idea, we can write the first step of this code:

Simulated login to the Academic Affairs Office Go directly to the code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests
url = 'http://jwc.ecjtu.jx.cn/mis_o/login.php'
datas = {'user': 'jwc',
         'pass': 'jwc',
         'Submit': '%CC%E1%BD%BB'
         }
headers = {'Referer': 'http://jwc.ecjtu.jx.cn/mis_o/login.htm',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
           'Accept-Language': 'zh-CN,zh;q=0.8',
           }
sessions = requests.session()
response = sessions.post(url, headers=headers, data=datas)
print(response.status_code)
Code output:

200
It means that our simulated login is successful. The Requests module is used here. If you don’t know how to use it, you can check the Chinese document. Its definition is: HTTP

for Humans, because it is easy to use and easy to use, we only need to pass in the Url address, construct the request header, and pass in the data required by the post method to simulate browser login. Since there is an operation to further obtain results, session is used here. Keep the connection. If we just look at the final return code here, we are successful. How to do it depends on the next step. Next:

Python crawler simulates logging into the Academic Affairs Office and saves data locally

Capture Packet

In order to simplify the code, we set up to enter the student number to query all scores and reduce other judgments. We also capture the Post data:

Python crawler simulates logging into the Academic Affairs Office and saves data locally

Yes Post data capture

Also check the Post data:

Python crawler simulates logging into the Academic Affairs Office and saves data locally

View the post data

Because here we only analyze and enter the student number So everything else is empty, so we can write the code to query the results:

    score_healders = {'Connection': 'keep-alive',
                      'User - Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
                      'Content - Type': 'application / x - www - form - urlencoded',
                      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                      'Content - Length': '69',
                      'Host': 'jwc.ecjtu.jx.cn',
                      'Referer': 'http: // jwc.ecjtu.jx.cn / mis_o / main.php',
                      'Upgrade - Insecure - Requests': '1',
                      'Accept - Language': 'zh - CN, zh;q = 0.8'
                      }
    score_url = 'http://jwc.ecjtu.jx.cn/mis_o/query.php?start=' + str(
        pagenum) + '&job=see&=&Name=&Course=&ClassID=&Term=&StuID=' + num
    score_data = {'Name': '',
                  'StuID': num,
                  'Course': '',
                  'Term': '',
                  'ClassID': '',
                  'Submit': '%B2%E9%D1%AF'
                  }

    score_response = sessions.post(score_url, data=score_data, headers=score_healders)
    content = score_response.content

这里解释一下上面的代码,上面的score_url 并不是浏览器上显示的地址,我们要获取真正的地址,在Chrome下右键--查看网页源代码,找到这么一行:

a href=query.php?start=1&job=see&=&Name=&Course=&ClassID=&Term=&StuID=xxxxxxx

这个才是真正的地址,点击这个地址转入的才是真正的界面,因为这里成绩数据较多,所以这里采用了分页显示,这个start=1说明是第一页,这个参数是可变的需要我们传入,还有StuID后面的是我们输入的学号,这样我们就可以拼接出Url地址:

score_url = 'http://jwc.ecjtu.jx.cn/mis_o/query.php?start=' + str(pagenum) + '&job=see&=&Name=&Course=&ClassID=&Term=&StuID=' + num

同样使用Post方法传递数据并获取响应的内容:

score_response = sessions.post(score_url, data=score_data,headers=score_healders)
content = score_response.content

这里采用Beautiful Soup 4.2.0来解析返回的响应内容,因为我们要获取的是成绩,这里到教务处成绩查询界面,查看获取到的成绩在网页中是以表格的形式存在:

Python crawler simulates logging into the Academic Affairs Office and saves data locally

网页源代码


观察表格的网页源代码:



...
...
学期 学号 姓名 课程 课程要求 学分 成绩 重考一 重考二

这里拿出第一行举例,虽然我不太懂Html但是从这里可以看出来<tr> 代表的是一行,而<code><td>应该是代表这一行中的每一列,这样就好办了,取出每一行然后分解出每一列,打印输出就可以得到我们要的结果:<pre class="brush:php;toolbar:false">from bs4 import BeautifulSoup soup = BeautifulSoup(content, 'html.parser') # 找到每一行 target = soup.findAll('tr')</pre> <p>这里分解每一列的时候要小心,因为这里表格分成了三页显示,每页最多显示30条数据,这里因为只是收集已经毕业的学生的成绩数据所以不对其他数据量不足的学生成绩的情况做统计,默认收集的都是大四毕业的学生成绩数据。这里采用两个<a href="http://www.php.cn/wiki/70.html" target="_blank">变量</a><code>ij分别代表行和列:

# 注:这里的print单纯是我为了验证结果打印在PyCharm的控制台上而已
i=0, j=0
for tag in target[1:]:
            tds = tag.findAll('td')
            # 每一次都是从列头开始获取
            j = 0
            # 学期
            semester = str(tds[0].string)
            if semester == 'None':
                break
            else:
                print(semester.ljust(6) + '\t\t\t', end='')
            # 学号
            studentid = tds[1].string
            print(studentid.ljust(14) + '\t\t\t', end='')
            j += 1
            # 姓名
            name = tds[2].string
            print(name.ljust(3) + '\t\t\t', end='')
            j += 1
            # 课程
            course = tds[3].string
            print(course.ljust(20, ' ') + '\t\t\t', end='')
            j += 1
            # 课程要求
            requirments = tds[4].string
            print(requirments.ljust(10, ' ') + '\t\t', end='')
            j += 1
            # 学分
            scredit = tds[5].string
            print(scredit.ljust(2, ' ') + '\t\t', end='')
            j += 1
            # 成绩
            achievement = tds[6].string
            print(achievement.ljust(2) + '\t\t', end='')
            j += 1
            # 重考一
            reexaminef = tds[7].string
            print(reexaminef.ljust(2) + '\t\t', end='')
            j += 1
            # 重考二
            reexamines = tds[8].string
            print(reexamines.ljust(2) + '\t\t')
            j += 1
            i += 1

这里查了很多别人的博客都是用正则表达式来分解数据,表示自己的正则写的并不好也尝试了但是没成功,所以无奈选择这种方式,如果有人有测试成功的正则欢迎跟我说一声,我也学习学习。

把数据保存到Excel

因为已经清楚了这个网页保存成绩的具体结构,所以顺着每次循环解析将数据不断加以保存就是了,这里使用xlwt写入数据到Excel,因为xlwt模块打印输出到Excel中的样式宽度偏小,影响观看,所以这里还加入了一个方法去控制打印到Excel表格中的样式:

file = xlwt.Workbook(encoding='utf-8')
table = file.add_sheet('achieve')
# 设置Excel样式
def set_style(name, height, bold=False):
    style = xlwt.XFStyle()  # 初始化样式
    font = xlwt.Font()  # 为样式创建字体
    font.name = name  # 'Times New Roman'
    font.bold = bold
    font.color_index = 4
    font.height = height
    style.font = font
    return style

运用到代码中:

for tag in target[1:]:
            tds = tag.findAll('td')
            j = 0
            # 学期
            semester = str(tds[0].string)
            if semester == 'None':
                break
            else:
                print(semester.ljust(6) + '\t\t\t', end='')
                table.write(i, j, semester, set_style('Arial', 220))
            # 学号
            studentid = tds[1].string
            print(studentid.ljust(14) + '\t\t\t', end='')
            j += 1
            table.write(i, j, studentid, set_style('Arial', 220))
            table.col(i).width = 256 * 16
            # 姓名
            name = tds[2].string
            print(name.ljust(3) + '\t\t\t', end='')
            j += 1
            table.write(i, j, name, set_style('Arial', 220))
            # 课程
            course = tds[3].string
            print(course.ljust(20, ' ') + '\t\t\t', end='')
            j += 1
            table.write(i, j, course, set_style('Arial', 220))
            # 课程要求
            requirments = tds[4].string
            print(requirments.ljust(10, ' ') + '\t\t', end='')
            j += 1
            table.write(i, j, requirments, set_style('Arial', 220))
            # 学分
            scredit = tds[5].string
            print(scredit.ljust(2, ' ') + '\t\t', end='')
            j += 1
            table.write(i, j, scredit, set_style('Arial', 220))
            # 成绩
            achievement = tds[6].string
            print(achievement.ljust(2) + '\t\t', end='')
            j += 1
            table.write(i, j, achievement, set_style('Arial', 220))
            # 重考一
            reexaminef = tds[7].string
            print(reexaminef.ljust(2) + '\t\t', end='')
            j += 1
            table.write(i, j, reexaminef, set_style('Arial', 220))
            # 重考二
            reexamines = tds[8].string
            print(reexamines.ljust(2) + '\t\t')
            j += 1
            table.write(i, j, reexamines, set_style('Arial', 220))
            i += 1

file.save('demo.xls')

最后稍加整合,写成一个方法:

# 获取成绩
# 这里num代表输入的学号,pagenum代表页数,总共76条数据,一页30条所以总共有三页
def getScore(num, pagenum, i, j):
    score_healders = {'Connection': 'keep-alive',
                      'User - Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
                      'Content - Type': 'application / x - www - form - urlencoded',
                      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                      'Content - Length': '69',
                      'Host': 'jwc.ecjtu.jx.cn',
                      'Referer': 'http: // jwc.ecjtu.jx.cn / mis_o / main.php',
                      'Upgrade - Insecure - Requests': '1',
                      'Accept - Language': 'zh - CN, zh;q = 0.8'
                      }
    score_url = 'http://jwc.ecjtu.jx.cn/mis_o/query.php?start=' + str(
        pagenum) + '&job=see&=&Name=&Course=&ClassID=&Term=&StuID=' + num
    score_data = {'Name': '',
                  'StuID': num,
                  'Course': '',
                  'Term': '',
                  'ClassID': '',
                  'Submit': '%B2%E9%D1%AF'
                  }

    score_response = sessions.post(score_url, data=score_data, headers=score_healders)
    # 输出到文本
    with open('text.txt', 'wb') as f:
        f.write(score_response.content)
    content = score_response.content
    soup = BeautifulSoup(content, 'html.parser')
    target = soup.findAll('tr')
    try:
        for tag in target[1:]:
            tds = tag.findAll('td')
            j = 0
            # 学期
            semester = str(tds[0].string)
            if semester == 'None':
                break
            else:
                print(semester.ljust(6) + '\t\t\t', end='')
                table.write(i, j, semester, set_style('Arial', 220))
            # 学号
            studentid = tds[1].string
            print(studentid.ljust(14) + '\t\t\t', end='')
            j += 1
            table.write(i, j, studentid, set_style('Arial', 220))
            table.col(i).width = 256 * 16
            # 姓名
            name = tds[2].string
            print(name.ljust(3) + '\t\t\t', end='')
            j += 1
            table.write(i, j, name, set_style('Arial', 220))
            # 课程
            course = tds[3].string
            print(course.ljust(20, ' ') + '\t\t\t', end='')
            j += 1
            table.write(i, j, course, set_style('Arial', 220))
            # 课程要求
            requirments = tds[4].string
            print(requirments.ljust(10, ' ') + '\t\t', end='')
            j += 1
            table.write(i, j, requirments, set_style('Arial', 220))
            # 学分
            scredit = tds[5].string
            print(scredit.ljust(2, ' ') + '\t\t', end='')
            j += 1
            table.write(i, j, scredit, set_style('Arial', 220))
            # 成绩
            achievement = tds[6].string
            print(achievement.ljust(2) + '\t\t', end='')
            j += 1
            table.write(i, j, achievement, set_style('Arial', 220))
            # 重考一
            reexaminef = tds[7].string
            print(reexaminef.ljust(2) + '\t\t', end='')
            j += 1
            table.write(i, j, reexaminef, set_style('Arial', 220))
            # 重考二
            reexamines = tds[8].string
            print(reexamines.ljust(2) + '\t\t')
            j += 1
            table.write(i, j, reexamines, set_style('Arial', 220))
            i += 1
    except:
        print('出了一点小Bug')
    file.save('demo.xls')

在模拟登陆操作后增加一个判断:

# 判断是否登陆
def isLogin(num):
    return_code = response.status_code
    if return_code == 200:
        if re.match(r"^\d{14}$", num):
            print('请稍等')
        else:
            print('请输入正确的学号')
        return True
    else:
        return False

最后在main中这么调用:

if name == 'main':
    num = input('请输入你的学号:')
    if isLogin(num):
        getScore(num, pagenum=0, i=0, j=0)
        getScore(num, pagenum=1, i=31, j=0)
        getScore(num, pagenum=2, i=62, j=0)

在PyCharm下按alt+shift+x快捷键运行程序:

Python crawler simulates logging into the Academic Affairs Office and saves data locally

控制台输出

控制台会有如下输出(这里只截取部分,不要吐槽没有对齐,这里我也用了格式化输出还是不太行,不过最起码出来了结果,而且我们的目的是输出到Excel中不是吗)

Python crawler simulates logging into the Academic Affairs Office and saves data locally

控制台输出

然后去程序根目录找看看有没有生成一个叫demo.xls的文件,我的程序就放在桌面,所以去桌面找:

Python crawler simulates logging into the Academic Affairs Office and saves data locally

桌面图标

点开查看是否成功获取:

Python crawler simulates logging into the Academic Affairs Office and saves data locally

最终获取结果

至此,大功告成

The above is the detailed content of Python crawler simulates logging into the Academic Affairs Office and saves data locally. For more information, please follow other related articles on the PHP Chinese website!

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
详细讲解Python之Seaborn(数据可视化)详细讲解Python之Seaborn(数据可视化)Apr 21, 2022 pm 06:08 PM

本篇文章给大家带来了关于Python的相关知识,其中主要介绍了关于Seaborn的相关问题,包括了数据可视化处理的散点图、折线图、条形图等等内容,下面一起来看一下,希望对大家有帮助。

详细了解Python进程池与进程锁详细了解Python进程池与进程锁May 10, 2022 pm 06:11 PM

本篇文章给大家带来了关于Python的相关知识,其中主要介绍了关于进程池与进程锁的相关问题,包括进程池的创建模块,进程池函数等等内容,下面一起来看一下,希望对大家有帮助。

Python自动化实践之筛选简历Python自动化实践之筛选简历Jun 07, 2022 pm 06:59 PM

本篇文章给大家带来了关于Python的相关知识,其中主要介绍了关于简历筛选的相关问题,包括了定义 ReadDoc 类用以读取 word 文件以及定义 search_word 函数用以筛选的相关内容,下面一起来看一下,希望对大家有帮助。

归纳总结Python标准库归纳总结Python标准库May 03, 2022 am 09:00 AM

本篇文章给大家带来了关于Python的相关知识,其中主要介绍了关于标准库总结的相关问题,下面一起来看一下,希望对大家有帮助。

分享10款高效的VSCode插件,总有一款能够惊艳到你!!分享10款高效的VSCode插件,总有一款能够惊艳到你!!Mar 09, 2021 am 10:15 AM

VS Code的确是一款非常热门、有强大用户基础的一款开发工具。本文给大家介绍一下10款高效、好用的插件,能够让原本单薄的VS Code如虎添翼,开发效率顿时提升到一个新的阶段。

python中文是什么意思python中文是什么意思Jun 24, 2019 pm 02:22 PM

pythn的中文意思是巨蟒、蟒蛇。1989年圣诞节期间,Guido van Rossum在家闲的没事干,为了跟朋友庆祝圣诞节,决定发明一种全新的脚本语言。他很喜欢一个肥皂剧叫Monty Python,所以便把这门语言叫做python。

Python数据类型详解之字符串、数字Python数据类型详解之字符串、数字Apr 27, 2022 pm 07:27 PM

本篇文章给大家带来了关于Python的相关知识,其中主要介绍了关于数据类型之字符串、数字的相关问题,下面一起来看一下,希望对大家有帮助。

详细介绍python的numpy模块详细介绍python的numpy模块May 19, 2022 am 11:43 AM

本篇文章给大家带来了关于Python的相关知识,其中主要介绍了关于numpy模块的相关问题,Numpy是Numerical Python extensions的缩写,字面意思是Python数值计算扩展,下面一起来看一下,希望对大家有帮助。

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
2 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
Repo: How To Revive Teammates
4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
Hello Kitty Island Adventure: How To Get Giant Seeds
4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SecLists

SecLists

SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.

MantisBT

MantisBT

Mantis is an easy-to-deploy web-based defect tracking tool designed to aid in product defect tracking. It requires PHP, MySQL and a web server. Check out our demo and hosting services.

mPDF

mPDF

mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),

ZendStudio 13.5.1 Mac

ZendStudio 13.5.1 Mac

Powerful PHP integrated development environment