Detailed explanation of python3 Baidu index crawling example
This article walks through scraping the Baidu Index with Selenium and reading the values with image recognition; it may serve as a useful reference.
Capture the Baidu Index chart, then use image recognition to read off the index values.
Foreword:
Tufu once claimed that the Baidu Index is hard to scrape, and that on Taobao people charge 20 yuan per keyword for it. I wasn't about to be scared off by such big talk, so I spent about two and a half days getting it done. I despise Tufu.
Quite a few libraries and tools need to be installed:
tesseract-ocr (Google's OCR engine), pip3 install pillow, pip3 install pyocr, selenium 2.45, Chrome 47.0.2526.106 m or Firefox 32.0.1, and chromedriver.exe (note: the recognition snippet at the end calls pytesseract, so pip3 install pytesseract as well)
The Baidu Index requires login. The account and password are read from a plain-text file, account.txt.
The login code is as follows:
# Open the browser
def openbrowser():
    global browser
    url = "https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F"
    # Launch Chrome; Firefox() works as well
    browser = webdriver.Chrome()
    # Load the login page
    browser.get(url)
    # Clear the username and password fields
    # (the username field has id="TANGRAM__PSP_3__userName")
    browser.find_element_by_id("TANGRAM__PSP_3__userName").clear()
    browser.find_element_by_id("TANGRAM__PSP_3__password").clear()
    # Read the account and password from file
    account = []
    try:
        fileaccount = open("../baidu/account.txt")
        accounts = fileaccount.readlines()
        for acc in accounts:
            account.append(acc.strip())
        fileaccount.close()
    except Exception as err:
        print(err)
        input("Please write the account and password into account.txt")
        exit()
    browser.find_element_by_id("TANGRAM__PSP_3__userName").send_keys(account[0])
    browser.find_element_by_id("TANGRAM__PSP_3__password").send_keys(account[1])
    # Click the login button (id="TANGRAM__PSP_3__submit")
    browser.find_element_by_id("TANGRAM__PSP_3__submit").click()
    print("Waiting for the page to finish loading...")
    select = input("Check the browser: are you logged in? (y/n): ")
    while 1:
        if select == "y" or select == "Y":
            print("Login succeeded!")
            print("Preparing to open a new window...")
            break
        elif select == "n" or select == "N":
            selectno = input("Press 0 for wrong account/password, 1 if a captcha appeared...")
            if selectno == "0":
                # Clear the fields and re-enter the credentials
                browser.find_element_by_id("TANGRAM__PSP_3__userName").clear()
                browser.find_element_by_id("TANGRAM__PSP_3__password").clear()
                account = []
                try:
                    fileaccount = open("../baidu/account.txt")
                    accounts = fileaccount.readlines()
                    for acc in accounts:
                        account.append(acc.strip())
                    fileaccount.close()
                except Exception as err:
                    print(err)
                    input("Please write the account and password into account.txt")
                    exit()
                browser.find_element_by_id("TANGRAM__PSP_3__userName").send_keys(account[0])
                browser.find_element_by_id("TANGRAM__PSP_3__password").send_keys(account[1])
                browser.find_element_by_id("TANGRAM__PSP_3__submit").click()
                select = input("Check the browser: are you logged in? (y/n): ")
            elif selectno == "1":
                # The captcha input has id="ap_captcha_guess"
                input("Please enter the captcha in the browser and log in...")
                select = input("Check the browser: are you logged in? (y/n): ")
        else:
            print('Please enter "y" or "n"!')
            select = input("Check the browser: are you logged in? (y/n): ")
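The account-file reading appears twice in the function above; it could be factored into a single helper. A minimal sketch, assuming the same two-line account.txt format (username on line 1, password on line 2); the `read_account` name is mine, not the article's:

```python
def read_account(path="../baidu/account.txt"):
    """Read the username (line 1) and password (line 2) from a text file."""
    with open(path) as f:
        # strip() removes trailing newlines; blank lines are skipped
        return [line.strip() for line in f if line.strip()]
```

The `with` block also guarantees the file is closed, so the explicit `fileaccount.close()` calls become unnecessary.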
Login page:
After logging in, open the Baidu Index in a new window and switch to it with Selenium:
# Open a new window by executing a snippet of JavaScript
js = 'window.open("http://index.baidu.com");'
browser.execute_script(js)
# window_handles holds the handles of every open window, as a list
handles = browser.window_handles
# print(handles)
# Switch to the most recently opened window
browser.switch_to_window(handles[-1])
Clear the search box, enter the keyword, and pick the date range:
# Clear the search box
browser.find_element_by_id("schword").clear()
# Type the keyword whose Baidu Index we want
browser.find_element_by_id("schword").send_keys(keyword)
# Click search
# <input type="submit" value="" id="searchWords" onclick="searchDemoWords()">
browser.find_element_by_id("searchWords").click()
time.sleep(2)
# Maximize the window
browser.maximize_window()
# Choose the date range
sel = int(input("Press 0 for 7 days, 1 for 30 days, 2 for 90 days, 3 for half a year: "))
day = 0
if sel == 0:
    day = 7
elif sel == 1:
    day = 30
elif sel == 2:
    day = 90
elif sel == 3:
    day = 180
sel = '//a[@rel="' + str(day) + '"]'
browser.find_element_by_xpath(sel).click()
# Give the chart time to redraw
time.sleep(2)
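The if/elif chain above can be collapsed into a lookup table; a minimal sketch (the `DAY_CHOICES` table and `days_for_choice` helper are mine, not the article's):

```python
# Map the menu choice (0-3) to the number of days shown on the chart
DAY_CHOICES = {0: 7, 1: 30, 2: 90, 3: 180}

def days_for_choice(sel):
    """Return the day count for a menu selection; reject anything else."""
    try:
        return DAY_CHOICES[int(sel)]
    except (KeyError, ValueError):
        raise ValueError("choice must be 0, 1, 2 or 3")
```

This also validates the input, whereas the original silently leaves `day = 0` for an out-of-range choice and then clicks a nonexistent `//a[@rel="0"]` link.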
The number of days is here:
Find the graphics box:
xoyelement = browser.find_elements_by_css_selector("#trend rect")[2]
The graphics box is:
Construct mouse offsets from the chart's coordinate points. Observing the 7-day view:
The abscissa of the first point is 1031.66666
The abscissa of the second point is 1234
so adjacent data points sit about 202.33 pixels apart.
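The per-point step used in the loop further down follows directly from these two observations; a quick check:

```python
# Abscissas of two adjacent points on the 7-day chart (values from the article)
x_first = 1031.66666
x_second = 1234
step = x_second - x_first
print(round(step, 2))  # → 202.33
```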
# Move the mouse to an offset (x_0, y_0) inside the chart element
from selenium.webdriver.common.action_chains import ActionChains
ActionChains(browser).move_to_element_with_offset(xoyelement, x_0, y_0).perform()
But with a zero offset, the pointer definitely lands at this position:
which is the upper-left corner of the rectangle. The page's JS does not fire there, so no pop-up box appears; hence add 1 to the abscissa:
x_0 = 1
y_0 = 0
Write a loop over the selected number of days, accumulating the abscissa each iteration:
# Loop over the selected number of days
for i in range(day):
    # The step size depends on the chosen date range
    if day == 7:
        x_0 = x_0 + 202.33
    elif day == 30:
        x_0 = x_0 + 41.68
    elif day == 90:
        x_0 = x_0 + 13.64
    elif day == 180:
        x_0 = x_0 + 6.78
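The accumulated offsets can also be computed up front as a pure function, which makes the step sizes easy to sanity-check without a browser; a sketch (the `x_offsets` helper is mine, the step constants are the article's):

```python
# Per-day x step for each date range, matching the loop above
STEPS = {7: 202.33, 30: 41.68, 90: 13.64, 180: 6.78}

def x_offsets(day, x_start=1):
    """Return the x offset used on each of the `day` mouse moves."""
    step = STEPS[day]
    return [x_start + step * (i + 1) for i in range(day)]
```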
A box pops up as the mouse moves horizontally across the chart. Find this box in the page source:
Locate the box with Selenium:
# <p class="imgtxt" style="margin-left:-117px;"></p>
imgelement = browser.find_element_by_xpath('//p[@id="viewbox"]')
And determine the size and position of this box:
# Get the element's coordinates
locations = imgelement.location
print(locations)
# Get the element's size
sizes = imgelement.size
print(sizes)
# Build the crop rectangle around the index value
rangle = (int(locations['x']),
          int(locations['y']),
          int(locations['x'] + sizes['width']),
          int(locations['y'] + sizes['height']))
The intercepted graphic is:
The following idea is:
1. Take a screenshot of the entire screen
2. Open the screenshot and use the coordinate range obtained above to crop it
But that crop yields the whole black frame above, while the effect I want is:
So the range needs narrowing. Being lazy, I ignore the length of the search term and just hard-code the fractions:
# Build the crop rectangle around the index value:
# middle third horizontally, bottom half vertically
rangle = (int(locations['x'] + sizes['width'] / 3),
          int(locations['y'] + sizes['height'] / 2),
          int(locations['x'] + sizes['width'] * 2 / 3),
          int(locations['y'] + sizes['height']))
This hack is not great: at the very least the keyword length should be checked, because a long keyword shifts the screenshot coordinates. I know how to fix it; I'm just not writing it out for you!
The complete code for this step is:
# <p class="imgtxt" style="margin-left:-117px;"></p>
imgelement = browser.find_element_by_xpath('//p[@id="viewbox"]')
# Get the element's coordinates
locations = imgelement.location
print(locations)
# Get the element's size
sizes = imgelement.size
print(sizes)
# Build the crop rectangle around the index value
rangle = (int(locations['x'] + sizes['width'] / 3),
          int(locations['y'] + sizes['height'] / 2),
          int(locations['x'] + sizes['width'] * 2 / 3),
          int(locations['y'] + sizes['height']))
# Screenshot the current browser window
path = "../baidu/" + str(num)
browser.save_screenshot(str(path) + ".png")
# Open the screenshot and crop it
img = Image.open(str(path) + ".png")
jpg = img.crop(rangle)
jpg.save(str(path) + ".jpg")
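The crop arithmetic is easy to get wrong, so it helps to isolate it in a testable helper; a sketch (the `index_crop_box` name is mine) implementing the same middle-third, bottom-half rule:

```python
def index_crop_box(location, size):
    """Crop box (left, upper, top, lower are Pillow's crop order) for the
    index digits: middle third horizontally, bottom half vertically."""
    x, y = location['x'], location['y']
    w, h = size['width'], size['height']
    return (int(x + w / 3), int(y + h / 2), int(x + w * 2 / 3), int(y + h))
```

The dicts passed in have the same shape as Selenium's `element.location` and `element.size`, so the result drops straight into `img.crop(...)`.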
But the cropped image turned out to be too small and the recognition accuracy too low, so the image needs to be enlarged:
# Double the image size
# The original is 73 x 29 pixels
jpgzoom = Image.open(str(path) + ".jpg")
(x, y) = jpgzoom.size
x_s = 146
y_s = 58
out = jpgzoom.resize((x_s, y_s), Image.ANTIALIAS)
out.save(path + 'zoom.jpg', 'png', quality=95)
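The 146 x 58 target is just the 73 x 29 original doubled; a small generic helper (mine, not the article's) avoids the hard-coded numbers when the crop size changes:

```python
def zoomed_size(width, height, scale=2):
    """Enlarged (width, height) to feed into the OCR step."""
    return (width * scale, height * scale)
```

Usage: `out = jpgzoom.resize(zoomed_size(x, y), Image.ANTIALIAS)`.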
To check the original size, right-click the file -> Properties -> Details; mine is 73 pixels wide and 29 pixels tall.
Finally, the image recognition:
# Image recognition
index = []
image = Image.open(str(path) + "zoom.jpg")
code = pytesseract.image_to_string(image)
if code:
    index.append(code)
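Tesseract output often contains stray whitespace or thousands separators, so it is worth normalizing before storing; a hedged sketch (the `parse_index` helper is mine, not part of the article's code):

```python
import re

def parse_index(ocr_text):
    """Extract the integer index from raw OCR output, or None if unreadable."""
    digits = re.sub(r"[^\d]", "", ocr_text)
    return int(digits) if digits else None
```

With this, `index.append(parse_index(code))` stores clean integers instead of raw OCR strings.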
The final result:
That is all for this article; I hope it helps with your learning.