Detailed explanation of a python3 Baidu Index crawling example
This article walks through crawling the Baidu Index with python3 and is shared here as a reference.
Scrape the Baidu Index, then use image recognition (OCR) to read the index values
Tufu once said that the Baidu Index is hard to scrape, and that on Taobao people charge 20 yuan per keyword:
There was no way I would be scared off by talk like that, so I spent about two and a half days getting it done. Shame on Tufu.
Quite a few libraries and tools need to be installed:
Google image recognition: tesseract-ocr
pip3 install pillow
pip3 install pyocr
selenium 2.45
Chrome 47.0.2526.106 m or Firefox 32.0.1
chromedriver.exe
You must log in to access the Baidu Index. The account name and password are stored in a text file, account.txt, one per line:
The general-purpose login code is as follows:
import time

from selenium import webdriver


# Open the browser and log in to Baidu
def openbrowser():
    global browser
    url = "https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F"
    # Open Chrome; use webdriver.Firefox() for Firefox instead
    browser = webdriver.Chrome()
    # Load the login page
    browser.get(url)
    # Optionally wait for the browser to finish opening
    # print("Waiting 10 seconds for the browser to open...")
    # time.sleep(10)
    # Find the input boxes (id="TANGRAM__PSP_3__userName" and
    # id="TANGRAM__PSP_3__password") and clear them
    browser.find_element_by_id("TANGRAM__PSP_3__userName").clear()
    browser.find_element_by_id("TANGRAM__PSP_3__password").clear()
    # Read the account name and password from account.txt
    account = []
    try:
        fileaccount = open("../baidu/account.txt")
        accounts = fileaccount.readlines()
        for acc in accounts:
            account.append(acc.strip())
        fileaccount.close()
    except Exception as err:
        print(err)
        input("Please write the account name and password into account.txt correctly")
        exit()
    browser.find_element_by_id("TANGRAM__PSP_3__userName").send_keys(account[0])
    browser.find_element_by_id("TANGRAM__PSP_3__password").send_keys(account[1])
    # Click the login button (id="TANGRAM__PSP_3__submit")
    browser.find_element_by_id("TANGRAM__PSP_3__submit").click()
    # Optionally wait 10 seconds for the login to finish
    # print("Waiting 10 seconds for login...")
    # time.sleep(10)
    print("Waiting for the page to finish loading...")
    select = input("Check the browser: has the site logged in? (y/n): ")
    while 1:
        if select == "y" or select == "Y":
            print("Login succeeded!")
            print("Preparing to open a new window...")
            # time.sleep(1)
            # browser.quit()
            break
        elif select == "n" or select == "N":
            selectno = input("Press 0 if the account or password was wrong, 1 if a CAPTCHA appeared: ")
            # Wrong account or password: clear the boxes and enter them again
            if selectno == "0":
                browser.find_element_by_id("TANGRAM__PSP_3__userName").clear()
                browser.find_element_by_id("TANGRAM__PSP_3__password").clear()
                # Re-read the account name and password
                account = []
                try:
                    fileaccount = open("../baidu/account.txt")
                    accounts = fileaccount.readlines()
                    for acc in accounts:
                        account.append(acc.strip())
                    fileaccount.close()
                except Exception as err:
                    print(err)
                    input("Please write the account name and password into account.txt correctly")
                    exit()
                browser.find_element_by_id("TANGRAM__PSP_3__userName").send_keys(account[0])
                browser.find_element_by_id("TANGRAM__PSP_3__password").send_keys(account[1])
                # Click the login button (id="TANGRAM__PSP_3__submit")
                browser.find_element_by_id("TANGRAM__PSP_3__submit").click()
            elif selectno == "1":
                # The CAPTCHA input box has id="ap_captcha_guess"
                input("Enter the CAPTCHA in the browser and log in...")
            # Ask again after either retry, so the loop can finish
            select = input("Check the browser: has the site logged in? (y/n): ")
        else:
            print('Please enter "y" or "n"!')
            select = input("Check the browser: has the site logged in? (y/n): ")
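The login code reads account.txt twice with the same few lines. As a minimal sketch, that step could be pulled into one helper. The names `parse_account` and `read_account` are my own, and skipping blank lines is an assumption I added. The original simply takes the first two lines of the file.

```python
def parse_account(lines):
    """Return (username, password) from the first two non-empty lines."""
    # Strip whitespace and drop blank lines (an assumption; the original
    # keeps every line and uses index 0 and 1 directly).
    cleaned = [line.strip() for line in lines]
    cleaned = [line for line in cleaned if line]
    if len(cleaned) < 2:
        raise ValueError("expected a username line and a password line")
    return cleaned[0], cleaned[1]


def read_account(path="../baidu/account.txt"):
    """Read the credentials file used by the login code."""
    # The with-block closes the file even if parsing raises.
    with open(path, encoding="utf-8") as fileaccount:
        return parse_account(fileaccount.readlines())
```

Both places that fill in the login form could then call `username, password = read_account()` and pass the two values to `send_keys`.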
Login page:
After logging in, open the Baidu Index in a new window and switch to that window with selenium:
# Open a new window by executing JavaScript
js = 'window.open("http://index.baidu.com");'
browser.execute_script(js)
# Switch to the new window to enter the Baidu Index
# window_handles is a list of the handles of all open windows
handles = browser.window_handles
# print(handles)
# Switch to the most recently opened window
browser.switch_to_window(handles[-1])
Clear the input box, search for the keyword, and click the link for the chosen number of days:
# Clear the input box
browser.find_element_by_id("schword").clear()
# Type in the keyword to look up in the Baidu Index
browser.find_element_by_id("schword").send_keys(keyword)
# Click search
# <input type="submit" value="" id="searchWords" onclick="searchDemoWords()">
browser.find_element_by_id("searchWords").click()
time.sleep(2)
# Maximize the window
browser.maximize_window()
# Choose the number of days
sel = int(input("Press 0 for 7 days, 1 for 30 days, 2 for 90 days, 3 for half a year: "))
day = 0
if sel == 0:
    day = 7
elif sel == 1:
    day = 30
elif sel == 2:
    day = 90
elif sel == 3:
    day = 180
sel = '//a[@rel="' + str(day) + '"]'
browser.find_element_by_xpath(sel).click()
# Too fast otherwise, so pause briefly
time.sleep(2)
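In the code above, an unexpected choice leaves day at 0, so the XPath looks for rel="0" and the click fails. Here is a small sketch of the same mapping as a lookup table that rejects unknown choices early. `DAY_CHOICES` and `day_link_xpath` are names I made up for illustration.

```python
# Menu choice -> number of days, mirroring the if/elif chain
DAY_CHOICES = {0: 7, 1: 30, 2: 90, 3: 180}


def day_link_xpath(sel):
    """Return (day, xpath) for a menu choice; raise ValueError if unknown."""
    if sel not in DAY_CHOICES:
        raise ValueError("choice must be one of " + str(sorted(DAY_CHOICES)))
    day = DAY_CHOICES[sel]
    # Same XPath string the original builds by concatenation
    return day, '//a[@rel="' + str(day) + '"]'
```

The returned XPath can be passed to `browser.find_element_by_xpath(...)` exactly as before, and `day` is still available for the later loop.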
The number of days is here:
Find the graphics box:
xoyelement = browser.find_elements_by_css_selector("#trend rect")[2]
The graphics box is:
Construct offsets based on different coordinate points:
In the 7-day view, observe the coordinates of two adjacent points:
The abscissa of the first point is 1031.66666
The abscissa of the second point is 1234
So adjacent points in the 7-day view are about 202.33 pixels apart.
from selenium.webdriver.common.action_chains import ActionChains
ActionChains(browser).move_to_element_with_offset(xoyelement, x_0, y_0).perform()
With an offset of (0, 0), however, the mouse lands at this position:
That is the upper-left corner of the rectangle, where the JavaScript that displays the pop-up box is not triggered, so add 1 to the abscissa:
x_0 = 1
y_0 = 0
Write a loop over the number of days and let the abscissa accumulate:
# Loop over the selected number of days
for i in range(day):
    # Step the abscissa according to the selected range
    if day == 7:
        x_0 = x_0 + 202.33
    elif day == 30:
        x_0 = x_0 + 41.68
    elif day == 90:
        x_0 = x_0 + 13.64
    elif day == 180:
        x_0 = x_0 + 6.78
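The step sizes in the loop can be derived rather than measured one by one. This is a rough sketch, assuming the points are evenly spaced and the chart keeps the same width for every range. I have not verified that on the page itself, and `x_step` and `x_offsets` are hypothetical names. The two abscissas observed in the 7-day view are 202.33 pixels apart, and 7 points span 6 gaps, so the same width is divided into (day - 1) gaps for the other ranges.

```python
def x_step(day, first_x=1031.66666, second_x=1234.0, observed_days=7):
    """Estimate the pixel distance between adjacent points for a range."""
    # Width covered by the points in the observed 7-day view:
    # observed_days points are separated by (observed_days - 1) gaps.
    span = (second_x - first_x) * (observed_days - 1)
    # The same width holds `day` points, i.e. (day - 1) gaps.
    return span / (day - 1)


def x_offsets(day, start=1.0):
    """List the x offsets to hover over, starting 1 pixel in from the left."""
    step = x_step(day)
    return [start + i * step for i in range(day)]
```

Under this assumption the 90-day and 180-day steps come out as 13.64 and 6.78, matching the loop. The 30-day step comes out as about 41.86 rather than 41.68, so that value may be worth measuring again.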
A box pops up as the mouse moves horizontally. Find this box in the page source:
Let selenium locate it automatically:
# <p class="imgtxt" style="margin-left:-117px;"></p>
imgelement = browser.find_element_by_xpath('//p[@id="viewbox"]')
And determine the size and position of this box:
# Get the position of the box
locations = imgelement.location
print(locations)
# Get the size of the box
sizes = imgelement.size
print(sizes)
# Build the region that contains the index: (left, upper, right, lower)
rangle = (int(locations['x']), int(locations['y']),
          int(locations['x'] + sizes['width']),
          int(locations['y'] + sizes['height']))
The captured region is:
The approach from here is:
1. Take a screenshot of the entire screen
2. Open the screenshot and use the coordinate range obtained above to crop it
But cropping with that range yields the whole black box shown above. What I want is:
So the range needs to be narrowed. Being lazy, I ignored the length of the search term and simply hard-coded the proportions:
# Build the region that contains the index: middle third, bottom half
rangle = (int(locations['x'] + sizes['width']/3),
          int(locations['y'] + sizes['height']/2),
          int(locations['x'] + sizes['width']*2/3),
          int(locations['y'] + sizes['height']))
from PIL import Image

# <p class="imgtxt" style="margin-left:-117px;"></p>
imgelement = browser.find_element_by_xpath('//p[@id="viewbox"]')
# Get the position of the box
locations = imgelement.location
print(locations)
# Get the size of the box
sizes = imgelement.size
print(sizes)
# Build the region that contains the index: middle third, bottom half
rangle = (int(locations['x'] + sizes['width']/3),
          int(locations['y'] + sizes['height']/2),
          int(locations['x'] + sizes['width']*2/3),
          int(locations['y'] + sizes['height']))
# Take a screenshot of the current browser window
path = "../baidu/" + str(num)
browser.save_screenshot(str(path) + ".png")
# Open the screenshot and crop it
img = Image.open(str(path) + ".png")
jpg = img.crop(rangle)
jpg.save(str(path) + ".jpg")
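To make the hard-coded proportions easier to check, here is the same arithmetic as a small function with made-up numbers. `crop_box` is a name I chose, and the sample location and size are invented for illustration, not measured from the page.

```python
def crop_box(location, size):
    """Return (left, upper, right, lower): middle third, bottom half of a box."""
    left = int(location["x"] + size["width"] / 3)
    upper = int(location["y"] + size["height"] / 2)
    right = int(location["x"] + size["width"] * 2 / 3)
    lower = int(location["y"] + size["height"])
    return (left, upper, right, lower)


# Invented example: a 240x60 box whose top-left corner is at (300, 200)
box = crop_box({"x": 300, "y": 200}, {"width": 240, "height": 60})
print(box)  # (380, 230, 460, 260)
```

The tuple order is the one `Image.crop` expects, so `img.crop(crop_box(locations, sizes))` would be equivalent to cropping with `rangle`.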
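# Enlarge the image to twice its size
# The original is 73 x 29 pixels
jpgzoom = Image.open(str(path) + ".jpg")
(x, y) = jpgzoom.size
x_s = 146
y_s = 58
# Image.LANCZOS is the same filter as the older name Image.ANTIALIAS,
# which newer Pillow releases no longer provide
out = jpgzoom.resize((x_s, y_s), Image.LANCZOS)
# The 'png' format argument means the file is PNG-encoded despite its .jpg name
out.save(path + 'zoom.jpg', 'png', quality=95)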
To check the original image size, right-click the file -> Properties -> Details. Mine is 73 pixels wide and 29 pixels high.
import pytesseract  # pip3 install pytesseract

# Image recognition
index = []
image = Image.open(str(path) + "zoom.jpg")
code = pytesseract.image_to_string(image)
if code:
    index.append(code)
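The recognized text is stored as a raw string. If numbers are wanted, one possible cleanup step is sketched below. I am assuming the pop-up shows the index as digits, possibly with thousands separators, and that OCR may add stray spaces or newlines. `parse_index` is a hypothetical helper, not part of pytesseract.

```python
import re


def parse_index(code):
    """Turn raw OCR text into an int, or None when no digits were found."""
    # Keep only the digits; commas, spaces and newlines are dropped.
    digits = re.sub(r"[^0-9]", "", code)
    return int(digits) if digits else None
```

This also throws away any character the OCR misread, for example a letter O in place of a zero, so it is worth comparing a few results against the screenshots by eye.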