Home > Article > Backend Development > Teach you step-by-step how to use Python web crawler to obtain the video selection content of Bilibili (source code attached)
When it comes to Bilibili, the first impression is the video. I believe that many friends, like me, want to use web crawler technology Get the videos from Station B, but the videos from Station B are actually not that easy to get. Guan is about getting the videos from Station B, which was previously available at The introduction is implemented through the you-get library. Interested friends can read this article: You-Get is so powerful! .
## Closer to home, friends who often study on Bilibili may often encounter bloggers who serialize dozens or even hundreds of videos, especially like this one For continuous tutorials on programming languages, courses, tool usage, etc., a selection series will appear, as shown in the figure below.
Of coursethe fields of theseselectionscan also be seen by our naked eyes. JustIf it is implemented through a program, it may not be as simple as imagined. So the goal of this article is to obtain video selections through Python web crawler technology and based on the selenium library.
The library we use in this article is selenium. This is a Although the library used to simulate user login feels slow, it is still used quite a lot in the field of web crawlers. It has been used repeatedly to simulate login and obtain data. Below is all the code to implement video selection collection. You are welcome to practice it yourself.
# coding: utf-8 from selenium import webdriver from selenium.webdriver.common.by import By from selenium.webdriver.support import expected_conditions as EC from selenium.webdriver.support.wait import WebDriverWait class Item: page_num = "" part = "" duration = "" def __init__(self, page_num, part, duration): self.page_num = page_num self.part = part self.duration = duration def get_second(self): str_list = self.duration.split(":") sum = 0 for i, item in enumerate(str_list): sum += pow(60, len(str_list) - i - 1) * int(item) return sum def get_bilili_page_items(url): options = webdriver.ChromeOptions() options.add_argument('--headless') # 设置无界面 options.add_experimental_option('excludeSwitches', ['enable-automation']) # options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2, # "profile.managed_default_content_settings.flash": 0}) browser = webdriver.Chrome(options=options) # browser = webdriver.PhantomJS() print("正在打开网页...") browser.get(url) print("等待网页响应...") # 需要等一下,直到页面加载完成 wait = WebDriverWait(browser, 10) wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@class="list-box"]/li/a'))) print("正在获取网页数据...") list = browser.find_elements_by_xpath('//*[@class="list-box"]/li') # print(list) itemList = [] second_sum = 0 # 2.循环遍历出每一条搜索结果的标题 for t in list: # print("t text:",t.text) element = t.find_element_by_tag_name('a') # print("a text:",element.text) arr = element.text.split('\n') print(" ".join(arr)) item = Item(arr[0], arr[1], arr[2]) second_sum += item.get_second() itemList.append(item) print("总数量:", len(itemList)) # browser.page_source print("总时长/分钟:", round(second_sum / 60, 2)) print("总时长/小时:", round(second_sum / 3600.0, 2)) browser.close() return itemList get_bilili_page_items("https://www.bilibili.com/video/BV1Eb411u7Fw")
The selector used here is xpath, and the video example is the Tongji Edition of "Advanced Mathematics" at Station B Full teaching video (Teacher Song Hao) video selection. If you want to grab other video selections, you only need to change the URL link in the last line of the above code.
在运行过程中小伙伴们应该会经常遇到这个问题,如下图所示。
这个是因为谷歌驱动版本问题导致的,只需要根据提示,去下载对应的驱动版本即可,驱动下载链接:
https://chromedriver.storage.googleapis.com/index.html
The above is the detailed content of Teach you step-by-step how to use Python web crawler to obtain the video selection content of Bilibili (source code attached). For more information, please follow other related articles on the PHP Chinese website!