Below I will share with you an example of using selenium to capture Taobao product information. It has a good reference value and I hope it will be helpful to everyone.
Taobao pages use a lot of js to load data, so it is easier to use selenium to crawl. As a testing tool, selenum is mainly used with the windowless browser phantomjs.
import re from selenium import webdriver from selenium.common.exceptions import TimeoutException from selenium.webdriver.common.by import By from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC from pyquery import PyQuery as pq ''' wait.until()语句是selenum里面的显示等待,wait是一个WebDriverWait对象,它设置了等待时间,如果页面在等待时间内 没有在 DOM中找到元素,将继续等待,超出设定时间后则抛出找不到元素的异常,也可以说程序每隔xx秒看一眼,如果条件 成立了,则执行下一步,否则继续等待,直到超过设置的最长时间,然后抛出TimeoutException 1.presence_of_element_located 元素加载出,传入定位元组,如(By.ID, 'p') 2.element_to_be_clickable 元素可点击 3.text_to_be_present_in_element 某个元素文本包含某文字 ''' # 定义一个无界面的浏览器 browser = webdriver.PhantomJS( service_args=[ '--load-images=false', '--disk-cache=true']) # 10s无响应就down掉 wait = WebDriverWait(browser, 10) #虽然无界面但是必须要定义窗口 browser.set_window_size(1400, 900) def search(): ''' 此函数的作用为完成首页点击搜索的功能,替换标签可用于其他网页使用 :return: ''' print('正在搜索') try: #访问页面 browser.get('https://www.taobao.com') # 选择到淘宝首页的输入框 input = wait.until( EC.presence_of_element_located((By.CSS_SELECTOR, '#q')) ) #搜索的那个按钮 submit = wait.until(EC.element_to_be_clickable( (By.CSS_SELECTOR, '#J_TSearchForm > p.search-button > button'))) #send_key作为写到input的内容 input.send_keys('面条') #执行点击搜索的操作 submit.click() #查看到当前的页码一共是多少页 total = wait.until(EC.presence_of_element_located( (By.CSS_SELECTOR, '#mainsrp-pager > p > p > p > p.total'))) #获取所有的商品 get_products() #返回总页数 return total.text except TimeoutException: return search() def next_page(page_number): ''' 翻页函数, :param page_number: :return: ''' print('正在翻页', page_number) try: #这个是我们跳转页的输入框 input = wait.until(EC.presence_of_element_located( (By.CSS_SELECTOR, '#mainsrp-pager > p > p > p > p.form > input'))) #跳转时的确定按钮 submit = wait.until( EC.element_to_be_clickable( (By.CSS_SELECTOR, '#mainsrp-pager > p > p > p > p.form > span.J_Submit'))) #清除里面的数字 input.clear() #重新输入数字 input.send_keys(page_number) #选择并点击 submit.click() #判断当前页是不是我们要现实的页 wait.until( EC.text_to_be_present_in_element( (By.CSS_SELECTOR, '#mainsrp-pager > p > p > p > ul > li.item.active > span'), str(page_number))) #调用函数获取商品信息 get_products() #捕捉超时,重新进入翻页的函数 except TimeoutException: next_page(page_number) def get_products(): ''' 搜到页面信息在此函数在爬取我们需要的信息 :return: ''' #每一个商品标签,这里是加载出来以后才会拿网页源代码 wait.until(EC.presence_of_element_located( (By.CSS_SELECTOR, '#mainsrp-itemlist .items .item'))) #这里拿到的是整个网页源代码 html = browser.page_source #pq解析网页源代码 doc = pq(html) items = doc('#mainsrp-itemlist .items .item').items() for item in items: # print(item) product = { 'image': item.find('.pic .img').attr('src'), 'price': item.find('.price').text(), 'deal': item.find('.deal-cnt').text()[:-3], 'title': item.find('.title').text(), 'shop': item.find('.shop').text(), 'location': item.find('.location').text() } print(product) def main(): try: #第一步搜索 total = search() #int类型刚才找到的总页数标签,作为跳出循环的条件 total = int(re.compile('(\d+)').search(total).group(1)) #只要后面还有就继续爬,继续翻页 for i in range(2, total + 1): next_page(i) except Exception: print('出错啦') finally: #关闭浏览器 browser.close() if __name__ == '__main__': main()
The above is what I compiled for everyone. I hope it will be helpful to everyone in the future.
Related articles:
Implementing a magnifying glass through jquery technology
How to implement Baidu index crawler using Puppeteer image recognition technology
The above is the detailed content of Use selenium to capture Taobao data information. For more information, please follow other related articles on the PHP Chinese website!

JavaScript is widely used in websites, mobile applications, desktop applications and server-side programming. 1) In website development, JavaScript operates DOM together with HTML and CSS to achieve dynamic effects and supports frameworks such as jQuery and React. 2) Through ReactNative and Ionic, JavaScript is used to develop cross-platform mobile applications. 3) The Electron framework enables JavaScript to build desktop applications. 4) Node.js allows JavaScript to run on the server side and supports high concurrent requests.

Python is more suitable for data science and automation, while JavaScript is more suitable for front-end and full-stack development. 1. Python performs well in data science and machine learning, using libraries such as NumPy and Pandas for data processing and modeling. 2. Python is concise and efficient in automation and scripting. 3. JavaScript is indispensable in front-end development and is used to build dynamic web pages and single-page applications. 4. JavaScript plays a role in back-end development through Node.js and supports full-stack development.

C and C play a vital role in the JavaScript engine, mainly used to implement interpreters and JIT compilers. 1) C is used to parse JavaScript source code and generate an abstract syntax tree. 2) C is responsible for generating and executing bytecode. 3) C implements the JIT compiler, optimizes and compiles hot-spot code at runtime, and significantly improves the execution efficiency of JavaScript.

JavaScript's application in the real world includes front-end and back-end development. 1) Display front-end applications by building a TODO list application, involving DOM operations and event processing. 2) Build RESTfulAPI through Node.js and Express to demonstrate back-end applications.

The main uses of JavaScript in web development include client interaction, form verification and asynchronous communication. 1) Dynamic content update and user interaction through DOM operations; 2) Client verification is carried out before the user submits data to improve the user experience; 3) Refreshless communication with the server is achieved through AJAX technology.

Understanding how JavaScript engine works internally is important to developers because it helps write more efficient code and understand performance bottlenecks and optimization strategies. 1) The engine's workflow includes three stages: parsing, compiling and execution; 2) During the execution process, the engine will perform dynamic optimization, such as inline cache and hidden classes; 3) Best practices include avoiding global variables, optimizing loops, using const and lets, and avoiding excessive use of closures.

Python is more suitable for beginners, with a smooth learning curve and concise syntax; JavaScript is suitable for front-end development, with a steep learning curve and flexible syntax. 1. Python syntax is intuitive and suitable for data science and back-end development. 2. JavaScript is flexible and widely used in front-end and server-side programming.

Python and JavaScript have their own advantages and disadvantages in terms of community, libraries and resources. 1) The Python community is friendly and suitable for beginners, but the front-end development resources are not as rich as JavaScript. 2) Python is powerful in data science and machine learning libraries, while JavaScript is better in front-end development libraries and frameworks. 3) Both have rich learning resources, but Python is suitable for starting with official documents, while JavaScript is better with MDNWebDocs. The choice should be based on project needs and personal interests.


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

SecLists
SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.

WebStorm Mac version
Useful JavaScript development tools

Atom editor mac version download
The most popular open source editor

EditPlus Chinese cracked version
Small size, syntax highlighting, does not support code prompt function

DVWA
Damn Vulnerable Web App (DVWA) is a PHP/MySQL web application that is very vulnerable. Its main goals are to be an aid for security professionals to test their skills and tools in a legal environment, to help web developers better understand the process of securing web applications, and to help teachers/students teach/learn in a classroom environment Web application security. The goal of DVWA is to practice some of the most common web vulnerabilities through a simple and straightforward interface, with varying degrees of difficulty. Please note that this software