
Selenium captures Taobao product information

小云云 | Original | 2018-02-06 15:16:04

Taobao pages load much of their data with JavaScript, so they are easier to crawl with Selenium, which drives a real browser and sees the page after its scripts have run. Selenium is primarily a testing tool, and for crawling it is mostly used together with the headless browser PhantomJS. This article shares an example of using Selenium to capture Taobao product information.
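Because the product data arrives through JavaScript after the initial page load, the script below leans on Selenium's explicit waits. As a rough illustration of what such a wait does internally, here is a minimal plain-Python sketch of a poll-until-timeout loop. The names `wait_until`, `WaitTimeout`, and the `poll` parameter are hypothetical and are not Selenium's API; the real `WebDriverWait.until` raises `TimeoutException` instead.

```python
import time


class WaitTimeout(Exception):
    """Hypothetical stand-in for Selenium's TimeoutException."""


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Simulate an element that only "appears" on the third check.
checks = {"count": 0}


def element_loaded():
    checks["count"] += 1
    return "element" if checks["count"] >= 3 else None


found = wait_until(element_loaded, timeout=2.0, poll=0.01)
print(found, "after", checks["count"], "checks")
```

The point of the sketch is the control flow: the caller blocks until the condition holds or the deadline passes, which is why the script can read the page only after the elements it needs exist.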

import re
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from pyquery import PyQuery as pq
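'''
wait.until() is Selenium's explicit wait. wait is a WebDriverWait object
configured with a maximum wait time. It checks the condition at regular
intervals: once the condition holds, execution moves on to the next step;
otherwise it keeps waiting until the maximum time is exceeded and then
raises TimeoutException.
1. presence_of_element_located: the element is present in the DOM; takes a locator tuple such as (By.ID, 'p')
2. element_to_be_clickable: the element can be clicked
3. text_to_be_present_in_element: the element's text contains the given string
'''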
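# Create a headless (windowless) browser.
# Note: PhantomJS support was removed in Selenium 4, so this needs Selenium 3.x.
browser = webdriver.PhantomJS(
    service_args=[
        '--load-images=false',
        '--disk-cache=true'])
# Any wait gives up after 10 seconds without a match
wait = WebDriverWait(browser, 10)
# Even without a visible window, a window size must still be set
browser.set_window_size(1400, 900)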

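def search():
    '''
    Run a search from the Taobao home page. Swap the selectors to reuse
    this function on other sites.
    :return: text of the pager element that shows the total page count
    '''
    print('Searching')
    try:
        # Open the home page
        browser.get('https://www.taobao.com')
        # Locate the search box on the Taobao home page
        # (selectors follow Taobao's markup at the time of writing)
        search_box = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, '#q'))
        )
        # Locate the search button
        submit = wait.until(EC.element_to_be_clickable(
            (By.CSS_SELECTOR, '#J_TSearchForm > div.search-button > button')))
        # Type the search keyword ('面条' means noodles)
        search_box.send_keys('面条')
        # Click to run the search
        submit.click()
        # Find the element showing how many result pages there are
        total = wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, '#mainsrp-pager > div > div > div > div.total')))
        # Scrape the products on the first result page
        get_products()
        # Return the total-pages text
        return total.text
    except TimeoutException:
        # Timed out: start the search again
        return search()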

def next_page(page_number):
    '''
    Jump to the given result page and scrape it.
    :param page_number: page number to jump to
    :return:
    '''
    print('Turning to page', page_number)
    try:
        # Input box for the page number to jump to
        page_input = wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, '#mainsrp-pager > div > div > div > div.form > input')))
        # Confirm button for the jump
        submit = wait.until(
            EC.element_to_be_clickable(
                (By.CSS_SELECTOR,
                 '#mainsrp-pager > div > div > div > div.form > span.J_Submit')))
        # Clear the number currently in the box
        page_input.clear()
        # Enter the new page number
        page_input.send_keys(str(page_number))
        # Click confirm
        submit.click()
        # Wait until the highlighted page number matches the requested page
        wait.until(
            EC.text_to_be_present_in_element(
                (By.CSS_SELECTOR,
                 '#mainsrp-pager > div > div > div > ul > li.item.active > span'),
                str(page_number)))
        # Scrape the products on this page
        get_products()
    except TimeoutException:
        # Timed out: try the same page again
        next_page(page_number)

def get_products():
    '''
    Scrape the fields we need from the current search results page.
    :return:
    '''
    # Wait until the product items have loaded before reading the page source
    wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, '#mainsrp-itemlist .items .item')))
    # Source of the whole rendered page
    html = browser.page_source
    # Parse the page source with pyquery
    doc = pq(html)
    items = doc('#mainsrp-itemlist .items .item').items()
    for item in items:
        # print(item)
        product = {
            'image': item.find('.pic .img').attr('src'),
            'price': item.find('.price').text(),
            # [:-3] drops the last three characters, leaving the count
            'deal': item.find('.deal-cnt').text()[:-3],
            'title': item.find('.title').text(),
            'shop': item.find('.shop').text(),
            'location': item.find('.location').text()
        }
        print(product)

def main():
    try:
        # Step 1: run the search
        total = search()
        # Extract the total page count as an int; it bounds the loop below
        total = int(re.compile(r'(\d+)').search(total).group(1))
        # Keep turning pages until the last one
        for i in range(2, total + 1):
            next_page(i)
    except Exception:
        print('Something went wrong')
    finally:
        # Shut down the browser and its driver process
        browser.quit()

if __name__ == '__main__':
    main()
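Both `search()` and `next_page()` above retry by calling themselves whenever a `TimeoutException` is raised, so they never give up if the page keeps failing to load. One possible alternative is a retry helper with an attempt limit. The sketch below is an assumption, not part of the original script: `retry`, `max_attempts`, and `retry_on` are hypothetical names, and the built-in `TimeoutError` stands in for Selenium's `TimeoutException` so the example runs without a browser.

```python
def retry(action, max_attempts=3, retry_on=(TimeoutError,)):
    """Run action() up to max_attempts times, re-raising the last failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == max_attempts:
                raise
            print(f"attempt {attempt} timed out, retrying")


# Simulated page load that times out twice before succeeding.
calls = []


def flaky_page_load():
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("page did not load in time")
    return "loaded"


status = retry(flaky_page_load, max_attempts=5)
print(status, "after", len(calls), "calls")
```

To use something like this with the script, the recursive calls in the `except TimeoutException` branches would have to be removed first, so that the exception reaches the helper, and `retry_on=(TimeoutException,)` would be passed in.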

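The script turns two scraped strings into numbers: `main()` pulls the first run of digits out of the pager text, and `get_products()` drops the last three characters of the deal count with `[:-3]`. A slightly more defensive variant might use one regular expression for both, so that a missing or differently worded label yields `None` rather than an exception or a wrong slice. The helper name `first_int` and the sample strings are hypothetical, chosen only to resemble text the page might return.

```python
import re

_DIGITS = re.compile(r"\d+")


def first_int(text):
    """Return the first run of digits in text as an int, or None if absent."""
    match = _DIGITS.search(text or "")
    return int(match.group()) if match else None


# Sample strings are illustrative, not captured from a live page.
pages = first_int("共 100 页，")   # pager text
deals = first_int("1234人付款")    # deal count followed by a text suffix
empty = first_int("")              # nothing to parse
print(pages, deals, empty)
```

This only reads plain integers; a count written with a decimal point or a unit character would need extra handling.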