Target.com is one of the largest e-commerce and shopping marketplaces in the United States. It lets consumers buy everything from groceries and essentials to clothing and electronics, both online and in stores. As of September 2024, Target.com attracts more than 166 million web visits per month, according to SimilarWeb data.
The Target.com website exposes customer reviews, dynamic pricing information, product comparisons, product ratings, and more. That makes it a valuable data source for analysts, marketing teams, businesses, and researchers who want to track product trends, monitor competitor prices, or analyze customer sentiment through reviews.
By the end of this article, you will know how to use Python, Selenium, and ScraperAPI to collect product reviews and ratings from Target.com without getting blocked. You will also learn how to use the scraped data for sentiment analysis.
If you're as excited as I was while writing this tutorial, let's dive right in.
For those in a hurry, here is the complete code snippet we will build on throughout this tutorial:
import os
import time

from bs4 import BeautifulSoup
from dotenv import load_dotenv
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# Load environment variables
load_dotenv()


def target_com_scraper():
    """
    SCRAPER SETTINGS

    - API_KEY: Your ScraperAPI key. Get your API Key ==> https://www.scraperapi.com/?fp_ref=eunit
    """
    API_KEY = os.getenv("API_KEY", "yourapikey")

    # ScraperAPI proxy settings (with HTTP and HTTPS variants)
    scraper_api_proxies = {
        'proxy': {
            'http': f'http://scraperapi:{API_KEY}@proxy-server.scraperapi.com:8001',
            'https': f'http://scraperapi:{API_KEY}@proxy-server.scraperapi.com:8001',
            'no_proxy': 'localhost,127.0.0.1'
        }
    }

    # URLs to scrape
    url_list = [
        "https://www.target.com/p/enclosed-cat-litter-box-xl-up-up/-/A-90310047?preselect=87059440#lnk=sametab",
    ]

    # Store scraped data
    scraped_data = []

    # Setup Selenium options with proxy
    options = Options()
    # options.add_argument("--headless")  # Uncomment for headless mode
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-in-process-stack-traces")
    options.add_argument("--window-size=1920,1080")
    options.add_argument("--log-level=3")
    options.add_argument("--disable-logging")
    options.add_argument("--start-maximized")

    # Initialize Selenium WebDriver
    driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()),
                              options=options,
                              seleniumwire_options=scraper_api_proxies)

    def scroll_down_page(distance=100, delay=0.2):
        """
        Scroll down the page gradually until the end.

        Args:
        - distance: Number of pixels to scroll by in each step.
        - delay: Time (in seconds) to wait between scrolls.
        """
        total_height = driver.execute_script("return document.body.scrollHeight")
        scrolled_height = 0

        while scrolled_height < total_height:
            # Scroll down by 'distance' pixels
            driver.execute_script(f"window.scrollBy(0, {distance});")
            scrolled_height += distance
            time.sleep(delay)  # Pause between scrolls

            # Update the total page height after scrolling
            total_height = driver.execute_script("return document.body.scrollHeight")

        print("Finished scrolling.")

    try:
        for url in url_list:
            # Use Selenium to load the page
            driver.get(url)
            time.sleep(5)  # Give the page time to load

            # Scroll down the page
            scroll_down_page()

            # Extract single elements with Selenium
            def extract_element_text(selector, description):
                try:
                    # Wait for the element and extract text
                    element = WebDriverWait(driver, 5).until(
                        EC.visibility_of_element_located((By.CSS_SELECTOR, selector))
                    )
                    text = element.text.strip()
                    return text if text else None  # Return None if the text is empty
                except TimeoutException:
                    print(f"Timeout: Could not find {description}. Setting to None.")
                    return None
                except NoSuchElementException:
                    print(f"Element not found: {description}. Setting to None.")
                    return None

            # Extract single elements
            reviews_data = {}
            reviews_data["secondary_rating"] = extract_element_text("div[data-test='secondary-rating']", "secondary_rating")
            reviews_data["rating_count"] = extract_element_text("div[data-test='rating-count']", "rating_count")
            reviews_data["rating_histogram"] = extract_element_text("div[data-test='rating-histogram']", "rating_histogram")
            reviews_data["percent_recommended"] = extract_element_text("div[data-test='percent-recommended']", "percent_recommended")
            reviews_data["total_recommendations"] = extract_element_text("div[data-test='total-recommendations']", "total_recommendations")

            # Extract reviews from 'reviews-list'
            scraped_reviews = []

            # Use Beautiful Soup to extract other content
            soup = BeautifulSoup(driver.page_source, 'html.parser')

            # Select all reviews in the list using BeautifulSoup
            reviews_list = soup.select("div[data-test='reviews-list'] > div")

            for review in reviews_list:
                # Create a dictionary to store each review's data
                ratings = {}

                # Extract title
                title_element = review.select_one("h4[data-test='review-card--title']")
                ratings['title'] = title_element.text.strip() if title_element else None

                # Extract rating
                rating_element = review.select_one("span[data-test='ratings']")
                ratings['rating'] = rating_element.text.strip() if rating_element else None

                # Extract time
                time_element = review.select_one("span[data-test='review-card--reviewTime']")
                ratings['time'] = time_element.text.strip() if time_element else None

                # Extract review text
                text_element = review.select_one("div[data-test='review-card--text']")
                ratings['text'] = text_element.text.strip() if text_element else None

                # Append each review to the list of reviews
                scraped_reviews.append(ratings)

            # Append the list of reviews to the main product data
            reviews_data["reviews"] = scraped_reviews

            # Append the overall data to the scraped_data list
            scraped_data.append(reviews_data)

        # Output the scraped data
        print(f"Scraped data: {scraped_data}")

    except Exception as e:
        print(f"Error: {e}")

    finally:
        # Ensure driver quits after scraping
        driver.quit()


if __name__ == "__main__":
    target_com_scraper()
Check out the full code on GitHub: https://github.com/Eunit99/target_com_scraper. Want to understand every line of code? Let's build the web scraper from scratch together!
In a previous article, we covered everything you need to know to scrape Target.com product data. In this article, however, I will focus on scraping Target.com product ratings and reviews with Python and ScraperAPI.
To follow this tutorial and start scraping Target.com, you first need to take care of a few things.
Start with a free ScraperAPI account. ScraperAPI lets you start collecting data from millions of web sources through an easy-to-use web scraping API, without complex and expensive workarounds.
ScraperAPI unlocks even the toughest websites, lowers infrastructure and development costs, lets you deploy web scrapers faster, and gives you 1,000 free API credits to try it out first, among other benefits.
Use a code editor such as Visual Studio Code. Other options include Sublime Text or PyCharm.
Before you start scraping Target.com reviews, make sure the prerequisites above are in place.
It is best practice to use a virtual environment for Python projects to manage dependencies and avoid conflicts.
To create a virtual environment, run the following command in your terminal:
python3 -m venv env
Activate the virtual environment according to your operating system:
# On Unix or MacOS (bash shell):
source /path/to/venv/bin/activate
# On Unix or MacOS (csh shell):
source /path/to/venv/bin/activate.csh
# On Unix or MacOS (fish shell):
source /path/to/venv/bin/activate.fish
# On Windows (command prompt):
\path\to\venv\Scripts\activate.bat
# On Windows (PowerShell):
\path\to\venv\Scripts\Activate.ps1
Some IDEs can activate the virtual environment automatically.
To follow this article effectively, a basic understanding of CSS selectors is essential. CSS selectors are used to target specific HTML elements on a web page, which lets you extract exactly the information you need.
In addition, familiarity with your browser's developer tools is essential for inspecting and identifying the structure of web pages.
Once the prerequisites above are in place, it's time to set up your project. Start by creating a folder that will contain the source code of the Target.com scraper. In this case, I named my folder python-target-dot-com-scraper.
Run the following command to create a folder named python-target-dot-com-scraper:
mkdir python-target-dot-com-scraper
Move into the folder and create a new main.py Python file by running:
cd python-target-dot-com-scraper && touch main.py
Create a requirements.txt file by running:
touch requirements.txt
In this article, I will use Selenium and Beautiful Soup, along with the Webdriver Manager for Python library, to build the web scraper. Selenium will handle browser automation, and the Beautiful Soup library will extract data from the HTML content of the Target.com website. Meanwhile, Webdriver Manager for Python automatically manages the drivers for different browsers.
Add the following lines to your requirements.txt file to specify the necessary packages:
selenium~=4.25.0
bs4~=0.0.2
python-dotenv~=1.0.1
webdriver_manager
selenium-wire
blinker==1.7.0
To install the packages, run the following command:
pip install -r requirements.txt
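Since main.py loads the ScraperAPI key with load_dotenv() and os.getenv("API_KEY"), also create a .env file in the project root to hold the key. The file name and variable name below match what the code expects; the value is a placeholder:

# .env — keep this file out of version control
API_KEY=your_scraperapi_key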
In this section, I will walk you through the step-by-step process of getting product ratings and reviews from a Target.com product page.
I will focus on the reviews and ratings in the sections of the site highlighted in the screenshots of the product page.
Before going any further, you need to understand the HTML structure and identify the DOM selectors associated with the HTML tags that wrap the information we want to extract. In the next section, I will walk you through using Chrome DevTools to understand the structure of the Target.com site.
Open Chrome DevTools by pressing F12 or by right-clicking anywhere on the page and selecting "Inspect", then inspect the product page at the URL above.
From that inspection, here are all the DOM selectors the web scraper will use to extract the information:
| Information | DOM selector | Value |
|---|---|---|
| Product ratings | | |
| Rating value | div[data-test='rating-value'] | 4.7 |
| Rating count | div[data-test='rating-count'] | 683 star ratings |
| Secondary rating | div[data-test='secondary-rating'] | 683 star ratings |
| Rating histogram | div[data-test='rating-histogram'] | 5 stars 85%, 4 stars 8%, 3 stars 3%, 2 stars 1%, 1 star 2% |
| Percent recommended | div[data-test='percent-recommended'] | 89% would recommend |
| Total recommendations | div[data-test='total-recommendations'] | 125 recommendations |
| Product reviews | | |
| Reviews list | div[data-test='reviews-list'] | Returns child elements, each corresponding to an individual product review |
| Review card title | h4[data-test='review-card--title'] | Perfect litter box for cats |
| Ratings | span[data-test='ratings'] | 4.7 out of 5 stars with 683 reviews |
| Review time | span[data-test='review-card--reviewTime'] | 23 days ago |
| Review card text | div[data-test='review-card--text'] | My cats love it. Doesn't take up much space either |
Now that we have outlined all the requirements and located the elements we are interested in on the Target.com product review page, let's move on to the next step: importing the necessary modules.
import os
import time

from bs4 import BeautifulSoup
from dotenv import load_dotenv
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager
In this code, each module serves a specific purpose in building our web scraper: os and python-dotenv load the ScraperAPI key from environment variables, time adds pauses while pages load and scroll, BeautifulSoup parses the rendered HTML, Selenium's Options, Service, By, expected_conditions, and WebDriverWait configure Chrome and wait for elements to become visible, seleniumwire's webdriver routes browser traffic through the ScraperAPI proxy, and webdriver_manager automatically installs a matching ChromeDriver.
In this step, you initialize Selenium's Chrome WebDriver and configure important browser options. These include disabling unnecessary features to improve performance, setting the window size, and managing logging. You instantiate the WebDriver with webdriver.Chrome(), which controls the browser throughout the scraping process:
# Setup Selenium options with proxy
options = Options()
# options.add_argument("--headless")  # Uncomment for headless mode
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-extensions")
options.add_argument("--disable-in-process-stack-traces")
options.add_argument("--window-size=1920,1080")
options.add_argument("--log-level=3")
options.add_argument("--disable-logging")
options.add_argument("--start-maximized")

# Initialize Selenium WebDriver
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()),
                          options=options,
                          seleniumwire_options=scraper_api_proxies)
In this section, we create a function that scrolls through the entire page. Target.com loads additional content, such as reviews, dynamically as the user scrolls down:
def scroll_down_page(distance=100, delay=0.2):
    """
    Scroll down the page gradually until the end.

    Args:
    - distance: Number of pixels to scroll by in each step.
    - delay: Time (in seconds) to wait between scrolls.
    """
    total_height = driver.execute_script("return document.body.scrollHeight")
    scrolled_height = 0

    while scrolled_height < total_height:
        # Scroll down by 'distance' pixels
        driver.execute_script(f"window.scrollBy(0, {distance});")
        scrolled_height += distance
        time.sleep(delay)  # Pause between scrolls

        # Update the total page height after scrolling
        total_height = driver.execute_script("return document.body.scrollHeight")

    print("Finished scrolling.")
The scroll_down_page() function scrolls the web page gradually by a fixed number of pixels (distance), with a short pause (delay) between scrolls. It first calculates the total height of the page and scrolls down until it reaches the bottom. As it scrolls, the total page height is updated dynamically to account for new content that may load along the way.
In this section, we combine the strengths of Selenium and BeautifulSoup to create an efficient and reliable scraping setup. While Selenium is used to interact with dynamic content, such as loading the page and handling JavaScript-rendered elements, BeautifulSoup is more effective at parsing and extracting static HTML. We first use Selenium to navigate the page and wait for specific elements, such as product ratings and review counts, to load. These elements are extracted with Selenium's WebDriverWait helper, which ensures the data is visible before it is captured. However, handling the individual reviews through Selenium alone can become complicated and inefficient.
With BeautifulSoup, we simplify the process of looping through the many reviews on the page. Once Selenium has fully loaded the page, BeautifulSoup parses the HTML content to extract the reviews efficiently. Using BeautifulSoup's select() and select_one() methods, we can navigate the page structure and collect the title, rating, time, and text of each review. Compared to managing everything through Selenium alone, this approach gives cleaner, more structured scraping of repeated elements (such as the list of reviews) and more flexibility in processing the HTML. Here is the full extraction step:
try:
    for url in url_list:
        # Use Selenium to load the page
        driver.get(url)
        time.sleep(5)  # Give the page time to load

        # Scroll down the page
        scroll_down_page()

        # Extract single elements with Selenium
        def extract_element_text(selector, description):
            try:
                # Wait for the element and extract text
                element = WebDriverWait(driver, 5).until(
                    EC.visibility_of_element_located((By.CSS_SELECTOR, selector))
                )
                text = element.text.strip()
                return text if text else None  # Return None if the text is empty
            except TimeoutException:
                print(f"Timeout: Could not find {description}. Setting to None.")
                return None
            except NoSuchElementException:
                print(f"Element not found: {description}. Setting to None.")
                return None

        # Extract single elements
        reviews_data = {}
        reviews_data["secondary_rating"] = extract_element_text("div[data-test='secondary-rating']", "secondary_rating")
        reviews_data["rating_count"] = extract_element_text("div[data-test='rating-count']", "rating_count")
        reviews_data["rating_histogram"] = extract_element_text("div[data-test='rating-histogram']", "rating_histogram")
        reviews_data["percent_recommended"] = extract_element_text("div[data-test='percent-recommended']", "percent_recommended")
        reviews_data["total_recommendations"] = extract_element_text("div[data-test='total-recommendations']", "total_recommendations")

        # Extract reviews from 'reviews-list'
        scraped_reviews = []

        # Use Beautiful Soup to extract other content
        soup = BeautifulSoup(driver.page_source, 'html.parser')

        # Select all reviews in the list using BeautifulSoup
        reviews_list = soup.select("div[data-test='reviews-list'] > div")

        for review in reviews_list:
            # Create a dictionary to store each review's data
            ratings = {}

            # Extract title
            title_element = review.select_one("h4[data-test='review-card--title']")
            ratings['title'] = title_element.text.strip() if title_element else None

            # Extract rating
            rating_element = review.select_one("span[data-test='ratings']")
            ratings['rating'] = rating_element.text.strip() if rating_element else None

            # Extract time
            time_element = review.select_one("span[data-test='review-card--reviewTime']")
            ratings['time'] = time_element.text.strip() if time_element else None

            # Extract review text
            text_element = review.select_one("div[data-test='review-card--text']")
            ratings['text'] = text_element.text.strip() if text_element else None

            # Append each review to the list of reviews
            scraped_reviews.append(ratings)

        # Append the list of reviews to the main product data
        reviews_data["reviews"] = scraped_reviews

        # Append the overall data to the scraped_data list
        scraped_data.append(reviews_data)

    # Output the scraped data
    print(f"Scraped data: {scraped_data}")

except Exception as e:
    print(f"Error: {e}")

finally:
    # Ensure driver quits after scraping
    driver.quit()
When scraping complex websites, especially those with strong anti-bot measures such as Target.com, challenges like IP bans, rate limits, and access restrictions come up frequently. Using Selenium for such tasks gets complicated, particularly when deploying headless browsers. Headless browsers allow interaction without a GUI, but managing proxies manually in that environment is challenging: you have to configure proxy settings, rotate IPs, and handle other interactions such as JavaScript rendering, which makes scraping slower and more error-prone.
In contrast, ScraperAPI significantly simplifies this process by managing proxies automatically. Instead of requiring manual configuration in Selenium, ScraperAPI's proxy mode distributes requests across multiple IP addresses, ensuring smoother scraping without worrying about IP bans, rate limits, or geo-restrictions. This is especially useful with headless browsers, where handling dynamic content and complex site interactions already requires extra code.
Integrating ScraperAPI's proxy mode with Selenium is straightforward when you use Selenium Wire, a tool that allows easy proxy configuration, and the setup is quick.
Once integrated, this configuration enables smoother interaction with dynamic pages, automatic rotation of IP addresses, and bypassing of rate limits, without the hassle of manually managing proxies in a headless browser environment.
The code snippet below demonstrates how to configure ScraperAPI's proxy in Python:
# Load environment variables
load_dotenv()

# API_KEY: Your ScraperAPI key. Get your API Key ==> https://www.scraperapi.com/?fp_ref=eunit
API_KEY = os.getenv("API_KEY", "yourapikey")

# ScraperAPI proxy settings (with HTTP and HTTPS variants)
scraper_api_proxies = {
    'proxy': {
        'http': f'http://scraperapi:{API_KEY}@proxy-server.scraperapi.com:8001',
        'https': f'http://scraperapi:{API_KEY}@proxy-server.scraperapi.com:8001',
        'no_proxy': 'localhost,127.0.0.1'
    }
}

# Pass the proxy settings to Selenium Wire when creating the driver
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()),
                          options=options,
                          seleniumwire_options=scraper_api_proxies)
With this setup, requests sent through the ScraperAPI proxy server are forwarded to the Target.com website while your real IP stays hidden, providing a strong defense against Target.com's anti-scraping mechanisms. You can also customize the proxy by including parameters such as render=true for JavaScript rendering or a country code for geotargeting.
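For illustration, the snippet below appends those parameters to the proxy username, following ScraperAPI's proxy-mode convention; treat the exact syntax as an assumption and verify it against the current ScraperAPI documentation:

# Assumed parameter syntax for ScraperAPI's proxy mode; confirm in the official docs
API_KEY = os.getenv("API_KEY", "yourapikey")

scraper_api_proxies = {
    'proxy': {
        # render=true requests JavaScript rendering; country_code=us requests US-based IPs
        'http': f'http://scraperapi.render=true.country_code=us:{API_KEY}@proxy-server.scraperapi.com:8001',
        'https': f'http://scraperapi.render=true.country_code=us:{API_KEY}@proxy-server.scraperapi.com:8001',
        'no_proxy': 'localhost,127.0.0.1'
    }
}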
The JSON below shows the shape of the review data the scraper returns:
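(The values are illustrative only, taken from the example product shown in the selector table earlier; real output varies by product and over time.)

[
  {
    "secondary_rating": "683 star ratings",
    "rating_count": "683 star ratings",
    "rating_histogram": "5 stars 85%, 4 stars 8%, 3 stars 3%, 2 stars 1%, 1 star 2%",
    "percent_recommended": "89% would recommend",
    "total_recommendations": "125 recommendations",
    "reviews": [
      {
        "title": "Perfect litter box for cats",
        "rating": "4.7 out of 5 stars with 683 reviews",
        "time": "23 days ago",
        "text": "My cats love it. Doesn't take up much space either"
      }
    ]
  }
]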
If you want to get Target.com reviews quickly, without setting up an environment, knowing how to code, or configuring proxies, you can use our Target Scraper API to get the data you need for free. The Target Scraper API is hosted on the Apify platform and requires no setup to use.
Head over to Apify and click "Try for free" to get started.
Now that you have Target.com review and rating data, it's time to make sense of it. Review and rating data provide valuable insight into how customers feel about a particular product or service. By analyzing these reviews, you can identify common praise and complaints, gauge customer satisfaction, predict future behavior, and turn the reviews into actionable insights.
As a marketing professional or business owner looking to better understand your key audience, you can turn this data into actionable insights that optimize your marketing efforts, sharpen your product strategy, and increase customer engagement.
By using ScraperAPI to collect review data at scale, you can automate and scale sentiment analysis, enabling better decision-making and growth.
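As a minimal sketch of what that analysis could look like, here is one way to score each scraped review with NLTK's VADER sentiment analyzer. It assumes the scraped_data structure produced by the scraper above and that the nltk package is installed; NLTK/VADER is my choice for this example, not something the original setup requires:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

# `scraped_data` is assumed to be the list built by target_com_scraper() above
for product in scraped_data:
    for review in product["reviews"]:
        text = review.get("text") or ""
        scores = sia.polarity_scores(text)  # returns neg/neu/pos/compound scores
        label = ("positive" if scores["compound"] >= 0.05
                 else "negative" if scores["compound"] <= -0.05
                 else "neutral")
        print(f"{label:>8}  {scores['compound']:+.2f}  {review.get('title')}")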
Yes, scraping publicly available information from Target.com, such as product ratings and reviews, is legal. It's important to remember, though, that this public information may still contain personal details.
We wrote a blog post about the legal aspects and ethical considerations of web scraping, where you can learn more.
Yes, Target.com implements various anti-scraping measures to block automated scraping. These include IP blocking, rate limiting, and CAPTCHA challenges, all designed to detect and stop excessive automated requests from scrapers or bots.
To avoid getting blocked by Target.com, slow down your request rate, rotate user agents, use CAPTCHA-solving techniques, and avoid sending repetitive or high-frequency requests. Combining these methods with proxies helps reduce the likelihood of detection.
Additionally, consider using dedicated scraping tools such as the Target Scraper API or our Scraping API to bypass these Target.com restrictions.
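As a minimal sketch of two of those mitigations, randomized delays and a rotating user agent, you might extend the Selenium setup like this (the user-agent strings and the polite_get helper are illustrative, not part of the scraper above):

import random
import time

from selenium.webdriver.chrome.options import Options

# Illustrative user-agent strings; swap in current, realistic values
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36",
]

options = Options()
options.add_argument(f"user-agent={random.choice(USER_AGENTS)}")  # pick a different user agent per run

def polite_get(driver, url, min_delay=3.0, max_delay=8.0):
    """Load a URL, then pause for a random interval to avoid a regular request pattern."""
    driver.get(url)
    time.sleep(random.uniform(min_delay, max_delay))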
Yes, using proxies is essential for scraping Target.com effectively. Proxies help distribute requests across multiple IP addresses, minimizing the chance of being blocked. ScraperAPI's proxies hide your real IP, making it harder for anti-scraping systems to detect your activity.
In this article, you learned how to build a Target.com ratings and reviews scraper with Python and Selenium, and how to use ScraperAPI to bypass Target.com's anti-scraping mechanisms, avoid IP bans, and improve scraping performance.
With this tool, you can collect valuable customer feedback efficiently and reliably.
Once you have collected the data, the next step is to use sentiment analysis to uncover key insights. By analyzing customer reviews, you can identify product strengths, address pain points, and refine your marketing strategy to better meet customer needs.
By using the Target Scraper API for large-scale data collection, you can continuously monitor reviews and stay ahead in understanding customer sentiment, allowing you to sharpen product development and create more targeted marketing campaigns.
Try ScraperAPI today for seamless large-scale data extraction, or use our Cloud Target.com Reviews Scraper!
For more tutorials and great content, follow me on Twitter (X) @eunit99.