
How to Scrape Target.com Reviews with Python

By DDD (original)
2025-01-06 06:41:41

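Introduction

Target.com is one of the largest e-commerce and shopping marketplaces in the United States. It lets consumers buy everything from groceries and essentials to clothing and electronics, both online and in store. As of September 2024, Target.com attracts more than 166 million monthly visits, according to SimilarWeb data.

The Target.com website offers customer reviews, dynamic pricing information, product comparisons, product ratings, and more. That makes it a valuable data source for analysts, marketing teams, businesses, and researchers who want to track product trends, monitor competitor prices, or analyze customer sentiment through reviews.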


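In this article, you will learn how to:

  • Set up and install Python, Selenium, and Beautiful Soup for web scraping
  • Scrape product reviews and ratings from Target.com using Python
  • Use ScraperAPI to bypass Target.com's anti-scraping mechanisms effectively
  • Implement proxies to avoid IP bans and improve scraping performance

By the end of this article, you will know how to collect product reviews and ratings from Target.com with Python, Selenium, and ScraperAPI without getting blocked. You will also learn how to use the scraped data for sentiment analysis.

If you are as excited as I was while writing this tutorial, let's dive right in.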

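TL;DR: Scraping Target Product Reviews [Full Code]

For those in a hurry, here is the complete code snippet that we will build up over the course of this tutorial: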

import os
import time

from bs4 import BeautifulSoup
from dotenv import load_dotenv
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# Load environment variables
load_dotenv()


def target_com_scraper():
    """
    SCRAPER SETTINGS

    - API_KEY: Your ScraperAPI key. Get your API Key ==> https://www.scraperapi.com/?fp_ref=eunit
    """

    API_KEY = os.getenv("API_KEY", "yourapikey")

    # ScraperAPI proxy settings (with HTTP and HTTPS variants)
    scraper_api_proxies = {
        'proxy': {
            'http': f'http://scraperapi:{API_KEY}@proxy-server.scraperapi.com:8001',
            'https': f'http://scraperapi:{API_KEY}@proxy-server.scraperapi.com:8001',
            'no_proxy': 'localhost,127.0.0.1'
        }
    }

    # URLs to scrape
    url_list = [
        "https://www.target.com/p/enclosed-cat-litter-box-xl-up-up/-/A-90310047?preselect=87059440#lnk=sametab",
    ]

    # Store scraped data
    scraped_data = []

    # Setup Selenium options with proxy
    options = Options()
    # options.add_argument("--headless")  # Uncomment for headless mode
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-in-process-stack-traces")
    options.add_argument("--window-size=1920,1080")
    options.add_argument("--log-level=3")
    options.add_argument("--disable-logging")
    options.add_argument("--start-maximized")

    # Initialize Selenium WebDriver
    driver = webdriver.Chrome(service=ChromeService(
        ChromeDriverManager().install()), options=options, seleniumwire_options=scraper_api_proxies)

    def scroll_down_page(distance=100, delay=0.2):
        """
        Scroll down the page gradually until the end.

        Args:
        - distance: Number of pixels to scroll by in each step.
        - delay: Time (in seconds) to wait between scrolls.
        """
        total_height = driver.execute_script(
            "return document.body.scrollHeight")
        scrolled_height = 0

        while scrolled_height < total_height:
            # Scroll down by 'distance' pixels
            driver.execute_script(f"window.scrollBy(0, {distance});")
            scrolled_height += distance
            time.sleep(delay)  # Pause between scrolls

            # Update the total page height after scrolling
            total_height = driver.execute_script(
                "return document.body.scrollHeight")

        print("Finished scrolling.")

    try:
        for url in url_list:

            # Use Selenium to load the page
            driver.get(url)
            time.sleep(5)  # Give the page time to load

            # Scroll down the page
            scroll_down_page()

            # Extract single elements with Selenium
            def extract_element_text(selector, description):
                try:
                    # Wait for the element and extract text
                    element = WebDriverWait(driver, 5).until(
                        EC.visibility_of_element_located(
                            (By.CSS_SELECTOR, selector))
                    )
                    text = element.text.strip()
                    return text if text else None  # Return None if the text is empty
                except TimeoutException:
                    print(f"Timeout: Could not find {description}. Setting to None.")
                    return None
                except NoSuchElementException:
                    print(f"Element not found: {description}. Setting to None.")
                    return None

            # Extract single elements
            reviews_data = {}

            reviews_data["secondary_rating"] = extract_element_text("div[data-test='secondary-rating']",
                                                                    "secondary_rating")
            reviews_data["rating_count"] = extract_element_text(
                "div[data-test='rating-count']", "rating_count")
            reviews_data["rating_histogram"] = extract_element_text("div[data-test='rating-histogram']",
                                                                    "rating_histogram")
            reviews_data["percent_recommended"] = extract_element_text("div[data-test='percent-recommended']",
                                                                       "percent_recommended")
            reviews_data["total_recommendations"] = extract_element_text("div[data-test='total-recommendations']",
                                                                         "total_recommendations")

            # Extract reviews from 'reviews-list'
            scraped_reviews = []

            # Use Beautiful Soup to extract other content
            soup = BeautifulSoup(driver.page_source, 'html.parser')

            # Select all reviews in the list using BeautifulSoup
            reviews_list = soup.select("div[data-test='reviews-list'] > div")

            for review in reviews_list:
                # Create a dictionary to store each review's data
                ratings = {}

                # Extract title
                title_element = review.select_one(
                    "h4[data-test='review-card--title']")
                ratings['title'] = title_element.text.strip(
                ) if title_element else None

                # Extract rating
                rating_element = review.select_one("span[data-test='ratings']")
                ratings['rating'] = rating_element.text.strip(
                ) if rating_element else None

                # Extract time
                time_element = review.select_one(
                    "span[data-test='review-card--reviewTime']")
                ratings['time'] = time_element.text.strip(
                ) if time_element else None

                # Extract review text
                text_element = review.select_one(
                    "div[data-test='review-card--text']")
                ratings['text'] = text_element.text.strip(
                ) if text_element else None

                # Append each review to the list of reviews
                scraped_reviews.append(ratings)

            # Append the list of reviews to the main product data
            reviews_data["reviews"] = scraped_reviews

            # Append the overall data to the scraped_data list
            scraped_data.append(reviews_data)

        # Output the scraped data
        print(f"Scraped data: {scraped_data}")

    except Exception as e:
        print(f"Error: {e}")
    finally:
        # Ensure driver quits after scraping
        driver.quit()


if __name__ == "__main__":
    target_com_scraper()

See the full code on GitHub: https://github.com/Eunit99/target_com_scraper. Want to understand every line of the code? Let's build the web scraper from scratch together!

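How to Scrape Target.com Reviews with Python and ScraperAPI

A previous article covered everything you need to know about scraping Target.com product data. In this article, I focus on scraping product ratings and reviews from Target.com with Python and ScraperAPI.

Prerequisites

To follow this tutorial and start scraping Target.com, you need a few things in place first.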

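1. A ScraperAPI account

Start with a free ScraperAPI account. ScraperAPI lets you collect data from millions of web sources through our easy-to-use web scraping API, without complex and expensive workarounds.

ScraperAPI unlocks even the toughest sites, reduces infrastructure and development costs, lets you deploy web scrapers faster, and gives you 1,000 free API credits to try it out first.

2. A text editor or IDE

Use a code editor such as Visual Studio Code. Other options include Sublime Text and PyCharm.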

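3. Project requirements and virtual environment setup

Before you start scraping Target.com reviews, make sure you have the following:

  • Python installed on your machine (version 3.10 or later)
  • pip (the Python package installer)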
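A quick way to confirm the Python requirement is to compare the interpreter's version tuple against the minimum. The sketch below is only an illustration; the helper name meets_minimum is an assumption and is not part of the scraper.

```python
import sys

MIN_VERSION = (3, 10)  # minimum version stated in the prerequisites


def meets_minimum(version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor, ...) tuple is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    found = sys.version.split()[0]
    if meets_minimum(sys.version_info):
        print("Python version OK:", found)
    else:
        print("Python 3.10 or later is required; found", found)
```

Tuples compare element by element, so (3, 9) sorts below (3, 10) even though "3.9" would sort above "3.10" as a string.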

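It is best practice to use a virtual environment for Python projects to manage dependencies and avoid conflicts.

To create a virtual environment, run the following command in your terminal: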

python3 -m venv env

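4. Activate the virtual environment

Activate the virtual environment with the command for your operating system and shell:

# On Unix or MacOS (bash shell):
source /path/to/venv/bin/activate

# On Unix or MacOS (csh shell):
source /path/to/venv/bin/activate.csh

# On Unix or MacOS (fish shell):
source /path/to/venv/bin/activate.fish

# On Windows (command prompt):
\path\to\venv\Scripts\activate.bat

# On Windows (PowerShell):
\path\to\venv\Scripts\Activate.ps1

Some IDEs can activate the virtual environment automatically.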
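If you are unsure whether the environment is active, Python can tell you. This is a minimal sketch that relies on the standard sys.prefix and sys.base_prefix attributes; the function name in_virtualenv and its optional arguments are assumptions added so the logic can be tried with sample paths.

```python
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter runs inside a venv.

    Inside a venv, sys.prefix points at the environment while
    sys.base_prefix still points at the base Python installation.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


if __name__ == "__main__":
    print("Virtual environment active:", in_virtualenv())
```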

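5. Basic knowledge of CSS selectors and browser DevTools

To follow this article effectively, you need a basic understanding of CSS selectors. CSS selectors target specific HTML elements on a web page, which lets you extract the information you need.

Familiarity with browser DevTools is also essential for inspecting and identifying the structure of a web page.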
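As a small illustration of the two selector forms this tutorial relies on, the sketch below runs Beautiful Soup against a hand-written HTML string. The markup is hypothetical and only mimics the data-test attributes used later in the article; it is not copied from Target.com.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the data-test attributes used in this article.
SAMPLE_HTML = """
<div data-test="reviews-list">
  <div>
    <h4 data-test="review-card--title">Perfect litter box for cats</h4>
    <div data-test="review-card--text">My cats love it.</div>
  </div>
  <div>
    <h4 data-test="review-card--title">Roomy</h4>
    <div data-test="review-card--text">Doesn't take up much space either</div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# tag[attribute='value'] matches elements by an exact attribute value.
titles = [h4.get_text(strip=True)
          for h4 in soup.select("h4[data-test='review-card--title']")]

# 'parent > child' matches direct children only, so nested divs are skipped.
cards = soup.select("div[data-test='reviews-list'] > div")

print(titles)
print(len(cards))
```

The child combinator matters here: without the >, the selector would also match the text divs nested inside each card.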

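Project setup

With the prerequisites above in place, it is time to set up your project. Start by creating a folder that will hold the source code of the Target.com scraper. In this case, I named my folder python-target-dot-com-scraper.

Run the following command to create a folder named python-target-dot-com-scraper: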

mkdir python-target-dot-com-scraper

Move into the folder and create a new Python file, main.py, by running the following command:

cd python-target-dot-com-scraper && touch main.py

Create a requirements.txt file by running the following command:

touch requirements.txt

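In this article, I use Selenium, Beautiful Soup, and Webdriver Manager for Python to build the web scraper. Selenium handles browser automation, and Beautiful Soup extracts data from the HTML content of the Target.com website. Webdriver Manager for Python provides a way to manage drivers for different browsers automatically.

Add the following lines to your requirements.txt file to specify the necessary packages:

selenium~=4.25.0
bs4~=0.0.2
python-dotenv~=1.0.1
webdriver_manager
selenium-wire
blinker==1.7.0

To install the packages, run the following command:

pip install -r requirements.txt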
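After installation, you may want to confirm that every dependency can be imported. The sketch below uses importlib from the standard library. The list of import names is my mapping from the pip package names (for example, python-dotenv is imported as dotenv), and the helper name missing_modules is an assumption.

```python
import importlib.util

# Import names (not pip names) for the packages listed in requirements.txt.
REQUIRED_MODULES = ["selenium", "bs4", "dotenv",
                    "webdriver_manager", "seleniumwire", "blinker"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("All dependencies found." if not missing else f"Missing: {missing}")
```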

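Extracting Target.com product reviews with Selenium

In this section, I walk you through, step by step, how to get product ratings and reviews from a product page on Target.com.

I focus on the reviews and ratings in the sections of the page highlighted in the screenshot below:

[Screenshot: Target.com product page with the reviews and ratings sections highlighted]

Before going any further, you need to understand the HTML structure and identify the DOM selectors of the HTML tags that wrap the information we want to extract. In the next section, I walk you through using Chrome DevTools to understand Target.com's site structure.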

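Understanding Target.com's site structure with Chrome DevTools

Open Chrome DevTools by pressing F12, or by right-clicking anywhere on the page and selecting Inspect. Inspecting the product page at the URL above reveals the following:

[Screenshots: Chrome DevTools showing the ratings and reviews markup on the Target.com product page]

From the screenshots above, these are all the DOM selectors the web scraper uses to extract information: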

Information | DOM selector | Value

Product ratings
Rating value | div[data-test='rating-value'] | 4.7
Rating count | div[data-test='rating-count'] | 683 star ratings
Secondary rating | div[data-test='secondary-rating'] | 683 star ratings
Rating histogram | div[data-test='rating-histogram'] | 5 stars 85%, 4 stars 8%, 3 stars 3%, 2 stars 1%, 1 star 2%
Percent recommended | div[data-test='percent-recommended'] | 89% would recommend
Total recommendations | div[data-test='total-recommendations'] | 125 recommendations

Product reviews
Reviews list | div[data-test='reviews-list'] | Returns child elements, one per individual product review
Review card title | h4[data-test='review-card--title'] | Perfect litter box for cats
Ratings | span[data-test='ratings'] | 4.7 out of 5 stars with 683 reviews
Review time | span[data-test='review-card--reviewTime'] | 23 days ago
Review card text | div[data-test='review-card--text'] | My cats love it. Doesn't take up much space either
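The values in the table are raw strings, so you will probably want numbers before analyzing them. Here is a minimal sketch of how they might be parsed with regular expressions. The helper names parse_histogram and first_number are assumptions, and the patterns are written against the sample values above rather than against any guaranteed Target.com format.

```python
import re


def parse_histogram(text):
    """Turn '5 stars 85%4 stars 8%...' into {5: 85, 4: 8, ...}."""
    pairs = re.findall(r"(\d)\s*stars?\s*(\d+)%", text)
    return {int(stars): int(percent) for stars, percent in pairs}


def first_number(text):
    """Return the first number in a string as a float, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


if __name__ == "__main__":
    print(parse_histogram("5 stars 85%4 stars 8%3 stars 3%2 stars 1%1 star 2%"))
    print(first_number("4.7 out of 5 stars with 683 reviews"))
    print(first_number("89% would recommend"))
```

The \s* in the histogram pattern also accepts line breaks, which helps if the element text arrives with one entry per line.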

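Building your Target reviews scraper

Now that we have outlined all the requirements and located the elements we are interested in on the Target.com product reviews page, we can move on to the next step and import the necessary modules:

1. Import Selenium and other modules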

import os
import time

from bs4 import BeautifulSoup
from dotenv import load_dotenv
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# Load environment variables
load_dotenv()



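In this code, each module serves a specific purpose in building our web scraper:

  • os handles environment variables such as the API key.
  • time introduces delays while pages load.
  • dotenv loads the API key from a .env file.
  • selenium enables browser automation and interaction.
  • webdriver_manager installs ChromeDriver automatically.
  • BeautifulSoup parses HTML to extract data.
  • seleniumwire manages the proxy so you can scrape without IP bans.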
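The full script falls back to the placeholder "yourapikey" when API_KEY is missing, which only fails later at the proxy. One possible refinement, shown here as a sketch, is to fail early with a clear message. The function name get_api_key is an assumption; the API_KEY variable name and the placeholder come from the script.

```python
import os

PLACEHOLDER = "yourapikey"  # default value used in the full script


def get_api_key(env=os.environ):
    """Return the ScraperAPI key from the environment, or raise if it is unset."""
    key = env.get("API_KEY", PLACEHOLDER).strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(
            "API_KEY is not set; add it to your .env file or environment.")
    return key
```

In the scraper, you would call load_dotenv() first so that values from the .env file are visible in os.environ, then call get_api_key().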

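2. Set up the WebDriver

In this step, you initialize Selenium's Chrome WebDriver and configure important browser options. These options include disabling unnecessary features to improve performance, setting the window size, and managing logging. You instantiate the WebDriver with webdriver.Chrome() to control the browser throughout the scraping process.

    # Setup Selenium options with proxy
    options = Options()
    # options.add_argument("--headless")  # Uncomment for headless mode
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-in-process-stack-traces")
    options.add_argument("--window-size=1920,1080")
    options.add_argument("--log-level=3")
    options.add_argument("--disable-logging")
    options.add_argument("--start-maximized")

    # Initialize Selenium WebDriver
    driver = webdriver.Chrome(service=ChromeService(
        ChromeDriverManager().install()), options=options, seleniumwire_options=scraper_api_proxies)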
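The seleniumwire_options argument in the full script carries the ScraperAPI proxy settings. As a sketch, the same dictionary could be produced by a small helper so that the key is interpolated in one place. The function name build_proxy_options is an assumption; the host, port, and dictionary layout are taken from the script.

```python
def build_proxy_options(api_key, host="proxy-server.scraperapi.com", port=8001):
    """Build the dict passed to webdriver.Chrome(seleniumwire_options=...)."""
    proxy_url = f"http://scraperapi:{api_key}@{host}:{port}"
    return {
        "proxy": {
            "http": proxy_url,
            # As in the script, HTTPS traffic uses the same http:// proxy URL.
            "https": proxy_url,
            # Local addresses bypass the proxy.
            "no_proxy": "localhost,127.0.0.1",
        }
    }


if __name__ == "__main__":
    print(build_proxy_options("yourapikey"))
```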

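Creating a scroll-to-bottom function

In this section, we create a function that scrolls through the entire page. The Target.com website loads additional content (such as reviews) dynamically as the user scrolls down.

    def scroll_down_page(distance=100, delay=0.2):
        """
        Scroll down the page gradually until the end.

        Args:
        - distance: Number of pixels to scroll by in each step.
        - delay: Time (in seconds) to wait between scrolls.
        """
        total_height = driver.execute_script(
            "return document.body.scrollHeight")
        scrolled_height = 0

        while scrolled_height < total_height:
            # Scroll down by 'distance' pixels
            driver.execute_script(f"window.scrollBy(0, {distance});")
            scrolled_height += distance
            time.sleep(delay)  # Pause between scrolls

            # Update the total page height after scrolling
            total_height = driver.execute_script(
                "return document.body.scrollHeight")

        print("Finished scrolling.")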

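The scroll_down_page() function scrolls the page gradually by a set number of pixels (distance), with a short pause (delay) between scrolls. It first reads the total height of the page and scrolls down until it reaches the bottom. As it scrolls, the total page height is refreshed on every step to account for new content that may load along the way.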
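To see why the height is re-read on every step, the loop can be exercised without a browser. The sketch below swaps the real driver for a hypothetical FakeDriver whose page grows once, the way lazy-loaded reviews would. It also adds a max_steps guard, which is my addition and not part of the original function, and leaves out the delay so it runs instantly.

```python
class FakeDriver:
    """Stand-in for a Selenium driver; the page grows once, like lazy-loaded reviews."""

    def __init__(self, height=300, extra=200, grow_at=200):
        self.height, self.extra, self.grow_at = height, extra, grow_at
        self.position = 0

    def execute_script(self, script):
        if script.startswith("window.scrollBy"):
            # Parse the pixel distance out of "window.scrollBy(0, N);"
            self.position += int(script.split(",")[1].strip(" );"))
            if self.extra and self.position >= self.grow_at:
                self.height += self.extra  # simulate newly loaded content
                self.extra = 0
            return None
        return self.height  # answers "return document.body.scrollHeight"


def scroll_down_page(driver, distance=100, max_steps=1000):
    """Scroll in steps until the (possibly growing) page height is reached."""
    total_height = driver.execute_script("return document.body.scrollHeight")
    scrolled, steps = 0, 0
    while scrolled < total_height and steps < max_steps:
        driver.execute_script(f"window.scrollBy(0, {distance});")
        scrolled += distance
        steps += 1
        # Re-read the height so content loaded during scrolling is covered too.
        total_height = driver.execute_script("return document.body.scrollHeight")
    return steps


if __name__ == "__main__":
    print("steps with lazy loading:", scroll_down_page(FakeDriver()))
    print("steps on a static page:", scroll_down_page(FakeDriver(extra=0)))
```

A step limit like max_steps can be useful on pages with infinite feeds, where the height keeps growing and the loop would otherwise never end.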

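Combining Selenium with BeautifulSoup

In this section, we combine the strengths of Selenium and BeautifulSoup to create an efficient and reliable scraping setup. Selenium is used to interact with dynamic content, such as loading the page and handling JavaScript-rendered elements, while BeautifulSoup is better suited to parsing and extracting static HTML elements. We first use Selenium to navigate to the page and wait for specific elements to load, such as the product rating and the review count. These elements are extracted with Selenium's WebDriverWait, which ensures the data is visible before it is captured. Handling individual reviews through Selenium alone, however, can become complicated and inefficient.

With BeautifulSoup, we simplify the process of looping through the multiple reviews on the page. Once Selenium has fully loaded the page, BeautifulSoup parses the HTML content to extract the reviews efficiently. Using BeautifulSoup's select() and select_one() methods, we can navigate the page structure and collect the title, rating, time, and text of each review. Compared with managing everything through Selenium alone, this approach scrapes repeated elements (such as the list of reviews) in a cleaner, more structured way and offers more flexibility in handling the HTML.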

    try:
        for url in url_list:

            # Use Selenium to load the page
            driver.get(url)
            time.sleep(5)  # Give the page time to load

            # Scroll down the page
            scroll_down_page()

            # Extract single elements with Selenium
            def extract_element_text(selector, description):
                try:
                    # Wait for the element and extract text
                    element = WebDriverWait(driver, 5).until(
                        EC.visibility_of_element_located(
                            (By.CSS_SELECTOR, selector))
                    )
                    text = element.text.strip()
                    return text if text else None  # Return None if the text is empty
                except TimeoutException:
                    print(f"Timeout: Could not find {description}. Setting to None.")
                    return None
                except NoSuchElementException:
                    print(f"Element not found: {description}. Setting to None.")
                    return None

            # Extract single elements
            reviews_data = {}

            reviews_data["secondary_rating"] = extract_element_text("div[data-test='secondary-rating']",
                                                                    "secondary_rating")
            reviews_data["rating_count"] = extract_element_text(
                "div[data-test='rating-count']", "rating_count")
            reviews_data["rating_histogram"] = extract_element_text("div[data-test='rating-histogram']",
                                                                    "rating_histogram")
            reviews_data["percent_recommended"] = extract_element_text("div[data-test='percent-recommended']",
                                                                       "percent_recommended")
            reviews_data["total_recommendations"] = extract_element_text("div[data-test='total-recommendations']",
                                                                         "total_recommendations")

            # Extract reviews from 'reviews-list'
            scraped_reviews = []

            # Use Beautiful Soup to extract other content
            soup = BeautifulSoup(driver.page_source, 'html.parser')

            # Select all reviews in the list using BeautifulSoup
            reviews_list = soup.select("div[data-test='reviews-list'] > div")

            for review in reviews_list:
                # Create a dictionary to store each review's data
                ratings = {}

                # Extract title
                title_element = review.select_one(
                    "h4[data-test='review-card--title']")
                ratings['title'] = title_element.text.strip(
                ) if title_element else None

                # Extract rating
                rating_element = review.select_one("span[data-test='ratings']")
                ratings['rating'] = rating_element.text.strip(
                ) if rating_element else None

                # Extract time
                time_element = review.select_one(
                    "span[data-test='review-card--reviewTime']")
                ratings['time'] = time_element.text.strip(
                ) if time_element else None

                # Extract review text
                text_element = review.select_one(
                    "div[data-test='review-card--text']")
                ratings['text'] = text_element.text.strip(
                ) if text_element else None

                # Append each review to the list of reviews
                scraped_reviews.append(ratings)

            # Append the list of reviews to the main product data
            reviews_data["reviews"] = scraped_reviews

            # Append the overall data to the scraped_data list
            scraped_data.append(reviews_data)

        # Output the scraped data
        print(f"Scraped data: {scraped_data}")

    except Exception as e:
        print(f"Error: {e}")
    finally:
        # Ensure driver quits after scraping
        driver.quit()


if __name__ == "__main__":
    target_com_scraper()

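Using Proxies in Python Selenium: Complex Interactions with Headless Browsers

When scraping complex websites, especially ones with strong anti-bot measures such as Target.com, you often run into IP bans, rate limiting, or access restrictions. Handling these with Selenium gets complicated, particularly when you deploy a headless browser. A headless browser lets you interact with pages without a GUI, but managing proxies manually in that environment is challenging: you have to configure proxy settings, rotate IPs, and handle other interactions such as JavaScript rendering, all of which make scraping slower and more failure-prone.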
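To give a sense of what manual proxy management involves, here is a minimal sketch that cycles through a pool of proxies and builds a Selenium Wire options dictionary for each one. The proxy addresses and the helper names (`build_seleniumwire_options`, `rotating_options`) are hypothetical and exist only for illustration; the dictionary layout mirrors the one used in this tutorial's scraper.

```python
from itertools import cycle

# Hypothetical proxy endpoints, used only for illustration
PROXY_POOL = [
    "http://user:pass@proxy-1.example.com:8000",
    "http://user:pass@proxy-2.example.com:8000",
    "http://user:pass@proxy-3.example.com:8000",
]


def build_seleniumwire_options(proxy_url):
    """Return a Selenium Wire options dict that routes traffic through proxy_url."""
    return {
        "proxy": {
            "http": proxy_url,
            "https": proxy_url,
            "no_proxy": "localhost,127.0.0.1",
        }
    }


def rotating_options(pool):
    """Yield one options dict per call, cycling through the pool indefinitely."""
    for proxy_url in cycle(pool):
        yield build_seleniumwire_options(proxy_url)


options_iter = rotating_options(PROXY_POOL)
first = next(options_iter)
second = next(options_iter)
print(first["proxy"]["http"])
print(second["proxy"]["http"])
```

Each new driver would need its own options dictionary from the iterator, plus your own logic for detecting and retiring banned proxies, which is the bookkeeping a managed proxy service takes off your hands.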

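ScraperAPI, by contrast, simplifies this considerably by managing proxies automatically. Instead of configuring proxies manually in Selenium, ScraperAPI's proxy mode distributes your requests across multiple IP addresses, so scraping runs more smoothly without you having to worry about IP bans, rate limits, or geo-restrictions. This is especially useful with headless browsers, where handling dynamic content and complex site interactions already requires extra code.

Setting Up ScraperAPI with Selenium

Integrating ScraperAPI's proxy mode with Selenium is simplified by Selenium Wire, a tool that makes proxy configuration easy. Here's a quick setup:

  1. Sign up for ScraperAPI: Create an account and retrieve your API key.
  2. Install Selenium Wire: Replace standard Selenium with Selenium Wire by running pip install selenium-wire.
  3. Configure the proxy: Use ScraperAPI's proxy pool in your WebDriver settings to manage IP rotation with little effort.

Once integrated, this configuration gives you smoother interaction with dynamic pages, automatic IP rotation, and a way around rate limits, without the hassle of managing proxies manually in a headless browser environment.

The snippet below demonstrates how to configure ScraperAPI's proxy in Python: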

import os
import time

from bs4 import BeautifulSoup
from dotenv import load_dotenv
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# Load environment variables
load_dotenv()


def target_com_scraper():
    """
    SCRAPER SETTINGS

    - API_KEY: Your ScraperAPI key. Get your API Key ==> https://www.scraperapi.com/?fp_ref=eunit
    """

    API_KEY = os.getenv("API_KEY", "yourapikey")

    # ScraperAPI proxy settings (with HTTP and HTTPS variants)
    scraper_api_proxies = {
        'proxy': {
            'http': f'http://scraperapi:{API_KEY}@proxy-server.scraperapi.com:8001',
            'https': f'http://scraperapi:{API_KEY}@proxy-server.scraperapi.com:8001',
            'no_proxy': 'localhost,127.0.0.1'
        }
    }

    # URLs to scrape
    url_list = [
        "https://www.target.com/p/enclosed-cat-litter-box-xl-up-up/-/A-90310047?preselect=87059440#lnk=sametab",
    ]

    # Store scraped data
    scraped_data = []

    # Setup Selenium options with proxy
    options = Options()
    # options.add_argument("--headless")  # Uncomment for headless mode
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-in-process-stack-traces")
    options.add_argument("--window-size=1920,1080")
    options.add_argument("--log-level=3")
    options.add_argument("--disable-logging")
    options.add_argument("--start-maximized")

    # Initialize Selenium WebDriver
    driver = webdriver.Chrome(service=ChromeService(
        ChromeDriverManager().install()), options=options, seleniumwire_options=scraper_api_proxies)

    def scroll_down_page(distance=100, delay=0.2):
        """
        Scroll down the page gradually until the end.

        Args:
        - distance: Number of pixels to scroll by in each step.
        - delay: Time (in seconds) to wait between scrolls.
        """
        total_height = driver.execute_script(
            "return document.body.scrollHeight")
        scrolled_height = 0

        while scrolled_height < total_height:
            # Scroll down by 'distance' pixels
            driver.execute_script(f"window.scrollBy(0, {distance});")
            scrolled_height += distance
            time.sleep(delay)  # Pause between scrolls

            # Update the total page height after scrolling
            total_height = driver.execute_script(
                "return document.body.scrollHeight")

        print("Finished scrolling.")

    try:
        for url in url_list:

            # Use Selenium to load the page
            driver.get(url)
            time.sleep(5)  # Give the page time to load

            # Scroll down the page
            scroll_down_page()

            # Extract single elements with Selenium
            def extract_element_text(selector, description):
                try:
                    # Wait for the element and extract text
                    element = WebDriverWait(driver, 5).until(
                        EC.visibility_of_element_located(
                            (By.CSS_SELECTOR, selector))
                    )
                    text = element.text.strip()
                    return text if text else None  # Return None if the text is empty
                except TimeoutException:
                    print(f"Timeout: Could not find {description}. Setting to None.")
                    return None
                except NoSuchElementException:
                    print(f"Element not found: {description}. Setting to None.")
                    return None

            # Extract single elements
            reviews_data = {}

            reviews_data["secondary_rating"] = extract_element_text("div[data-test='secondary-rating']",
                                                                    "secondary_rating")
            reviews_data["rating_count"] = extract_element_text(
                "div[data-test='rating-count']", "rating_count")
            reviews_data["rating_histogram"] = extract_element_text("div[data-test='rating-histogram']",
                                                                    "rating_histogram")
            reviews_data["percent_recommended"] = extract_element_text("div[data-test='percent-recommended']",
                                                                       "percent_recommended")
            reviews_data["total_recommendations"] = extract_element_text("div[data-test='total-recommendations']",
                                                                         "total_recommendations")

            # Extract reviews from 'reviews-list'
            scraped_reviews = []

            # Use Beautiful Soup to extract other content
            soup = BeautifulSoup(driver.page_source, 'html.parser')

            # Select all reviews in the list using BeautifulSoup
            reviews_list = soup.select("div[data-test='reviews-list'] > div")

            for review in reviews_list:
                # Create a dictionary to store each review's data
                ratings = {}

                # Extract title
                title_element = review.select_one(
                    "h4[data-test='review-card--title']")
                ratings['title'] = title_element.text.strip(
                ) if title_element else None

                # Extract rating
                rating_element = review.select_one("span[data-test='ratings']")
                ratings['rating'] = rating_element.text.strip(
                ) if rating_element else None

                # Extract time
                time_element = review.select_one(
                    "span[data-test='review-card--reviewTime']")
                ratings['time'] = time_element.text.strip(
                ) if time_element else None

                # Extract review text
                text_element = review.select_one(
                    "div[data-test='review-card--text']")
                ratings['text'] = text_element.text.strip(
                ) if text_element else None

                # Append each review to the list of reviews
                scraped_reviews.append(ratings)

            # Append the list of reviews to the main product data
            reviews_data["reviews"] = scraped_reviews

            # Append the overall data to the scraped_data list
            scraped_data.append(reviews_data)

        # Output the scraped data
        print(f"Scraped data: {scraped_data}")

    except Exception as e:
        print(f"Error: {e}")
    finally:
        # Ensure driver quits after scraping
        driver.quit()


if __name__ == "__main__":
    target_com_scraper()

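With this setup, requests sent to the ScraperAPI proxy server are forwarded to Target.com, which hides your real IP and provides a strong defense against Target.com's anti-scraping mechanisms. You can also customize the proxy by including parameters such as render=true for JavaScript rendering, or by specifying a country code for geolocation.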
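As a rough sketch of how such parameters might be passed, the helper below appends them to the proxy username, separated by dots. This format is an assumption about ScraperAPI's proxy mode and should be checked against the ScraperAPI documentation; the function name `scraperapi_proxy_url` and the `country_code` parameter name are likewise illustrative and not taken from the code above.

```python
import os


def scraperapi_proxy_url(api_key, **params):
    """Build a proxy URL, appending each parameter to the username as .name=value.

    Assumption: proxy mode reads parameters from the username in this form.
    """
    username = "scraperapi"
    for name, value in params.items():
        username += f".{name}={value}"
    return f"http://{username}:{api_key}@proxy-server.scraperapi.com:8001"


API_KEY = os.getenv("API_KEY", "yourapikey")
proxy_url = scraperapi_proxy_url(API_KEY, render="true", country_code="us")

# Same dictionary layout as scraper_api_proxies in the scraper above
scraper_api_proxies = {
    "proxy": {
        "http": proxy_url,
        "https": proxy_url,
        "no_proxy": "localhost,127.0.0.1",
    }
}
```

The resulting dictionary can be passed to webdriver.Chrome through seleniumwire_options exactly as in the full scraper.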

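Scraping Review Data from Target.com

For each product URL, the Target Reviews Scraper built above returns one record containing secondary_rating, rating_count, rating_histogram, percent_recommended, total_recommendations, and a reviews list whose entries each hold a title, rating, time, and text.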
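As a rough sketch of how the scraper's output serializes to JSON, the example below builds one record with the same keys the scraper fills in and dumps it with the standard json module. Every value is a placeholder rather than real data from Target.com, and the helper name `to_json` is hypothetical.

```python
import json

# Placeholder values only; these are not real results from Target.com
sample_record = {
    "secondary_rating": "<overall rating text>",
    "rating_count": "<number of ratings text>",
    "rating_histogram": "<star breakdown text>",
    "percent_recommended": "<percent recommended text>",
    "total_recommendations": "<recommendation count text>",
    "reviews": [
        {
            "title": "<review title>",
            "rating": "<review rating text>",
            "time": "<review time text>",
            "text": "<review body>",
        }
    ],
}


def to_json(scraped_data):
    """Serialize the list of product records produced by the scraper."""
    return json.dumps(scraped_data, indent=2, ensure_ascii=False)


print(to_json([sample_record]))
```

In the scraper itself, the same call could replace the final print of scraped_data, or the string could be written to a file for later analysis.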

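How to Use Our Cloud Target.com Reviews Scraper

If you want Target.com reviews quickly without setting up an environment, knowing how to code, or configuring proxies, you can use our Target Scraper API to get the data you need for free. The Target Scraper API is hosted on the Apify platform and is ready to use with no setup.

Head over to Apify and click "Try for free" to get started right away.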

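Sentiment Analysis Using Target Reviews

Now that you have Target.com review and rating data, it's time to make sense of it. This data can provide valuable insight into how customers feel about a particular product or service. By analyzing the reviews, you can identify common praise and complaints, gauge customer satisfaction, predict future behavior, and turn the reviews into actionable insights.

As a marketing professional or business owner, you are always looking for ways to better understand your primary audience and improve your marketing and product strategies. Here are some ways to turn this data into actionable insights that optimize marketing efforts, improve product strategy, and increase customer engagement:

  • Refine product offerings: Identify common customer complaints or praise to fine-tune product features.
  • Improve customer service: Catch negative reviews early to resolve issues and maintain customer satisfaction.
  • Optimize marketing campaigns: Use insights from positive feedback to build personalized, targeted campaigns.

By using ScraperAPI to collect review data at scale, you can automate and scale your sentiment analysis, enabling better decision-making and growth.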
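To make the idea concrete, here is a minimal word-counting sketch that labels each scraped review as positive, negative, or neutral. The word lists, sample reviews, and function names are all hypothetical, and a real project would more likely use a dedicated sentiment library or model; the only thing taken from the scraper is the shape of the review dictionaries (a 'text' key that may be None).

```python
import re

# Tiny hypothetical word lists, for illustration only
POSITIVE = {"great", "love", "perfect", "easy", "sturdy", "recommend"}
NEGATIVE = {"broken", "flimsy", "smell", "difficult", "return", "disappointed"}


def sentiment_score(text):
    """Return the count of positive words minus the count of negative words."""
    words = re.findall(r"[a-z']+", (text or "").lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def label_reviews(reviews):
    """Return copies of the review dicts with an added 'sentiment' label."""
    labeled = []
    for review in reviews:
        score = sentiment_score(review.get("text"))
        if score > 0:
            label = "positive"
        elif score < 0:
            label = "negative"
        else:
            label = "neutral"
        labeled.append({**review, "sentiment": label})
    return labeled


# Made-up reviews in the same shape the scraper produces
sample_reviews = [
    {"title": "Hypothetical A", "text": "Great box, easy to clean. Love it!"},
    {"title": "Hypothetical B", "text": "Arrived broken and flimsy."},
    {"title": "Hypothetical C", "text": None},
]

labeled = label_reviews(sample_reviews)
for review in labeled:
    print(review["title"], "->", review["sentiment"])
```

Counting the labels per product gives a quick first view of which items draw the most complaints, before investing in a heavier model.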

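FAQs About Scraping Target Product Reviews

Is it legal to scrape Target.com product pages?

Yes, it is legal to scrape publicly available information from Target.com, such as product ratings and reviews. Keep in mind, however, that this public information may still contain personal details.

We wrote a blog post on the legal aspects and ethical considerations of web scraping. You can learn more there.

Does Target.com block scrapers?

Yes, Target.com implements a variety of anti-scraping measures to block automated scrapers. These include IP blocking, rate limiting, and CAPTCHA challenges, all designed to detect and stop excessive automated requests from scrapers or bots.

How do I avoid getting blocked by Target.com?

To avoid getting blocked by Target.com, slow down your requests, rotate user agents, use CAPTCHA-solving techniques, and avoid making repetitive or high-frequency requests. Combining these methods with proxies helps reduce the likelihood of detection.

Also consider using a dedicated scraper, such as the Target Scraper API or a scraping API, to get around these Target.com restrictions.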
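Two of these tactics, slowing down requests and rotating user agents, can be sketched in a few lines. The user-agent strings, default timings, and helper names below are hypothetical and chosen only for illustration; suitable values depend on your own testing.

```python
import random

# Hypothetical user-agent strings, for illustration only
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/2.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0",
]


def pick_user_agent(rng=random):
    """Choose a user agent at random from the pool."""
    return rng.choice(USER_AGENTS)


def polite_delay(base=3.0, jitter=2.0, rng=random):
    """Return a wait time in seconds: base plus a random extra of up to jitter."""
    return base + rng.uniform(0, jitter)


def plan_requests(urls, rng=random):
    """Pair each URL with a user agent and a randomized wait before the request."""
    return [
        {"url": url, "user_agent": pick_user_agent(rng), "wait": polite_delay(rng=rng)}
        for url in urls
    ]


for step in plan_requests(["https://www.target.com/"]):
    print(step)
```

With Selenium, the chosen string can be applied when creating the driver through options.add_argument(f"--user-agent={user_agent}"), and the wait can be passed to time.sleep before each driver.get call.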

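Do I need to use proxies to scrape Target.com?

Yes, using proxies is essential for scraping Target.com effectively. Proxies help distribute requests across multiple IP addresses, minimizing the chance of being blocked. ScraperAPI proxies hide your IP, making it harder for anti-scraping systems to detect your activity.

Wrapping Up

In this article, you learned how to build a Target.com ratings and reviews scraper with Python and Selenium, and how to use ScraperAPI to bypass Target.com's anti-scraping mechanisms, avoid IP bans, and improve scraping performance.

With this tool, you can collect valuable customer feedback efficiently and reliably.

Once you've collected this data, the next step is to use sentiment analysis to uncover key insights. By analyzing customer reviews, you as a business can identify product strengths, address pain points, and optimize your marketing strategy to better meet customer needs.

By using the Target Scraper API for large-scale data collection, you can continuously monitor reviews and stay ahead in understanding customer sentiment, allowing you to refine product development and create more targeted marketing campaigns.

Try ScraperAPI today for seamless large-scale data extraction, or use our Cloud Target.com Reviews Scraper!

For more tutorials and great content, follow me on Twitter (X) @eunit99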
