
Crawling Pages with Infinite Scroll using Scrapy and Playwright


When scraping websites with Scrapy, you'll quickly run into all sorts of scenarios that require you to get creative or interact with the page that you're trying to scrape. One of these scenarios is when you need to crawl an infinite scroll page. These pages load more content as you scroll down, like a social media feed.

There is definitely more than one way to crawl these types of pages. One way I recently approached this was to keep scrolling until the page length stops increasing (i.e. scroll to the bottom). This post walks through that process step by step.

This post assumes that you have a Scrapy project set up and running, along with a spider that you can modify and run.

Using Playwright with Scrapy

This integration uses the scrapy-playwright plugin to integrate Playwright for Python with Scrapy. Playwright is a headless browser automation library used to interact with web pages and extract data.

I have been using uv for Python package installation and management.

Then, I use the virtual environment directly from uv:

uv venv 
source .venv/bin/activate

Install the scrapy-playwright plugin and Playwright into your virtual environment with the following command:

uv pip install scrapy-playwright

Install the browser that you want to use with Playwright. For example, to install Chromium, you can run the following command:

playwright install chromium

You can also install other browsers, such as Firefox, if needed.

Note: The Scrapy code and Playwright integration below have only been tested with Chromium.

Update the settings.py file, or the custom_settings attribute in the spider, to include the TWISTED_REACTOR, DOWNLOAD_HANDLERS, and PLAYWRIGHT_LAUNCH_OPTIONS settings.

# settings.py
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

PLAYWRIGHT_LAUNCH_OPTIONS = {
    # optional for CORS issues
    "args": [
        "--disable-web-security",
        "--disable-features=IsolateOrigins,site-per-process",
    ],
    # optional for debugging
    "headless": False,
}

In PLAYWRIGHT_LAUNCH_OPTIONS, you can set the headless option to False to open a browser instance and watch the process run. This is good for debugging and for building out the initial scraper.

Handling CORS Issues

I pass in the additional args to disable web security and origin isolation. This is useful when you are crawling sites that have CORS issues.

For example, there may be situations where required JavaScript assets are not loaded or network requests are not made because of CORS. If certain page actions (like clicking a button) are not working as expected while everything else is, you can isolate this more quickly by checking the browser console for errors; a sketch of surfacing those errors from the spider follows the snippet below.

"PLAYWRIGHT_LAUNCH_OPTIONS": {
    "args": [
        "--disable-web-security",
        "--disable-features=IsolateOrigins,site-per-process",
    ],
    "headless": False,
}
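
If you want to see those browser-side errors without opening devtools, one option (a minimal sketch, not part of the original setup) is to attach console and pageerror listeners to the Playwright page inside the parse callback:

# sketch: surface browser console messages and page errors in the spider output
# while debugging with headless=False; attach these right after retrieving the page
page = response.meta["playwright_page"]
page.on("console", lambda msg: print(f"browser console [{msg.type}]: {msg.text}"))
page.on("pageerror", lambda exc: print(f"page error: {exc}"))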

Crawling the Infinite Scroll Page

Below is an example of a spider that crawls an infinite scroll page. The spider scrolls the page by 700 pixels and waits 750 ms for the request to complete. It keeps scrolling until it reaches the bottom of the page, indicated by the scroll position no longer changing between loop iterations.

I'm modifying the settings in the spider itself with custom_settings to keep the settings in one place. You could also add these settings to the settings.py file.

# /<project>/spiders/infinite_scroll.py

import scrapy

from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector


class InfinitePageSpider(CrawlSpider):
    """
    Spider to crawl an infinite scroll page
    """

    name = "infinite_scroll"

    allowed_domains = ["<allowed_domain>"]
    start_urls = ["<start_url>"]

    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "args": [
                "--disable-web-security",
                "--disable-features=IsolateOrigins,site-per-process",
            ],
            "headless": False,
        },
        "LOG_LEVEL": "INFO",
    }

    def start_requests(self):
        yield scrapy.Request(
            url=f"{self.start_urls[0]}",
            meta=dict(
                playwright=True,
                playwright_include_page=True,
            ),
            callback=self.parse,
        )


    async def parse(
        self,
        response,
    ):
        page = response.meta["playwright_page"]
        page.set_default_timeout(10000)

        await page.wait_for_timeout(5000)
        try:
            last_position = await page.evaluate("window.scrollY")

            while True:
                # scroll by 700 while not at the bottom
                await page.evaluate("window.scrollBy(0, 700)")
                await page.wait_for_timeout(750)  # wait for 750ms for the request to complete
                current_position = await page.evaluate("window.scrollY")

                if current_position == last_position:
                    print("Reached the bottom of the page.")
                    break

                last_position = current_position

        except Exception as error:
            print(f"Error: {error}")

        print("Getting content")
        content = await page.content()

        # close the Playwright page now that we have the rendered content
        await page.close()

        print("Parsing content")
        selector = Selector(text=content)

        print("Extracting links")
        links = selector.xpath("//a[contains(@href, '/<link-pattern>/')]//@href").getall()

        print(f"Found {len(links)} links...")

        print("Yielding links")

        for link in links:
            yield {"link": link}


One thing I've learned is that no two pages or sites are the same, so you may need to adjust the scroll amount and wait time to account for the page and for any latency in the network round-trips needed for the requests to complete. You can adjust this dynamically in code by checking the scroll position and how long it takes for the content to load.
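
As a minimal sketch of that idea (the helper name and thresholds are illustrative, and it assumes the page height grows when new content loads), you could poll the scroll height after each scroll instead of sleeping a fixed 750 ms:

async def scroll_and_wait(page, step=700, poll_ms=250, max_wait_ms=3000):
    """Scroll down by `step` pixels, then wait until the page height changes or a timeout is hit."""
    previous_height = await page.evaluate("document.body.scrollHeight")
    await page.evaluate(f"window.scrollBy(0, {step})")

    waited = 0
    while waited < max_wait_ms:
        await page.wait_for_timeout(poll_ms)
        waited += poll_ms
        current_height = await page.evaluate("document.body.scrollHeight")
        if current_height != previous_height:
            # new content arrived and the page grew, so stop waiting early
            break

    return await page.evaluate("window.scrollY")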

On page load, I also wait a bit longer for the assets to load and the page to render. The Playwright page is passed to the parse callback method in the response.meta object. It is used to interact with the page and scroll it. This is specified in the scrapy.Request arguments with the playwright=True and playwright_include_page=True options.

def start_requests(self):
    yield scrapy.Request(
        url=f"{self.start_urls[0]}",
        meta=dict(
            playwright=True,
            playwright_include_page=True,
        ),
        callback=self.parse,
    )

The spider scrolls the page by 700 pixels using page.evaluate and the scrollBy() JavaScript method, then waits 750 ms for the request to complete. The Playwright page content is then copied into a Scrapy selector and the links are extracted from the page. The links are then yielded to the Scrapy pipeline to continue processing.
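
As a quick illustration of that last step (this pipeline is hypothetical and not from the original post), a simple item pipeline could deduplicate the yielded {"link": ...} items before further processing:

# pipelines.py (illustrative)
from scrapy.exceptions import DropItem


class DedupeLinkPipeline:
    """Drop duplicate links yielded by the spider."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        link = item.get("link")
        if link in self.seen:
            raise DropItem(f"Duplicate link: {link}")
        self.seen.add(link)
        return item

You would enable it with, for example, ITEM_PIPELINES = {"<project>.pipelines.DedupeLinkPipeline": 300} in settings.py or custom_settings.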

For cases where the page requests start loading duplicate content, you can add a check to see if the content has already been loaded and break out of the loop. Or, if you know roughly how many scroll loads there are, you can add a counter to break out of the loop after a certain number of scrolls, plus or minus a buffer.
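
Here is a minimal sketch of both ideas inside the scroll loop (the max_scrolls value and the XPath are illustrative, and it assumes the links are what you are checking for duplicates):

seen_links = set()
max_scrolls = 50  # known number of scroll loads plus a small buffer

for scroll_count in range(max_scrolls):
    await page.evaluate("window.scrollBy(0, 700)")
    await page.wait_for_timeout(750)

    content = await page.content()
    links = set(Selector(text=content).xpath("//a/@href").getall())

    if links <= seen_links:
        # no new links since the last scroll, so the content is repeating
        print("No new content loaded; stopping.")
        break

    seen_links |= links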

Infinite Scroll with an Element Click

It's also possible that the page may have an element that you can scroll to (e.g. a "Load more" button) that will trigger the next set of content to load. You can use a Playwright locator to find the element, scroll it into view, and click it to load the next set of content.

...

try:
    while True:
        button = page.locator('//button[contains(., "Load more")]')

        # locator() always returns a Locator object, so check the match
        # count to see whether the button actually exists on the page
        if await button.count() == 0:
            print("No 'Load more' button found.")
            break

        # wait for the button to be visible before interacting with it
        await button.wait_for()

        is_disabled = await button.is_disabled()
        if is_disabled:
            print("Button is disabled.")
            break

        await button.scroll_into_view_if_needed()
        await button.click()
        await page.wait_for_timeout(750)

except Exception as error:
    print(f"Error: {error}")

...

This method is useful when you know the page has a button that will load the next set of content. You can also use this method to click on other elements that will trigger the next set of content to load. The scroll_into_view_if_needed method will scroll the button or element into view if it is not already visible on the page. This is one of those scenarios when you will want to double-check the page actions with headless=False to see if the button is being clicked and the content is being loaded as expected before running a full crawl.

Note: As mentioned above, confirm that the page assets (.js) are loading correctly and that the network requests are being made so that the button (or element) is mounted and clickable.

Wrapping Up

Web crawling is a case-by-case scenario and you will need to adjust the code to fit the page that you are trying to scrape. The above code is a starting point to get you going with crawling infinite scroll pages with Scrapy and Playwright.

Hopefully, this helps to get you unblocked!

Subscribe to get my latest content by email -> Newsletter
