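Disclaimer: I have built an API for this specific use case at https://www.scrapewebapp.com/. If you just want to get it done quickly, use that; otherwise, read on.
Let's work through an example: suppose I want to scrape my own API key from my account at https://www.scrapewebapp.com/. The key is shown on this page: https://app.scrapewebapp.com/account/api_key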
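First, you need to find the login page. Most websites answer with a redirect (such as a 303 or 302) when you try to access a page that sits behind a login, so if you try to scrape https://app.scrapewebapp.com/account/api_key directly, you will automatically land on the login page https://app.scrapewebapp.com/login. This makes it a good way to find the login page automatically when it has not been given to you.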
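The redirect trick can be sketched as follows. This is a minimal illustration using only the standard library, not the exact code behind the product; the helper name find_login_url is hypothetical. It assumes the site answers with a redirect rather than a 401/403 error page. The same idea works with the requests library by comparing response.url with the URL you asked for.

```python
import urllib.request
from typing import Optional


def find_login_url(protected_url: str, timeout: float = 10.0) -> Optional[str]:
    """Request a protected page and report where we were redirected to.

    Returns the final URL if it differs from the requested one (most likely
    the login page), or None if no redirect happened.
    """
    # urlopen follows redirects such as 302 and 303 automatically.
    # Sites that answer 401/403 instead will raise urllib.error.HTTPError.
    with urllib.request.urlopen(protected_url, timeout=timeout) as response:
        final_url = response.geturl()
    return final_url if final_url != protected_url else None
```

Calling find_login_url("https://app.scrapewebapp.com/account/api_key") without being logged in should return the login URL if the site behaves as described above.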
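OK, now that we have the login page, we need to find where to enter the username or email and the password, and where the actual login button is. The best way is to write a simple script that looks for inputs of type "email", named "username", or of type "password", and for a button of type "submit", then builds a CSS selector for each (the element's ID when it has one). I wrote the code for you below: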
    from bs4 import BeautifulSoup


    def extract_login_form(html_content: str):
        """
        Extracts the login form elements from the given HTML content and returns their CSS selectors.
        """
        soup = BeautifulSoup(html_content, "html.parser")

        # Finding the username/email field
        username_email = (
            soup.find("input", {"type": "email"})
            or soup.find("input", {"name": "username"})
            or soup.find("input", {"type": "text"})
        )  # Fallback to input type text if no email type is found

        # Finding the password field
        password = soup.find("input", {"type": "password"})

        # Finding the login button
        # Searching for buttons/input of type submit closest to the password or username field
        login_button = None

        # First try to find a submit button within the same form
        if password:
            form = password.find_parent("form")
            if form:
                login_button = form.find("button", {"type": "submit"}) or form.find(
                    "input", {"type": "submit"}
                )

        # If no button is found in the form, fall back to finding any submit button
        if not login_button:
            login_button = soup.find("button", {"type": "submit"}) or soup.find(
                "input", {"type": "submit"}
            )

        # Extracting CSS selectors
        def generate_css_selector(element, element_type):
            if "id" in element.attrs:
                return f"#{element['id']}"
            elif "type" in element.attrs:
                return f"{element_type}[type='{element['type']}']"
            else:
                return element_type

        # Generate CSS selectors with the updated logic
        username_email_css_selector = None
        if username_email:
            username_email_css_selector = generate_css_selector(username_email, "input")

        password_css_selector = None
        if password:
            password_css_selector = generate_css_selector(password, "input")

        login_button_css_selector = None
        if login_button:
            login_button_css_selector = generate_css_selector(
                login_button, "button" if login_button.name == "button" else "input"
            )

        return username_email_css_selector, password_css_selector, login_button_css_selector


    def main(html_content: str):
        # Call the extract_login_form function and return its result
        return extract_login_form(html_content)
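2. Use Selenium to actually log in
Now you need to create a Selenium WebDriver. We will use headless Chrome so that it can run from Python. Here is how to install it (the lines starting with ! are shell commands run from a notebook such as Google Colab):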
    # Install selenium and chromium
    !pip install selenium
    !apt-get update
    !apt install chromium-chromedriver
    !cp /usr/lib/chromium-browser/chromedriver /usr/bin

    import sys
    sys.path.insert(0, '/usr/lib/chromium-browser/chromedriver')
Then actually log in to the website and save the cookies. We will save all of the cookies, but you could save only the authentication cookies if you prefer.
    # Imports
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import requests
    import time

    # Set up Chrome options
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')

    # Initialize the WebDriver
    driver = webdriver.Chrome(options=chrome_options)

    try:
        # Open the login page
        driver.get("https://app.scrapewebapp.com/login")

        # Find the email input field by ID and input your email
        email_input = driver.find_element(By.ID, "email")
        email_input.send_keys("******@gmail.com")

        # Find the password input field by ID and input your password
        password_input = driver.find_element(By.ID, "password")
        password_input.send_keys("*******")

        # Find the login button and submit the form
        login_button = driver.find_element(By.CSS_SELECTOR, "button[type='submit']")
        login_button.click()

        # Wait for the login process to complete
        time.sleep(5)  # Adjust this depending on your site's response time

        # Read the cookies here (driver.get_cookies()), while the browser is
        # still open; once driver.quit() has run they are no longer available.
    finally:
        # Close the browser
        driver.quit()
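A fixed five-second sleep is either too long or too short depending on the site. One alternative, sketched here with only the standard library, is to poll until some condition says the login has finished. The helper name wait_until and its default values are assumptions for illustration; Selenium also ships its own WebDriverWait for the same purpose.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout: float = 10.0,
               poll: float = 0.25) -> bool:
    """Call condition() repeatedly until it returns True or the timeout expires.

    Returns True if the condition was met in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll)
    # One last check so a condition met right at the deadline still counts.
    return condition()
```

In the login script, time.sleep(5) could then become something like wait_until(lambda: driver.current_url != "https://app.scrapewebapp.com/login"), assuming the site navigates away from the login page after a successful login.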
Saving them is as simple as calling driver.get_cookies() and storing the result in a dictionary.
    def save_cookies(driver):
        """Save cookies from the Selenium WebDriver into a dictionary."""
        cookies = driver.get_cookies()
        cookie_dict = {}
        for cookie in cookies:
            cookie_dict[cookie['name']] = cookie['value']
        return cookie_dict
    # Save cookies from the WebDriver.
    # This must run while the browser is still open, i.e. before driver.quit().
    cookies = save_cookies(driver)
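If you only want the authentication cookies rather than all of them, a small filter over the list returned by driver.get_cookies() is enough. This is a sketch: the function name select_cookies and the cookie names in the sample ("session", "_ga") are made up for illustration, so check your browser's developer tools for the names your site actually uses.

```python
from typing import Dict, Iterable, List


def select_cookies(cookies: List[dict], wanted_names: Iterable[str]) -> Dict[str, str]:
    """Keep only the cookies whose name is in wanted_names.

    `cookies` has the shape returned by Selenium's driver.get_cookies():
    a list of dicts that each contain at least 'name' and 'value'.
    """
    wanted = set(wanted_names)
    return {c["name"]: c["value"] for c in cookies if c["name"] in wanted}


# Hypothetical sample data in the same shape as driver.get_cookies()
sample_cookies = [
    {"name": "session", "value": "abc123", "domain": "app.scrapewebapp.com"},
    {"name": "_ga", "value": "GA1.2.3", "domain": ".scrapewebapp.com"},
]
auth_cookies = select_cookies(sample_cookies, ["session"])
```

The resulting dictionary has the same name-to-value shape as the one produced by save_cookies, so it can be passed on in the same way.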
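In this part we will use the simple requests library, but you could also keep using Selenium.
Now we want to get the actual API key from this page: https://app.scrapewebapp.com/account/api_key.
So we create a session from the requests library and add each cookie to it. Then we request the URL and print the response text.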
    def scrape_api_key(cookies):
        """Use cookies to scrape the /account/api_key page."""
        url = 'https://app.scrapewebapp.com/account/api_key'

        # Set up the session to persist cookies
        session = requests.Session()

        # Add cookies from Selenium to the requests session
        for name, value in cookies.items():
            session.cookies.set(name, value)

        # Make the request to the /account/api_key page
        response = session.get(url)

        # Check if the request is successful
        if response.status_code == 200:
            print("API Key page content:")
            print(response.text)  # Print the page content (could contain the API key)
        else:
            print(f"Failed to retrieve API key page, status code: {response.status_code}")
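One caveat: requests follows redirects by default, so if the cookies were rejected you may still see status 200, just for the login page instead of the API key page. A rough way to catch that is to look at the final URL (response.url in requests). The helper below is a sketch with a hypothetical name, and it assumes the login page lives at /login as in this example.

```python
from urllib.parse import urlsplit


def landed_on_login(final_url: str, login_path: str = "/login") -> bool:
    """Return True if the final URL of a request is the login page.

    Only the path is compared, so query strings such as ?next=... are ignored.
    """
    return urlsplit(final_url).path.rstrip("/") == login_path.rstrip("/")
```

Inside scrape_api_key you could check landed_on_login(response.url) before trusting the page content.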
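We now have the text of the page we wanted, but it contains a lot of data we don't care about. We only want the api_key.
The best and simplest way is to use an AI model such as ChatGPT (the GPT-4o model).
Prompt the model like this: "You are an expert scraper and you will only extract the information asked from the context. I need the api-key value from {context}"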
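Here is a rough sketch of how the prompt could be filled in before it is sent to the model. It strips tags, scripts, and styles from the page so that less irrelevant markup ends up in {context}. The names build_prompt and max_chars are assumptions for illustration, and the actual call to the model is left out on purpose.

```python
from html.parser import HTMLParser

PROMPT_TEMPLATE = (
    "You are an expert scraper and you will only extract the information "
    "asked from the context. I need the api-key value from {context}"
)


class _TextExtractor(HTMLParser):
    """Collect the visible text of a page, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def build_prompt(html: str, max_chars: int = 8000) -> str:
    """Turn raw HTML into the prompt text, truncated to max_chars of context."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    context = " ".join(parser.chunks)[:max_chars]
    return PROMPT_TEMPLATE.format(context=context)
```

The returned string is what you would pass to the model as the user message; the max_chars cut-off is only a crude guard against very long pages.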
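If you want a simple and reliable API that does all of this for you, try my new product: https://www.scrapewebapp.com/
If you liked this article, please give it a clap and follow me. It really helps a lot!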