Disclaimer: I have built an API for this exact use case at https://www.scrapewebapp.com/. So if you want to get it done quickly, use that; otherwise, read on.
Let's work through an example: say I want to scrape my own API key from my account on https://www.scrapewebapp.com/, which lives on this page: https://app.scrapewebapp.com/account/api_key
First, you need to find the login page. Most websites respond with a redirect (a 3xx status such as 303) when you try to access a page that sits behind a login, so if you request https://app.scrapewebapp.com/account/api_key directly, you will automatically land on the login page at https://app.scrapewebapp.com/login. This makes redirects a handy way to discover the login page automatically when it hasn't been provided.
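That redirect trick can be sketched as follows. This is a minimal example under my own assumptions: the helper names and the "login appears in the URL" heuristic are mine, not part of the original article.

```python
import requests


def find_login_url(protected_url):
    """Request a login-protected URL and, if the site redirects us,
    return the final URL (usually the login page). Returns None otherwise."""
    response = requests.get(protected_url, allow_redirects=True, timeout=10)
    # response.history holds the intermediate 3xx responses, if any
    if response.history:
        return response.url
    return None


def looks_like_login_page(url):
    """A crude heuristic: redirect targets usually contain 'login' or 'signin'."""
    lowered = url.lower()
    return "login" in lowered or "signin" in lowered
```

You would call `find_login_url("https://app.scrapewebapp.com/account/api_key")` and check the result with `looks_like_login_page` before moving on to form detection.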
OK, now that we have the login page, we need to find where to enter the username or email, where to enter the password, and the actual login button. The best approach is a simple script that looks for input fields of type "email", "username", or "password", and for a button of type "submit". I've written the code for you below:
```python
from bs4 import BeautifulSoup


def extract_login_form(html_content: str):
    """
    Extracts the login form elements from the given HTML content
    and returns their CSS selectors.
    """
    soup = BeautifulSoup(html_content, "html.parser")

    # Finding the username/email field
    username_email = (
        soup.find("input", {"type": "email"})
        or soup.find("input", {"name": "username"})
        or soup.find("input", {"type": "text"})
    )  # Fallback to input type text if no email type is found

    # Finding the password field
    password = soup.find("input", {"type": "password"})

    # Finding the login button:
    # search for buttons/inputs of type submit closest to the password field
    login_button = None

    # First try to find a submit button within the same form
    if password:
        form = password.find_parent("form")
        if form:
            login_button = form.find("button", {"type": "submit"}) or form.find(
                "input", {"type": "submit"}
            )
    # If no button is found in the form, fall back to finding any submit button
    if not login_button:
        login_button = soup.find("button", {"type": "submit"}) or soup.find(
            "input", {"type": "submit"}
        )

    # Extracting CSS selectors
    def generate_css_selector(element, element_type):
        if "id" in element.attrs:
            return f"#{element['id']}"
        elif "type" in element.attrs:
            return f"{element_type}[type='{element['type']}']"
        else:
            return element_type

    # Generate CSS selectors with the updated logic
    username_email_css_selector = None
    if username_email:
        username_email_css_selector = generate_css_selector(username_email, "input")

    password_css_selector = None
    if password:
        password_css_selector = generate_css_selector(password, "input")

    login_button_css_selector = None
    if login_button:
        login_button_css_selector = generate_css_selector(
            login_button, "button" if login_button.name == "button" else "input"
        )

    return username_email_css_selector, password_css_selector, login_button_css_selector


def main(html_content: str):
    # Call the extract_login_form function and return its result
    return extract_login_form(html_content)
```
2. Actually log in with Selenium
Now you need to create a Selenium WebDriver. We'll run headless Chrome through Python. Here's how to install it:
```python
# Install selenium and chromium (Google Colab / notebook style)
!pip install selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

import sys
sys.path.insert(0, '/usr/lib/chromium-browser/chromedriver')
```
Then we actually log in to our site and save the cookies. Here we save all cookies, but you could keep only the authentication cookies if that's all you need.
```python
# Imports
from selenium import webdriver
from selenium.webdriver.common.by import By
import requests
import time

# Set up Chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Initialize the WebDriver
driver = webdriver.Chrome(options=chrome_options)

# Open the login page
driver.get("https://app.scrapewebapp.com/login")

# Find the email input field by ID and input your email
email_input = driver.find_element(By.ID, "email")
email_input.send_keys("******@gmail.com")

# Find the password input field by ID and input your password
password_input = driver.find_element(By.ID, "password")
password_input.send_keys("*******")

# Find the login button and submit the form
login_button = driver.find_element(By.CSS_SELECTOR, "button[type='submit']")
login_button.click()

# Wait for the login process to complete
time.sleep(5)  # Adjust this depending on your site's response time

# Note: don't call driver.quit() yet -- we still need the live driver
# to read the session cookies in the next step.
```
Saving them into a dictionary is as simple as calling the driver.get_cookies() function.
```python
def save_cookies(driver):
    """Save cookies from the Selenium WebDriver into a dictionary."""
    cookies = driver.get_cookies()
    cookie_dict = {}
    for cookie in cookies:
        cookie_dict[cookie['name']] = cookie['value']
    return cookie_dict
```
Save the cookies from the WebDriver, then close the browser:

```python
cookies = save_cookies(driver)
driver.quit()
```
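What save_cookies does can be checked without a browser: Selenium's get_cookies() returns a list of dictionaries, and we reduce it to name-to-value pairs. The sample data below is invented for illustration.

```python
def cookies_to_dict(cookie_list):
    """Reduce Selenium's list-of-dicts cookie format to {name: value}."""
    return {cookie["name"]: cookie["value"] for cookie in cookie_list}


# Hypothetical sample of what driver.get_cookies() might return
sample = [
    {"name": "session_id", "value": "abc123", "domain": "app.scrapewebapp.com"},
    {"name": "csrf_token", "value": "xyz789", "domain": "app.scrapewebapp.com"},
]

print(cookies_to_dict(sample))
# → {'session_id': 'abc123', 'csrf_token': 'xyz789'}
```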
In this part we'll use the simple requests library, but you could keep using Selenium instead.
Now we want to get the actual API key from this page: https://app.scrapewebapp.com/account/api_key.
So we create a session with the requests library, add each cookie to it, then request the URL and print the response text.
```python
import requests


def scrape_api_key(cookies):
    """Use cookies to scrape the /account/api_key page."""
    url = 'https://app.scrapewebapp.com/account/api_key'

    # Set up the session to persist cookies
    session = requests.Session()

    # Add cookies from Selenium to the requests session
    for name, value in cookies.items():
        session.cookies.set(name, value)

    # Make the request to the /account/api_key page
    response = session.get(url)

    # Check if the request is successful
    if response.status_code == 200:
        print("API Key page content:")
        print(response.text)  # Print the page content (could contain the API key)
    else:
        print(f"Failed to retrieve API key page, status code: {response.status_code}")
```
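The cookie-seeding part of scrape_api_key can be verified without touching the network: once set, the cookies live on session.cookies and are sent with every request made from that session. The cookie names and values below are invented for the demo.

```python
import requests

# Hypothetical cookies, as returned by save_cookies(driver)
saved_cookies = {"session_id": "abc123", "csrf_token": "xyz789"}

session = requests.Session()
for name, value in saved_cookies.items():
    session.cookies.set(name, value)

# The session now carries the cookies for every subsequent request
print(session.cookies.get("session_id"))
# → abc123
```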
We got the page text we wanted, but it contains a lot of data we don't care about. We only want the api_key.
The best and easiest way to extract it is to use an AI model such as ChatGPT (the GPT-4o model).
Prompt the model like this: "You are an expert scraper and you only extract the information asked from the context. I need the value of the api-key from {context}."
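A minimal sketch of that step, using the prompt wording from the article. The message layout is my own; the actual API call is shown only in comments because it needs a real OPENAI_API_KEY and incurs cost.

```python
def build_extraction_messages(context):
    """Build the chat messages for extracting the api_key from scraped page text."""
    system = ("You are an expert scraper and you only extract the information "
              "asked from the context.")
    user = f"I need the value of the api-key from {context}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


# With the official openai client, the call would look like this (not executed here):
#   from openai import OpenAI
#   client = OpenAI()  # reads OPENAI_API_KEY from the environment
#   response = client.chat.completions.create(
#       model="gpt-4o",
#       messages=build_extraction_messages(page_text),
#   )
#   api_key = response.choices[0].message.content
```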
If you want a simple and reliable API that does all of this, try my new product at https://www.scrapewebapp.com/
If you enjoyed this article, give me a clap and follow me. It really helps!