Upload a CSV file of URLs from an HTML page and read the URLs to scrape with Flask

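I need to build a web-based system that accepts an uploaded CSV file containing a list of URLs. After the upload, the system reads the URLs row by row and passes them to the next step, scraping. The scraper has to log in to the website before it can scrape, and I already have the source code for the login. The problem is that I want to connect the HTML page named "upload_page.html" to the Flask file named "upload_csv.py". Where in the Flask file should the login and scraping code go?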

upload_page.html

<div class="upload">
    <h2>Upload a CSV file</h2>
    <form action="/upload" method="post" enctype="multipart/form-data">
        <input type="file" name="file" accept=".csv">
        <br>
        <br>
        <button type="submit">Upload</button>
    </form>
</div>

upload_csv.py

from flask import Flask, request, render_template
import pandas as pd
import requests
from bs4 import BeautifulSoup
import csv
import json
import time
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('upload_page.html')

# Code to log in to the website


@app.route('/upload', methods=['POST'])
def upload():
    # Read the uploaded file
    csv_file = request.files['file']
    # Load the CSV data into a DataFrame
    df = pd.read_csv(csv_file)
    final_data = []
    # Loop over the rows in the DataFrame and scrape each link
    for index, row in df.iterrows():
        link = row['Link']
        response = requests.get(link)
        soup = BeautifulSoup(response.content, 'html.parser')
        start = time.time()
        # will be used in the while loop
        initialScroll = 0
        finalScroll = 1000

        while True:
            driver.execute_script(f"window.scrollTo({initialScroll},{finalScroll})")
            # this command scrolls the window starting from the pixel value stored in the initialScroll
            # variable to the pixel value stored at the finalScroll variable
            initialScroll = finalScroll
            finalScroll += 1000

            # we will stop the script for 2 seconds so that the data can load
            time.sleep(2)
            end = time.time()
            # We will scroll for 20 seconds.
            if round(end - start) > 20:
                break

        src = driver.page_source
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # print(soup.prettify())

        # Code to scrape the website

    return render_template('index.html', message='Scraped all data')


if __name__ == '__main__':
    app.run(debug=True)

Is my login and scraping code in the right place? As it stands the code does not work: after I click the Upload button, nothing is processed.

P粉799885311 · 408 days ago · 523

All replies (1)

  • P粉207969787 · 2023-09-08 00:06:27

    csv_file = request.files['file']
    # Load the CSV data into a DataFrame
    df = pd.read_csv(csv_file)
    final_data = []
    # Initialize the web driver
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")
    driver = webdriver.Chrome(options=chrome_options)
    # Log in once, before the loop, so the same session is reused for every link
    # Replace this with your own login code
    driver.get("https://example.com/login")
    # Selenium 4 removed find_element_by_name; use find_element(By.NAME, ...)
    username_field = driver.find_element(By.NAME, "username")
    password_field = driver.find_element(By.NAME, "password")
    username_field.send_keys("myusername")
    password_field.send_keys("mypassword")
    password_field.send_keys(Keys.RETURN)
    # Wait for the login to complete
    WebDriverWait(driver, 10).until(EC.url_changes("https://example.com/login"))
    # Loop over the rows in the DataFrame and scrape each link
    for index, row in df.iterrows():
        link = row['Link']
        # Scrape the website
        driver.get(link)
        start = time.time()
        # Vertical scroll position in pixels, advanced on every pass
        scroll_y = 0

        while True:
            scroll_y += 1000
            # window.scrollTo(x, y) takes a horizontal and a vertical offset,
            # so x stays at 0 and only the vertical position moves
            driver.execute_script(f"window.scrollTo(0, {scroll_y})")

            # Pause for 2 seconds so that the data can load
            time.sleep(2)
            end = time.time()
            # Stop scrolling after 20 seconds
            if round(end - start) > 20:
                break

        # Parse the fully scrolled page; your scraping code goes here
        soup = BeautifulSoup(driver.page_source, 'html.parser')

    # Close the browser and return a response to the client
    driver.quit()
    return render_template('index.html', message='Scraped all data')
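
One way to keep the upload view short is to move the CSV reading and the timed scrolling into small helpers. The following is only a minimal sketch under stated assumptions: the names read_links and scroll_for, and the clock and sleep parameters, are hypothetical and not part of the original code. It assumes the CSV has a "Link" column, as in the question, and that the driver object exposes execute_script the way a Selenium WebDriver does.

```python
import csv
import io
import time


def read_links(uploaded_file, column="Link"):
    """Return the non-empty values of `column` from an uploaded CSV.

    `uploaded_file` is any binary file-like object with .read(),
    for example request.files['file'] in a Flask view.
    (Hypothetical helper, not part of the original code.)
    """
    # utf-8-sig drops the byte order mark that some spreadsheet exports add
    text = uploaded_file.read().decode("utf-8-sig")
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"CSV has no '{column}' column")
    # Skip rows whose link cell is missing or blank
    return [row[column].strip() for row in reader if (row[column] or "").strip()]


def scroll_for(driver, seconds=20, step=1000, pause=2,
               clock=time.monotonic, sleep=time.sleep):
    """Scroll down `step` pixels at a time until `seconds` have elapsed.

    Returns the number of scroll steps performed. `clock` and `sleep`
    are injectable only so the loop can be exercised without waiting.
    (Hypothetical helper, not part of the original code.)
    """
    deadline = clock() + seconds
    position = 0
    steps = 0
    while True:
        position += step
        # scrollTo(x, y): keep x at 0 and move only the vertical offset
        driver.execute_script(f"window.scrollTo(0, {position})")
        steps += 1
        sleep(pause)  # give lazily loaded content time to appear
        if clock() >= deadline:
            break
    return steps
```

Inside upload(), the loop could then call read_links(request.files['file']) once and, for each returned link, driver.get(link) followed by scroll_for(driver). Raising ValueError for a missing column lets the view decide how to report a malformed upload instead of failing with a KeyError halfway through.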
