
Uploading a CSV file of URLs from an HTML page, and reading the URLs to scrape with Flask

I need to build a web-based system that accepts an uploaded CSV file containing a list of URLs. After the upload, the system should read the URLs line by line and use them in the next step, scraping. The scraping requires logging in to the website first, and I already have working source code for the login. The problem is that I want to connect an HTML page named "upload_page.html" to a Flask file named "upload_csv.py". Where in the Flask file should the login and scraping code go?

upload_page.html


<div class="upload">
    <h2>Upload a CSV file</h2>
    <form action="/upload" method="post" enctype="multipart/form-data">
        <input type="file" name="file" accept=".csv">
        <br>
        <br>
        <button type="submit">Upload</button>
    </form>
</div>

upload_csv.py


from flask import Flask, request, render_template
import pandas as pd
import requests
from bs4 import BeautifulSoup
import csv
import json
import time
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('upload_page.html')

# Code for logging in to the website


@app.route('/upload', methods=['POST'])
def upload():
    # Read the uploaded file
    csv_file = request.files['file']
    # Load the CSV data into a DataFrame
    df = pd.read_csv(csv_file)
    final_data = []
    # Loop over the rows in the DataFrame and scrape each link
    for index, row in df.iterrows():
        link = row['Link']
        response = requests.get(link)
        soup = BeautifulSoup(response.content, 'html.parser')
        start = time.time()
        # will be used in the while loop
        initialScroll = 0
        finalScroll = 1000

        while True:
            driver.execute_script(f"window.scrollTo({initialScroll},{finalScroll})")
            # scrolls the window from the pixel value stored in initialScroll
            # to the pixel value stored in finalScroll
            initialScroll = finalScroll
            finalScroll += 1000

            # pause for 2 seconds so the data can load
            time.sleep(2)
            end = time.time()
            # We will scroll for 20 seconds.
            if round(end - start) > 20:
                break

        src = driver.page_source
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # print(soup.prettify())

        # Code to scrape the website

    return render_template('index.html', message='Scraped all data')


if __name__ == '__main__':
    app.run(debug=True)
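For reference, the loop above reads `row['Link']`, so the uploaded CSV is assumed to have a header column literally named `Link`, e.g.:

```
Link
https://example.com/page1
https://example.com/page2
```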

Are my login and scraping code in the right place? The code doesn't work: after I click the Upload button, nothing gets processed.

P粉799885311 · 565 days ago · 708

All replies (1)

  • P粉207969787 2023-09-08 00:06:27


    csv_file = request.files['file']
    # Load the CSV data into a DataFrame
    df = pd.read_csv(csv_file)
    final_data = []
    # Initialize the web driver
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")
    driver = webdriver.Chrome(options=chrome_options)
    # Loop over the rows in the DataFrame and scrape each link
    for index, row in df.iterrows():
        link = row['Link']
        # Login to the website
        # Replace this with your own login code
        driver.get("https://example.com/login")
        username_field = driver.find_element(By.NAME, "username")
        password_field = driver.find_element(By.NAME, "password")
        username_field.send_keys("myusername")
        password_field.send_keys("mypassword")
        password_field.send_keys(Keys.RETURN)
        # Wait for the login to complete
        WebDriverWait(driver, 10).until(EC.url_changes("https://example.com/login"))
        # Scrape the website
        driver.get(link)
        start = time.time()
        # will be used in the while loop
        initialScroll = 0
        finalScroll = 1000

        while True:
            driver.execute_script(f"window.scrollTo({initialScroll},{finalScroll})")
            # scrolls the window from the pixel value stored in initialScroll
            # to the pixel value stored in finalScroll
            initialScroll = finalScroll
            finalScroll += 1000

            # pause for 2 seconds so the data can load
            time.sleep(2)
            end = time.time()
            # We will scroll for 20 seconds.
            if round(end - start) > 20:
                break
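    The reply's snippet stops after the scroll loop. A minimal sketch of the remaining parsing step, assuming `BeautifulSoup` is applied to `driver.page_source` as in the question's own code (the `h2` selector is a placeholder assumption, not the real structure of the target site):

    ```python
    from bs4 import BeautifulSoup

    def extract_rows(page_source):
        """Parse the fully scrolled page and collect the text of interest.

        The "h2" selector is a placeholder assumption; swap in whatever
        elements the target site actually uses.
        """
        soup = BeautifulSoup(page_source, "html.parser")
        return [tag.get_text(strip=True) for tag in soup.find_all("h2")]
    ```

    Inside the loop this would be called as `final_data.extend(extract_rows(driver.page_source))`, with `driver.quit()` after the loop and the final `render_template` call closing out the route.
    
    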
