How to check website status using Python asyncio

PHPz, reposted 2023-04-21 14:10:18

We can use asyncio to query the HTTP status of websites by opening a stream and writing and reading HTTP requests and responses.

We can then use asyncio to query the status of multiple websites concurrently, and even report the results dynamically.

1. How to check HTTP status with asyncio

The asyncio module provides support for opening socket connections and for reading and writing data via streams. We can use this capability to check the status of web pages.

This involves four steps:

Open the connection
Write the request
Read the response
Close the connection

2. Open the HTTP connection

A connection can be opened in asyncio using the asyncio.open_connection() function. Among its many parameters, the function takes a string hostname and an integer port number.

This is a coroutine that must be awaited. It returns a StreamReader and a StreamWriter for reading from and writing to the socket.

This can be used to open an HTTP connection on port 80.

...
# open a socket connection
reader, writer = await asyncio.open_connection('www.google.com', 80)

We can also open an SSL connection using the ssl=True argument. This can be used to open an HTTPS connection on port 443.

...
# open a socket connection
reader, writer = await asyncio.open_connection('www.google.com', 443, ssl=True)

3. Write the HTTP request

Once the connection is open, we can write a query to the StreamWriter to make an HTTP request. An HTTP version 1.1 request is plain text. For example, a request for the file path "/" may look like this:

GET / HTTP/1.1
Host: www.google.com

Importantly, each line must end with a carriage return and a line feed (\r\n), and the request must end with a blank line.

As Python strings, this may look like this:

'GET / HTTP/1.1\r\n'
'Host: www.google.com\r\n'
'\r\n'

This string must be encoded to bytes before it is written to the StreamWriter. This can be achieved by calling the encode() method on the string itself. The default "utf-8" encoding is sufficient here.

...
# encode string as bytes
byte_data = string.encode()
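As a small illustration, the request text and the encoding step can be combined in one helper. This is only a sketch: the name build_request and the fallback to "/" for an empty path are assumptions made for this example and are not part of the original article.

```python
def build_request(host, path='/'):
    """Return a minimal HTTP/1.1 GET request as bytes (hypothetical helper)."""
    # an empty path is not a valid request target, so fall back to '/'
    if not path:
        path = '/'
    lines = [
        f'GET {path} HTTP/1.1',
        f'Host: {host}',
        '',  # the blank line that ends the header section
        '',  # yields the final '\r\n' once the lines are joined
    ]
    # every line is terminated by carriage return + line feed
    return '\r\n'.join(lines).encode()


print(build_request('www.google.com'))
```

The returned bytes could then be passed directly to the write() method of the StreamWriter.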

The bytes can then be written to the socket via the write() method of the StreamWriter.

...
# write query to socket
writer.write(byte_data)

After writing the request, it is a good idea to wait for the bytes to be sent and for the socket to be ready. This can be achieved via the drain() method, which is a coroutine that must be awaited.

...
# wait for the socket to be ready.
await writer.drain()

4. Read the HTTP response

After making the HTTP request, we can read the response. This can be achieved via the StreamReader for the socket. The response can be read using the read() method, which reads a chunk of bytes, or the readline() method, which reads one line of bytes.

We may prefer the readline() method because HTTP is a text-based protocol that sends its status line and headers one line at a time. The readline() method is a coroutine and must be awaited.

...
# read one line of response
line_bytes = await reader.readline()

An HTTP 1.1 response consists of two parts: a header section terminated by a blank line, followed by a body. The header indicates whether the request succeeded and what type of file will be sent, and the body contains the content of the file, such as an HTML web page.

The first line of the HTTP header contains the HTTP status of the requested page on the server. Each line must be decoded from bytes to a string.

This can be achieved by calling the decode() method on the bytes data. Again, the default encoding is "utf-8".

...
# decode bytes into a string
line_data = line_bytes.decode()
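Once decoded, the status line can be split into its parts if only the numeric code is of interest. The sketch below assumes a well-formed status line; the helper name parse_status_line is hypothetical and does not appear in the original article.

```python
def parse_status_line(line_bytes):
    """Split a raw status line into (version, code, reason). Hypothetical helper."""
    # decode the bytes and drop the trailing '\r\n'
    line = line_bytes.decode().strip()
    # split at most twice: the reason phrase may itself contain spaces
    parts = line.split(' ', 2)
    version = parts[0]
    code = int(parts[1])
    reason = parts[2] if len(parts) > 2 else ''
    return version, code, reason


print(parse_status_line(b'HTTP/1.1 301 Moved Permanently\r\n'))
```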

5. Close the HTTP connection

We can close the socket connection by closing the StreamWriter. This can be achieved by calling the close() method.

...
# close the connection
writer.close()

This call does not block and may not close the socket immediately. Now that we know how to make HTTP requests and read responses using asyncio, let's look at some examples of checking the status of web pages.
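Before moving on, the four steps can be tried end to end without network access. The sketch below is not from the original article: it assumes a tiny stand-in server started with asyncio.start_server() on the local machine, and the names handle_client and check_local_server are hypothetical. It also awaits writer.wait_closed() after close(), which is one way to wait until the socket is really closed.

```python
import asyncio


async def handle_client(reader, writer):
    # stand-in server: read the request headers up to the blank line
    while True:
        line = await reader.readline()
        if line in (b'\r\n', b''):
            break
    # reply with a fixed, minimal response and close the connection
    writer.write(b'HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n')
    await writer.drain()
    writer.close()


async def check_local_server():
    # start the stand-in server on a free local port (port 0 lets the OS choose)
    server = await asyncio.start_server(handle_client, '127.0.0.1', 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        # 1. open the connection
        reader, writer = await asyncio.open_connection('127.0.0.1', port)
        # 2. write the request and wait for it to be sent
        writer.write(b'GET / HTTP/1.1\r\nHost: 127.0.0.1\r\n\r\n')
        await writer.drain()
        # 3. read the first line of the response
        line_bytes = await reader.readline()
        # 4. close the connection and wait until it is closed
        writer.close()
        await writer.wait_closed()
    return line_bytes.decode().strip()


print(asyncio.run(check_local_server()))
```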

6. Example of checking HTTP status sequentially

We can develop an example that checks the HTTP status of multiple websites using asyncio.

In this example, we will first develop a coroutine that checks the status of a given URL. We will then call this coroutine once for each of the top 10 websites.

First, we can define a coroutine that takes a URL string and returns the HTTP status.

# get the HTTP/S status of a webpage
async def get_status(url):
	# ...

The URL must be parsed into its component parts. We need the hostname and the file path when making the HTTP request. We also need to know the URL scheme (HTTP or HTTPS) to determine whether SSL is required.

This can be achieved using the urllib.parse.urlsplit() function, which takes a URL string and returns a named tuple of all the URL elements.

...
# split the url into components
url_parsed = urlsplit(url)
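To make the named tuple concrete, the short example below splits one URL and prints the attributes used in this tutorial. The URL itself is only an illustrative value chosen for this sketch.

```python
from urllib.parse import urlsplit

# split an example URL into its components
url_parsed = urlsplit('https://www.google.com/search?q=asyncio')

print(url_parsed.scheme)    # https
print(url_parsed.hostname)  # www.google.com
print(url_parsed.path)      # /search
print(url_parsed.query)     # q=asyncio
print(url_parsed.port)      # None, because the URL names no explicit port
```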

We can then open the HTTP connection based on the URL scheme, using the hostname from the URL.

...
# open the connection
if url_parsed.scheme == 'https':
    reader, writer = await asyncio.open_connection(url_parsed.hostname, 443, ssl=True)
else:
    reader, writer = await asyncio.open_connection(url_parsed.hostname, 80)
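The same decision can be factored into a small function that also respects a port written in the URL. This is a sketch under assumptions: the helper name connection_args is hypothetical, and honoring an explicit port goes slightly beyond what the original snippet does.

```python
from urllib.parse import urlsplit


def connection_args(url):
    """Return (host, port, use_ssl) for a URL. Hypothetical helper."""
    url_parsed = urlsplit(url)
    # HTTPS needs an SSL connection, plain HTTP does not
    use_ssl = url_parsed.scheme == 'https'
    # prefer an explicit port in the URL, otherwise the scheme default
    port = url_parsed.port or (443 if use_ssl else 80)
    return url_parsed.hostname, port, use_ssl


print(connection_args('https://www.google.com/'))
print(connection_args('http://localhost:8080/health'))
```

The resulting values could then be passed on, for example as asyncio.open_connection(host, port, ssl=use_ssl).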

Next, we can build the HTTP GET request using the hostname and file path, and use the StreamWriter to write the encoded bytes to the socket.

...
# send GET request
query = f'GET {url_parsed.path} HTTP/1.1\r\nHost: {url_parsed.hostname}\r\n\r\n'
# write query to socket
writer.write(query.encode())
# wait for the bytes to be written to the socket
await writer.drain()

Next, we can read the HTTP response. We only need the first line of the response, which contains the HTTP status.

...
# read the single line response
response = await reader.readline()

The connection can then be closed.

...
# close the connection
writer.close()

Finally, we can decode the bytes read from the server, strip the trailing whitespace, and return the HTTP status.

...
# decode and strip white space
status = response.decode().strip()
# return the response
return status

Tying this together, the complete get_status() coroutine is listed below. It has no error handling for cases such as an unreachable host or a slow response. Adding this would make a good extension for the reader.

# get the HTTP/S status of a webpage
async def get_status(url):
    # split the url into components
    url_parsed = urlsplit(url)
    # open the connection
    if url_parsed.scheme == 'https':
        reader, writer = await asyncio.open_connection(url_parsed.hostname, 443, ssl=True)
    else:
        reader, writer = await asyncio.open_connection(url_parsed.hostname, 80)
    # send GET request
    query = f'GET {url_parsed.path} HTTP/1.1\r\nHost: {url_parsed.hostname}\r\n\r\n'
    # write query to socket
    writer.write(query.encode())
    # wait for the bytes to be written to the socket
    await writer.drain()
    # read the single line response
    response = await reader.readline()
    # close the connection
    writer.close()
    # decode and strip white space
    status = response.decode().strip()
    # return the response
    return status
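One possible shape for the error handling mentioned above is a wrapper that applies a timeout and turns connection failures into a short message. This is only a sketch, not the author's method: the name status_or_error and the message strings are assumptions, and the stand-in coroutines slow() and unreachable() replace real calls to get_status(url) so that the example runs without network access.

```python
import asyncio


async def status_or_error(coro, timeout=5.0):
    """Await a status coroutine, turning failures into a short message.

    Hypothetical wrapper; 'coro' would be a call such as get_status(url).
    """
    try:
        # give up if the coroutine takes longer than 'timeout' seconds
        return await asyncio.wait_for(coro, timeout)
    except asyncio.TimeoutError:
        return 'error: timed out'
    except OSError as exc:
        # covers unreachable hosts, refused connections and DNS failures
        return f'error: {type(exc).__name__}'


# stand-in coroutines so the sketch runs without network access
async def slow():
    await asyncio.sleep(1)
    return 'HTTP/1.1 200 OK'


async def unreachable():
    raise ConnectionRefusedError()


async def demo():
    print(await status_or_error(slow(), timeout=0.05))
    print(await status_or_error(unreachable()))


asyncio.run(demo())
```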

Next, we can call the get_status() coroutine for multiple web pages or websites that we want to check. In this case, we will define a list of the top 10 web pages in the world.

...
# list of top 10 websites to check
sites = ['https://www.google.com/',
    'https://www.youtube.com/',
    'https://www.facebook.com/',
    'https://twitter.com/',
    'https://www.instagram.com/',
    'https://www.baidu.com/',
    'https://www.wikipedia.org/',
    'https://yandex.ru/',
    'https://yahoo.com/',
    'https://www.whatsapp.com/'
    ]

We can then query each site in turn using our get_status() coroutine. In this case, we will do so sequentially in a loop and report each status in turn.

...
# check the status of all websites
for url in sites:
    # get the status for the url
    status = await get_status(url)
    # report the url and its status
    print(f'{url:30}:\t{status}')

We can do better than sequential execution when using asyncio, but this provides a good starting point that we can improve later. Tying this together, the main() coroutine queries the status of the top 10 websites.

# main coroutine
async def main():
    # list of top 10 websites to check
    sites = ['https://www.google.com/',
        'https://www.youtube.com/',
        'https://www.facebook.com/',
        'https://twitter.com/',
        'https://www.instagram.com/',
        'https://www.baidu.com/',
        'https://www.wikipedia.org/',
        'https://yandex.ru/',
        'https://yahoo.com/',
        'https://www.whatsapp.com/'
        ]
    # check the status of all websites
    for url in sites:
        # get the status for the url
        status = await get_status(url)
        # report the url and its status
        print(f'{url:30}:\t{status}')

Finally, we can create the main() coroutine and use it as the entry point of the asyncio program.

...
# run the asyncio program
asyncio.run(main())

Tying this together, the complete example is listed below.

# SuperFastPython.com
# check the status of many webpages
import asyncio
from urllib.parse import urlsplit
 
# get the HTTP/S status of a webpage
async def get_status(url):
    # split the url into components
    url_parsed = urlsplit(url)
    # open the connection
    if url_parsed.scheme == 'https':
        reader, writer = await asyncio.open_connection(url_parsed.hostname, 443, ssl=True)
    else:
        reader, writer = await asyncio.open_connection(url_parsed.hostname, 80)
    # send GET request
    query = f'GET {url_parsed.path} HTTP/1.1\r\nHost: {url_parsed.hostname}\r\n\r\n'
    # write query to socket
    writer.write(query.encode())
    # wait for the bytes to be written to the socket
    await writer.drain()
    # read the single line response
    response = await reader.readline()
    # close the connection
    writer.close()
    # decode and strip white space
    status = response.decode().strip()
    # return the response
    return status
 
# main coroutine
async def main():
    # list of top 10 websites to check
    sites = ['https://www.google.com/',
        'https://www.youtube.com/',
        'https://www.facebook.com/',
        'https://twitter.com/',
        'https://www.instagram.com/',
        'https://www.baidu.com/',
        'https://www.wikipedia.org/',
        'https://yandex.ru/',
        'https://yahoo.com/',
        'https://www.whatsapp.com/'
        ]
    # check the status of all websites
    for url in sites:
        # get the status for the url
        status = await get_status(url)
        # report the url and its status
        print(f'{url:30}:\t{status}')
 
# run the asyncio program
asyncio.run(main())

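Running the example first creates the main() coroutine and uses it as the entry point of the program. The main() coroutine runs and defines the list of the top 10 websites. The list is then traversed sequentially. For each URL, the main() coroutine suspends and calls the get_status() coroutine to query the status of that website.

The get_status() coroutine runs, parses the URL, and opens a connection. It constructs an HTTP GET query and writes it to the host. The response is read, decoded, and returned. The main() coroutine resumes and reports the HTTP status of the URL.

This is repeated for each URL in the list. The program takes about 5.6 seconds to complete, or about half a second per URL on average. This highlights how we can use asyncio to query the HTTP status of web pages.

Nevertheless, it does not take full advantage of asyncio to execute tasks concurrently.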

https://www.google.com/ :   HTTP/1.1 200 OK
https://www.youtube.com/ :   HTTP/1.1 200 OK
https://www.facebook.com/ :   HTTP/1.1 302 Found
https://twitter.com/ :   HTTP/1.1 200 OK
https://www.instagram.com/ :   HTTP/1.1 200 OK
https://www.baidu.com/ :   HTTP/1.1 200 OK
https://www.wikipedia.org/ :   HTTP/1.1 200 OK
https://yandex.ru/ :   HTTP/1.1 302 Moved temporarily
https://yahoo.com/ :   HTTP/1.1 301 Moved Permanently
https://www.whatsapp.com/ :   HTTP/1.1 302 Found

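7. Example of checking website status concurrently

A benefit of asyncio is that we can execute many coroutines concurrently. We can query the status of websites concurrently in asyncio using the asyncio.gather() function.

This function takes one or more coroutines, suspends the caller until the provided coroutines have finished, and returns the result of each coroutine as an iterable. We can then traverse the list of URLs together with the iterable of return values and report the results.

This may be a simpler approach than the one above. First, we can create a list of coroutines.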

...
# create all coroutine requests
coros = [get_status(url) for url in sites]

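Next, we can execute the coroutines and get an iterable of results using asyncio.gather().

Note that we cannot provide the list of coroutines directly. Instead, we must unpack the list into separate expressions that are passed to the function as positional arguments.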

...
# execute all coroutines and wait
results = await asyncio.gather(*coros)
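Two properties of asyncio.gather() are worth seeing in isolation: results come back in the order the coroutines were passed in, and with return_exceptions=True a failing coroutine yields its exception as a value instead of aborting the whole call. The sketch below is not from the original article; fake_status() is a hypothetical stand-in for get_status() that sleeps instead of contacting a server, and the example.* URLs are placeholders.

```python
import asyncio


async def fake_status(url, delay):
    # stand-in for get_status(): sleep instead of contacting a server
    await asyncio.sleep(delay)
    if 'bad' in url:
        raise ConnectionError(f'cannot reach {url}')
    return f'{url} -> HTTP/1.1 200 OK'


async def main():
    coros = [
        fake_status('https://slow.example/', 0.05),
        fake_status('https://bad.example/', 0.01),
        fake_status('https://fast.example/', 0.0),
    ]
    # results keep the order of the arguments, not the order of completion,
    # and exceptions are returned as values instead of being raised
    return await asyncio.gather(*coros, return_exceptions=True)


results = asyncio.run(main())
for result in results:
    print(repr(result))
```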

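This will execute all of the coroutines concurrently and retrieve their results. We can then traverse the list of URLs and the returned statuses and report each in turn.

...
# process all results
for url, status in zip(sites, results):
    # report status
    print(f'{url:30}:\t{status}')

Tying this together, the complete example is listed below.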

# SuperFastPython.com
# check the status of many webpages
import asyncio
from urllib.parse import urlsplit
 
# get the HTTP/S status of a webpage
async def get_status(url):
    # split the url into components
    url_parsed = urlsplit(url)
    # open the connection
    if url_parsed.scheme == 'https':
        reader, writer = await asyncio.open_connection(url_parsed.hostname, 443, ssl=True)
    else:
        reader, writer = await asyncio.open_connection(url_parsed.hostname, 80)
    # send GET request
    query = f'GET {url_parsed.path} HTTP/1.1\r\nHost: {url_parsed.hostname}\r\n\r\n'
    # write query to socket
    writer.write(query.encode())
    # wait for the bytes to be written to the socket
    await writer.drain()
    # read the single line response
    response = await reader.readline()
    # close the connection
    writer.close()
    # decode and strip white space
    status = response.decode().strip()
    # return the response
    return status
 
# main coroutine
async def main():
    # list of top 10 websites to check
    sites = ['https://www.google.com/',
        'https://www.youtube.com/',
        'https://www.facebook.com/',
        'https://twitter.com/',
        'https://www.instagram.com/',
        'https://www.baidu.com/',
        'https://www.wikipedia.org/',
        'https://yandex.ru/',
        'https://yahoo.com/',
        'https://www.whatsapp.com/'
        ]
    # create all coroutine requests
    coros = [get_status(url) for url in sites]
    # execute all coroutines and wait
    results = await asyncio.gather(*coros)
    # process all results
    for url, status in zip(sites, results):
        # report status
        print(f'{url:30}:\t{status}')
 
# run the asyncio program
asyncio.run(main())

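Running the example executes the main() coroutine as before. In this case, a list of coroutines is created in a list comprehension.

The asyncio.gather() function is then called, passing the coroutines and suspending the main() coroutine until they are all complete. The coroutines execute, querying each website concurrently and returning their status.

The main() coroutine resumes and receives an iterable of status values. This iterable is then traversed together with the list of URLs using the zip() built-in function, and each status is reported.

This highlights a simpler approach to executing coroutines concurrently and reporting the results after all tasks are complete. It is also faster than the sequential version above, completing in about 1.4 seconds on my system.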

https://www.google.com/ :   HTTP/1.1 200 OK
https://www.youtube.com/ :   HTTP/1.1 200 OK
https://www.facebook.com/ :   HTTP/1.1 302 Found
https://twitter.com/ :   HTTP/1.1 200 OK
https://www.instagram.com/ :   HTTP/1.1 200 OK
https://www.baidu.com/ :   HTTP/1.1 200 OK
https://www.wikipedia.org/ :   HTTP/1.1 200 OK
https://yandex.ru/ :   HTTP/1.1 302 Moved temporarily
https://yahoo.com/ :   HTTP/1.1 301 Moved Permanently
https://www.whatsapp.com/ :   HTTP/1.1 302 Found
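The timings quoted in this article depend on the network and will differ from run to run. As a rough offline illustration of why the concurrent version is faster, the sketch below replaces each request with a fixed asyncio.sleep() delay and times both approaches. The names fake_status, run_sequential, run_concurrent, and timed, as well as the 0.05 second delay, are assumptions chosen for this example.

```python
import asyncio
import time


async def fake_status(delay):
    # stand-in for get_status(): each "request" takes 'delay' seconds
    await asyncio.sleep(delay)
    return 'HTTP/1.1 200 OK'


async def run_sequential(delays):
    # await each simulated request before starting the next one
    return [await fake_status(d) for d in delays]


async def run_concurrent(delays):
    # start all simulated requests together and wait for all of them
    return await asyncio.gather(*[fake_status(d) for d in delays])


def timed(coro):
    # run a coroutine to completion and return the elapsed wall-clock time
    start = time.perf_counter()
    asyncio.run(coro)
    return time.perf_counter() - start


delays = [0.05] * 10
sequential_time = timed(run_sequential(delays))
concurrent_time = timed(run_concurrent(delays))
print(f'sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s')
```

With ten simulated requests, the sequential total is roughly the sum of the delays, while the concurrent total is close to a single delay.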
