How to implement website status checking using Python Asyncio

We can use asyncio to query the HTTP status of a website by opening a stream, writing an HTTP request, and reading the response.

Then we can use asyncio to concurrently query the status of multiple websites and even dynamically report the results.

1. How to use Asyncio to check HTTP status

The asyncio module provides support for opening socket connections and reading and writing data through streams. We can use this feature to check the status of a web page.

This involves four steps (a combined sketch follows the list):

  • Open a connection

  • Write a request

  • Read the response

  • Close the connection
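
Combined ahead of time, a minimal sketch of these four steps might look as follows (the host name is only an example, and error handling is omitted):

# a minimal sketch combining the four steps over plain HTTP on port 80
import asyncio

async def check(host):
    # 1. open a socket connection to the host
    reader, writer = await asyncio.open_connection(host, 80)
    # 2. write an HTTP GET request for the root path
    writer.write(f'GET / HTTP/1.1\r\nHost: {host}\r\n\r\n'.encode())
    await writer.drain()
    # 3. read the first line of the response (the status line)
    status = await reader.readline()
    # 4. close the connection
    writer.close()
    return status.decode().strip()

The sections below walk through each step in detail.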

2. Open an HTTP connection

The asyncio.open_connection() function opens a connection in asyncio. Among other parameters, the function takes a string hostname and an integer port number.

This is a coroutine that must be awaited; it returns a StreamReader and a StreamWriter for reading from and writing to the socket.

This can be used to open an HTTP connection on port 80.

...
# open a socket connection
reader, writer = await asyncio.open_connection('www.google.com', 80)

We can also use the ssl=True parameter to open an SSL connection. This can be used to open an HTTPS connection on port 443.

...
# open an SSL socket connection
reader, writer = await asyncio.open_connection('www.google.com', 443, ssl=True)

3. Write an HTTP request

Once the connection is open, we can write a query to the StreamWriter to make an HTTP request. For example, an HTTP version 1.1 request is plain text. We can request the file path "/", which might look like this:

GET / HTTP/1.1
Host: www.google.com

Importantly, each line must end with a carriage return and a line feed (\r\n), and the request must be terminated by a blank line.

As a Python string, this might look like this:

'GET / HTTP/1.1\r\n'
'Host: www.google.com\r\n'
'\r\n'

This string must be encoded into bytes before being written to the StreamWriter. This can be achieved by using the encode() method on the string itself. The default "utf-8" encoding may be sufficient.

...
# encode string as bytes
byte_data = string.encode()

The bytes can then be written to the socket via the StreamWriter's write() method.

...
# write query to socket
writer.write(byte_data)

After writing the request, it is best to wait for the bytes to be sent and for the socket to be ready. This can be achieved with the drain() method, which is a coroutine that must be awaited.

...
# wait for the socket to be ready.
await writer.drain()

4. Read the HTTP response

After making the HTTP request, we can read the response. This can be achieved through the socket's StreamReader. The response can be read using the read() method, which reads a large block of bytes, or the readline() method, which reads a line of bytes.

We may prefer the readline() method here because HTTP is a text-based protocol that sends its headers one line at a time. The readline() method is a coroutine and must be awaited.

...
# read one line of response
line_bytes = await reader.readline()

An HTTP 1.1 response consists of two parts: a header, then a blank line, then the body. The header contains information about whether the request was successful and what type of file will be sent, and the body contains the content of the file, such as an HTML web page.

The first line of the HTTP header contains the HTTP status of the requested page on the server. Each line must be decoded from bytes to a string.

This can be achieved by calling the decode() method on the byte data. Again, the default "utf-8" encoding is sufficient.

...
# decode bytes into a string
line_data = line_bytes.decode()
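
For example, the decoded status line might look like "HTTP/1.1 200 OK". If we wanted the parts separately, the line could be split (a small sketch, not needed for the examples below):

...
# split a status line such as 'HTTP/1.1 200 OK' into its parts
# (assumes the reason phrase is present)
version, code, reason = line_data.split(' ', 2)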

5. Close the HTTP connection

We can close the socket connection by closing the StreamWriter. This can be achieved by calling the close() method.

...
# close the connection
writer.close()

This does not block and may not close the socket immediately.
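
If we want to be sure the connection is fully closed before moving on, the StreamWriter.wait_closed() coroutine (available in Python 3.7 and later) can be awaited after calling close(). A minimal sketch:

...
# close the connection and wait until it is fully closed
writer.close()
await writer.wait_closed()

Now that we know how to make HTTP requests and read responses using asyncio, let's look at some examples of checking the status of a web page.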

6. Example of sequentially checking HTTP status

We can develop an example to check the HTTP status of multiple websites using asyncio.

In this example, we will first develop a coroutine to check the status of a given URL. We will then call this coroutine once for each of the top 10 websites.

First, we can define a coroutine that will accept a URL string and return the HTTP status.

# get the HTTP/S status of a webpage
async def get_status(url):
	# ...

The URL must be parsed into its component parts. We need the hostname and file path when making an HTTP request. We also need to know the URL scheme (HTTP or HTTPS) to determine if SSL is required.

This can be achieved using the urllib.parse.urlsplit() function, which accepts a URL string and returns a named tuple of all URL elements.

...
# split the url into components
url_parsed = urlsplit(url)
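
For example, parsing one of the URLs used later shows the components we need (the printed values are shown as comments):

...
# inspect the components of a parsed URL
parts = urlsplit('https://www.google.com/')
print(parts.scheme)    # https
print(parts.hostname)  # www.google.com
print(parts.path)      # /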

Then we can open an HTTP connection based on the URL scheme and use the URL hostname.

...
# open the connection
if url_parsed.scheme == 'https':
    reader, writer = await asyncio.open_connection(url_parsed.hostname, 443, ssl=True)
else:
    reader, writer = await asyncio.open_connection(url_parsed.hostname, 80)

Next, we can create an HTTP GET request using the hostname and file path, and use a StreamWriter to write the encoded bytes to the socket.

...
# send GET request
query = f'GET {url_parsed.path} HTTP/1.1\r\nHost: {url_parsed.hostname}\r\n\r\n'
# write query to socket
writer.write(query.encode())
# wait for the bytes to be written to the socket
await writer.drain()

Next, we can read the HTTP response. We only need the first line of the response, which contains the HTTP status.

...
# read the single line response
response = await reader.readline()

The connection can then be closed.

...
# close the connection
writer.close()

Finally, we can decode the bytes read from the server, remove trailing whitespace, and return the HTTP status.

...
# decode and strip white space
status = response.decode().strip()
# return the response
return status

Combining them together, the complete get_status() coroutine is listed below. It does not include any error handling, such as for an unreachable host or a slow response; adding this would make a nice extension for the reader (a sketch of one approach follows the listing).

# get the HTTP/S status of a webpage
async def get_status(url):
    # split the url into components
    url_parsed = urlsplit(url)
    # open the connection
    if url_parsed.scheme == 'https':
        reader, writer = await asyncio.open_connection(url_parsed.hostname, 443, ssl=True)
    else:
        reader, writer = await asyncio.open_connection(url_parsed.hostname, 80)
    # send GET request
    query = f'GET {url_parsed.path} HTTP/1.1\r\nHost: {url_parsed.hostname}\r\n\r\n'
    # write query to socket
    writer.write(query.encode())
    # wait for the bytes to be written to the socket
    await writer.drain()
    # read the single line response
    response = await reader.readline()
    # close the connection
    writer.close()
    # decode and strip white space
    status = response.decode().strip()
    # return the response
    return status
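
As a sketch of one possible approach (get_status_safe is a hypothetical wrapper, not part of the original example), a slow or unreachable host could be handled by adding a timeout with asyncio.wait_for() and catching connection errors:

# a sketch: wrap get_status() with a timeout and basic error handling
async def get_status_safe(url, timeout=5.0):
    try:
        # cancel the query if it takes longer than the timeout
        return await asyncio.wait_for(get_status(url), timeout)
    except asyncio.TimeoutError:
        return 'Timed out'
    except OSError as e:
        # covers failures such as DNS errors and refused connections
        return f'Connection error: {e}'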

Next, we can call the get_status() coroutine for multiple web pages or websites we want to check. In this case, we will define a list of the top 10 web pages in the world.

...
# list of top 10 websites to check
sites = ['https://www.google.com/',
    'https://www.youtube.com/',
    'https://www.facebook.com/',
    'https://twitter.com/',
    'https://www.instagram.com/',
    'https://www.baidu.com/',
    'https://www.wikipedia.org/',
    'https://yandex.ru/',
    'https://yahoo.com/',
    'https://www.whatsapp.com/'
    ]

We can then query each one using our get_status() coroutine. In this case, we will do so sequentially in a loop and report the status of each in turn.

...
# check the status of all websites
for url in sites:
    # get the status for the url
    status = await get_status(url)
    # report the url and its status
    print(f'{url:30}:\t{status}')

When using asyncio we can do better than sequential execution, but this provides a good starting point that we can improve on later. Combining them together, the main() coroutine queries the status of the top 10 websites.

# main coroutine
async def main():
    # list of top 10 websites to check
    sites = ['https://www.google.com/',
        'https://www.youtube.com/',
        'https://www.facebook.com/',
        'https://twitter.com/',
        'https://www.instagram.com/',
        'https://www.baidu.com/',
        'https://www.wikipedia.org/',
        'https://yandex.ru/',
        'https://yahoo.com/',
        'https://www.whatsapp.com/'
        ]
    # check the status of all websites
    for url in sites:
        # get the status for the url
        status = await get_status(url)
        # report the url and its status
        print(f'{url:30}:\t{status}')

Finally, we can create the main() coroutine and use it as the entry point to the asyncio program.

...
# run the asyncio program
asyncio.run(main())

Combining them together, the complete example is listed below.

# SuperFastPython.com
# check the status of many webpages
import asyncio
from urllib.parse import urlsplit
 
# get the HTTP/S status of a webpage
async def get_status(url):
    # split the url into components
    url_parsed = urlsplit(url)
    # open the connection
    if url_parsed.scheme == 'https':
        reader, writer = await asyncio.open_connection(url_parsed.hostname, 443, ssl=True)
    else:
        reader, writer = await asyncio.open_connection(url_parsed.hostname, 80)
    # send GET request
    query = f'GET {url_parsed.path} HTTP/1.1\r\nHost: {url_parsed.hostname}\r\n\r\n'
    # write query to socket
    writer.write(query.encode())
    # wait for the bytes to be written to the socket
    await writer.drain()
    # read the single line response
    response = await reader.readline()
    # close the connection
    writer.close()
    # decode and strip white space
    status = response.decode().strip()
    # return the response
    return status
 
# main coroutine
async def main():
    # list of top 10 websites to check
    sites = ['https://www.google.com/',
        'https://www.youtube.com/',
        'https://www.facebook.com/',
        'https://twitter.com/',
        'https://www.instagram.com/',
        'https://www.baidu.com/',
        'https://www.wikipedia.org/',
        'https://yandex.ru/',
        'https://yahoo.com/',
        'https://www.whatsapp.com/'
        ]
    # check the status of all websites
    for url in sites:
        # get the status for the url
        status = await get_status(url)
        # report the url and its status
        print(f'{url:30}:\t{status}')
 
# run the asyncio program
asyncio.run(main())

Running the example first creates the main() coroutine and uses it as the entry point to the program. The main() coroutine runs and defines the list of the top 10 websites. It then loops over the list of websites sequentially, suspending while it calls the get_status() coroutine to query the status of one website.

The get_status() coroutine runs, parses the URL, and opens a connection. It constructs an HTTP GET query and writes it to the host. A response is read, decoded, and returned. The main() coroutine resumes and reports the HTTP status of the URL.

This is repeated for each URL in the list. The program takes about 5.6 seconds to complete, or about half a second per URL on average. This highlights how we can use asyncio to query the HTTP status of web pages.

However, it does not take full advantage of asyncio to execute tasks concurrently.

https://www.google.com/       :    HTTP/1.1 200 OK
https://www.youtube.com/      :    HTTP/1.1 200 OK
https://www.facebook.com/     :    HTTP/1.1 302 Found
https://twitter.com/          :    HTTP/1.1 200 OK
https://www.instagram.com/    :    HTTP/1.1 200 OK
https://www.baidu.com/        :    HTTP/1.1 200 OK
https://www.wikipedia.org/    :    HTTP/1.1 200 OK
https://yandex.ru/            :    HTTP/1.1 302 Moved temporarily
https://yahoo.com/            :    HTTP/1.1 301 Moved Permanently
https://www.whatsapp.com/     :    HTTP/1.1 302 Found

7. Example of concurrently checking website status

One benefit of asyncio is that we can execute many coroutines concurrently. We can query the status of websites concurrently in asyncio using the asyncio.gather() function.

This function takes one or more coroutines, suspends while executing the provided coroutines, and returns the result from each as an iterable. We can then traverse the list of URLs together with the iterable of coroutine return values and report the results.

This may be a simpler approach than the one above. First, we can create a list of coroutines.

...
# create all coroutine requests
coros = [get_status(url) for url in sites]

Next, we can execute the coroutines and get the iterable of results using asyncio.gather().

Note that we cannot provide the list of coroutines directly; instead, the list must be unpacked into separate expressions that are passed to the function as positional arguments.

...
# execute all coroutines and wait
results = await asyncio.gather(*coros)

This will execute all of the coroutines concurrently and retrieve their results. We can then traverse the list of URLs together with the returned statuses and report each in turn.

...
# process all results
for url, status in zip(sites, results):
    # report status
    print(f'{url:30}:\t{status}')

Combining them together, the complete example is listed below.

# SuperFastPython.com
# check the status of many webpages
import asyncio
from urllib.parse import urlsplit
 
# get the HTTP/S status of a webpage
async def get_status(url):
    # split the url into components
    url_parsed = urlsplit(url)
    # open the connection
    if url_parsed.scheme == 'https':
        reader, writer = await asyncio.open_connection(url_parsed.hostname, 443, ssl=True)
    else:
        reader, writer = await asyncio.open_connection(url_parsed.hostname, 80)
    # send GET request
    query = f'GET {url_parsed.path} HTTP/1.1\r\nHost: {url_parsed.hostname}\r\n\r\n'
    # write query to socket
    writer.write(query.encode())
    # wait for the bytes to be written to the socket
    await writer.drain()
    # read the single line response
    response = await reader.readline()
    # close the connection
    writer.close()
    # decode and strip white space
    status = response.decode().strip()
    # return the response
    return status
 
# main coroutine
async def main():
    # list of top 10 websites to check
    sites = ['https://www.google.com/',
        'https://www.youtube.com/',
        'https://www.facebook.com/',
        'https://twitter.com/',
        'https://www.instagram.com/',
        'https://www.baidu.com/',
        'https://www.wikipedia.org/',
        'https://yandex.ru/',
        'https://yahoo.com/',
        'https://www.whatsapp.com/'
        ]
    # create all coroutine requests
    coros = [get_status(url) for url in sites]
    # execute all coroutines and wait
    results = await asyncio.gather(*coros)
    # process all results
    for url, status in zip(sites, results):
        # report status
        print(f'{url:30}:\t{status}')
 
# run the asyncio program
asyncio.run(main())

Running the example executes the main() coroutine as before. In this case, the list of coroutines is created in a list comprehension.

The asyncio.gather() function is then called with the coroutines, and the main() coroutine is suspended until they are all complete. The coroutines execute, querying each website concurrently and returning their statuses.

The main() coroutine resumes and receives an iterable of status values. This iterable is traversed together with the list of URLs using the zip() built-in function, and the statuses are reported.

This highlights a simpler approach to executing the coroutines concurrently and reporting the results after all tasks have completed. It is also faster than the sequential version above, completing in about 1.4 seconds on my system.

https://www.google.com/       :    HTTP/1.1 200 OK
https://www.youtube.com/      :    HTTP/1.1 200 OK
https://www.facebook.com/     :    HTTP/1.1 302 Found
https://twitter.com/          :    HTTP/1.1 200 OK
https://www.instagram.com/    :    HTTP/1.1 200 OK
https://www.baidu.com/        :    HTTP/1.1 200 OK
https://www.wikipedia.org/    :    HTTP/1.1 200 OK
https://yandex.ru/            :    HTTP/1.1 302 Moved temporarily
https://yahoo.com/            :    HTTP/1.1 301 Moved Permanently
https://www.whatsapp.com/     :    HTTP/1.1 302 Found
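
The introduction mentioned reporting results dynamically. As a final sketch (get_status_with_url and main_dynamic are hypothetical helpers, not part of the original examples), the asyncio.as_completed() function can report each status as soon as it arrives rather than waiting for all of them:

# a sketch: report each result as soon as it completes
async def get_status_with_url(url):
    # pair the url with its status so results can be identified
    return url, await get_status(url)

async def main_dynamic(sites):
    # create one coroutine per site, as before
    coros = [get_status_with_url(url) for url in sites]
    # iterate over the results in completion order
    for coro in asyncio.as_completed(coros):
        url, status = await coro
        print(f'{url:30}:\t{status}')

Unlike asyncio.gather(), asyncio.as_completed() yields results in completion order rather than input order, which is why each URL is paired with its status.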

