
Detailed explanation of http and https protocols in python crawler (picture and text)

不言
2018-09-15 15:02:44

This article gives a detailed explanation of the HTTP and HTTPS protocols as they apply to Python crawlers. It should be a useful reference for anyone who needs it.

1. HTTP protocol

1. Official concept:

HTTP is the abbreviation of Hyper Text Transfer Protocol, the transfer protocol that World Wide Web (WWW) servers use to transmit hypertext to local browsers. (This definition may leave newcomers confused, but it is the authoritative, official explanation of HTTP. To really understand it, read on...)

2. Vernacular concept:

In plain terms, the HTTP protocol is a form of data interaction (two-way transmission of data) between a server and a client. If we personify the server and the client, the protocol is the agreed way in which these two brothers talk to each other.

3. HTTP working principle:

HTTP works on a client-server architecture. Acting as the HTTP client, the browser sends each request to the HTTP server (that is, the web server) identified by a URL. The web server then sends response information back to the client based on the request it received.
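The request/response cycle described above can be tried out locally. The following is only a minimal sketch using Python's standard library: it starts a throwaway server on 127.0.0.1, sends it one GET request, and returns what came back. The names `HelloHandler` and `fetch_once` and the path `/hello` are made up for this illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a short text body."""

    def do_GET(self):
        body = f"you asked for {self.path}".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_once(path="/hello"):
    """Start a local server, send one request, return (status, type, body)."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request("GET", path)          # the client sends the request
        resp = conn.getresponse()          # the server answers with a response
        result = (resp.status, resp.getheader("Content-Type"), resp.read().decode("utf-8"))
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_once())
```

A real crawler plays only the client role; the server half is included here so that the sketch runs without touching the network.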


4. Four points to note about HTTP:

- HTTP allows the transmission of any type of data object. The type being transferred is marked by Content-Type.

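- HTTP is connectionless: connectionless means that each connection handles only one request. After the server has processed the client's request and the client has received the response, the connection is closed. This approach saves transmission time. (This describes the original HTTP/1.0 behavior; HTTP/1.1 keeps the connection open by default so that it can be reused.)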

- HTTP is media independent: this means that any type of data can be sent over HTTP as long as the client and server know how to handle the data content. Clients and servers specify the appropriate MIME-type content type to use.

- HTTP is stateless: stateless means that the protocol has no memory of earlier transactions. Because of this lack of state, if later processing needs earlier information, that information must be sent again, which can increase the amount of data transferred per connection. On the other hand, the server responds faster when it does not need the earlier information.
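To make the Content-Type / MIME-type point concrete, here is a small sketch that maps file names to the content type a server would typically announce for them. It relies on the standard `mimetypes` module; the helper name `content_type_for` and the fallback to `application/octet-stream` are choices made for this example.

```python
import mimetypes


def content_type_for(filename):
    """Guess the MIME type for a file name, with a generic binary fallback."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"


if __name__ == "__main__":
    for name in ("index.html", "logo.png", "blob.zzzunknown"):
        print(name, "->", content_type_for(name))
```

A crawler usually does the reverse: it reads the Content-Type header of the response to decide whether the body is HTML to parse, an image to save, and so on.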

5. HTTP URL:

HTTP uses Uniform Resource Identifiers (URI) to transmit data and establish connections. URL is a special type of URI that contains enough information to find a resource.

URL stands for Uniform Resource Locator. It is used to identify the address of a resource on the Internet. Take the following URL as an example of the components of an ordinary URL:

http://www.aspxfans.com:8080/news/index.asp?boardID=5&ID=24618&page=1#name

As this example shows, a complete URL includes the following parts:

- Protocol part: The protocol part of this URL is "http:", which means that the web page uses the HTTP protocol. Various protocols can be used on the Internet, such as HTTP and FTP. The "//" after "http:" is the delimiter.

- Domain name part: The domain name part of this URL is "www.aspxfans.com". An IP address can also be used in place of the domain name.

- Port part: The port follows the domain name, with ":" as the separator between the two. The port is not a required part of a URL. If it is omitted, the default port is used (80 for HTTP).

- Virtual directory part: Everything from the first "/" after the domain name to the last "/" is the virtual directory part. The virtual directory is also not a required part of a URL. The virtual directory in this example is "/news/".

- File name part: Everything from the last "/" after the domain name up to "?" is the file name part. If there is no "?", it runs from the last "/" up to "#". If there is neither "?" nor "#", it runs from the last "/" to the end of the URL. The file name in this example is "index.asp". The file name part is not a required part of a URL. If it is omitted, the default file name is used.

- Anchor part: Everything from "#" to the end is the anchor part. The anchor in this example is "name". The anchor part is not a required part of a URL either.

- Parameter part: The part from "?" to "#" is the parameter part, also known as the search part or query part. The parameter part in this example is "boardID=5&ID=24618&page=1". Multiple parameters are allowed, with "&" as the separator between them.
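The same breakdown can be reproduced with Python's standard `urllib.parse` module. This is a rough sketch: the helper name `describe_url` and the keys of the returned dictionary are invented here to mirror the part names used above, and splitting the path into "virtual directory" and "file name" is done by hand because `urlsplit` returns the path as one piece.

```python
from urllib.parse import parse_qs, urlsplit

EXAMPLE_URL = "http://www.aspxfans.com:8080/news/index.asp?boardID=5&ID=24618&page=1#name"


def describe_url(url):
    """Split a URL into the parts named in the list above."""
    parts = urlsplit(url)
    # Everything up to and including the last "/" is the virtual directory,
    # whatever follows it is the file name.
    head, _slash, file_name = parts.path.rpartition("/")
    return {
        "protocol": parts.scheme,
        "domain": parts.hostname,
        "port": parts.port,              # None when the URL omits the port
        "virtual_directory": head + "/",
        "file_name": file_name,
        "parameters": parse_qs(parts.query),
        "anchor": parts.fragment,
    }


if __name__ == "__main__":
    for key, value in describe_url(EXAMPLE_URL).items():
        print(f"{key}: {value}")
```

Note that `parse_qs` returns each parameter value as a list, because a query string may repeat the same parameter name.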

6. HTTP Request:

The request message sent by the client to the server includes the following components:

Message header: Often called the request header. It stores the main descriptive information about the request (a kind of self-introduction), from which the server learns about the client.

Common request headers:

Accept: The browser uses this header to tell the server which data types it supports
Accept-Charset: The browser uses this header to tell the server which character sets it supports
Accept-Encoding: The browser uses this header to tell the server which compression formats it supports
Accept-Language: The browser uses this header to tell the server its locale
Host: The browser uses this header to tell the server which host it wants to access
If-Modified-Since: The browser uses this header to tell the server the time of its cached copy of the data
Referer: The browser uses this header to tell the server which page the client came from (used for anti-leeching)
Connection: The browser uses this header to tell the server whether to close or keep the connection after the request is completed
X-Requested-With: The value XMLHttpRequest indicates that the request was made through Ajax
User-Agent: The identity of the request carrier (the program that sends the request)

Message body: Often called the request body. It stores the data to be sent to the server.
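In a crawler, these headers are usually set by hand so that the request resembles one sent by a browser. Below is a minimal sketch with the standard `urllib.request` module. It only builds the request object and sends nothing; the function name `build_request`, the header values, and the example.com addresses are placeholders chosen for this example.

```python
import urllib.request
from urllib.parse import urlencode


def build_request(url, form=None):
    """Build (but do not send) a request with a header and an optional body."""
    headers = {
        "User-Agent": "example-crawler/0.1",          # identity of the request carrier
        "Accept": "text/html,application/xhtml+xml",  # data types we can handle
        "Accept-Language": "en-US,en;q=0.8",
        "Referer": "http://www.example.com/",         # page we claim to come from
        "Connection": "close",
    }
    # A request body is only attached when form data is given;
    # urllib then switches the method from GET to POST.
    data = urlencode(form).encode("utf-8") if form else None
    return urllib.request.Request(url, data=data, headers=headers)


if __name__ == "__main__":
    req = build_request("http://www.example.com/news", {"page": 1})
    print(req.get_method(), req.full_url)
    print(req.header_items())
    print(req.data)
```

Passing the resulting object to `urllib.request.urlopen` would actually send it; that step is left out so the sketch stays offline.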

7. HTTP Response:

The server returns an HTTP response to the client. The response message includes the following components:

Status code: Tells the client the processing result of this request in "clear and unambiguous" language.

HTTP response status codes fall into 5 classes:

1xx Informational. Usually tells the client that the request has been received and is being processed, so don't worry...

2xx Success. Generally means: the request was received, I understand what you want, the request has been accepted, and processing is complete.

3xx Redirection. Tells the client to make another request elsewhere in order to complete the whole process.

4xx An error occurred in processing, and the responsibility lies with the client. For example, the client requested a resource that does not exist, the client is not authorized, or access is forbidden.

5xx An error occurred in processing, and the responsibility lies with the server. For example, the server threw an exception, a routing error occurred, or the HTTP version is not supported.
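A crawler often branches on the class of the status code rather than on the exact value. The sketch below shows one possible way to do that; the function name `status_class` and the class labels are chosen for this example, while `http.HTTPStatus` is the standard library's table of known codes.

```python
from http import HTTPStatus

STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code):
    """Map a status code to one of the five classes by its first digit."""
    return STATUS_CLASSES.get(code // 100, "unknown")


if __name__ == "__main__":
    for code in (200, 301, 404, 500):
        print(code, HTTPStatus(code).phrase, "->", status_class(code))
```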

Response header: Shows the details of the response.

Common response headers:

Location: The server uses this header to tell the browser where to redirect to
Server: The server uses this header to tell the browser the server model
Content-Encoding: The server uses this header to tell the browser the compression format of the data
Content-Length: The server uses this header to tell the browser the length of the data sent back
Content-Language: The server uses this header to tell the browser the language environment
Content-Type: The server uses this header to tell the browser the type of the data sent back
Refresh: The server uses this header to tell the browser to refresh at a set interval
Content-Disposition: The server uses this header to tell the browser to open the data as a download
Transfer-Encoding: The server uses this header to tell the browser that the data is sent back in chunks
Expires: -1, Cache-Control: no-cache, Pragma: no-cache: The server uses these three headers together to tell the browser not to cache the data

Response body: The data sent to the client in answer to the request information the client specified.
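To see how the three components sit together in one message, the following sketch takes a hand-written response and splits it into status code, headers, and body. The sample text in `RAW_RESPONSE` and the function name `parse_response` are made up for this illustration, and the parser is deliberately simplified (for instance, it does not handle repeated headers or chunked bodies).

```python
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Server: ExampleServer\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Content-Length: 12\r\n"
    "Cache-Control: no-cache\r\n"
    "\r\n"
    "<p>hello</p>"
)


def parse_response(raw):
    """Split a raw response into (status code, reason, headers, body)."""
    # An empty line separates the head of the message from the body.
    head, _sep, body = raw.partition("\r\n\r\n")
    status_line, _sep, header_block = head.partition("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_block.split("\r\n"):
        name, _colon, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return int(code), reason, headers, body


if __name__ == "__main__":
    code, reason, headers, body = parse_response(RAW_RESPONSE)
    print(code, reason)
    print(headers)
    print(body)
```

In practice a library such as `http.client` does this parsing; the manual version is only meant to show the layout of the message.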

2. HTTPS protocol

1. Official concept:

HTTPS (Hypertext Transfer Protocol Secure) is the secure hypertext transfer protocol. HTTPS adds an SSL encryption layer on top of HTTP and encrypts the data, making it the secure version of the HTTP protocol.

2. Vernacular concept:

HTTPS is the encrypted, secure version of the HTTP protocol.

3. Encryption technology used by HTTPS

3.1 SSL encryption technology

SSL relies on a technique called "shared key encryption", also known as "symmetric key encryption". It works like this: when the client sends a message to the server, it first encrypts the message with a known algorithm (a symmetric cipher such as AES). Decrypting the message requires a key, and because the same key is used for both encryption and decryption, the key itself has to be passed between the two sides, encrypted in transit. This looks safe, but it is still potentially dangerous: if the traffic is eavesdropped on or hijacked, the key may be cracked and the message deciphered. "Shared key encryption" therefore carries a security risk.
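The defining property of shared key encryption, one key for both directions, can be shown with a toy cipher. The sketch below is for illustration only and is NOT secure: it XORs the message with a keystream derived from the key using SHA-256, because the standard library has no real symmetric cipher such as AES. The names `keystream` and `xor_cipher` are invented for this example.

```python
import hashlib


def keystream(key, length):
    """Derive `length` pseudo-random bytes from the shared key (toy construction)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(out[:length])


def xor_cipher(key, data):
    """Encrypt or decrypt: XOR with the keystream is its own inverse."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


if __name__ == "__main__":
    shared_key = b"shared-secret"
    ciphertext = xor_cipher(shared_key, b"hello server")
    print(ciphertext.hex())
    print(xor_cipher(shared_key, ciphertext))   # same key recovers the message
```

Anyone who obtains `shared_key` can run the same function and read the message, which is exactly the risk described above.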

"Asymmetric encryption" uses two keys: one is called the "private key" and the other the "public key". With this method, the server first tells the client to encrypt its messages with the public key that the server provides. After the client has encrypted a message with the public key, the server receives it and decrypts it with its own private key. The advantage is that the key used for decryption is never transmitted at all, which avoids the risk of it being hijacked. Even if an eavesdropper obtains the public key, decryption remains hard, because it would require solving a difficult mathematical problem such as computing discrete logarithms, which cannot be done easily.

[Figure: schematic diagram of asymmetric encryption]
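The public key / private key split can be illustrated with textbook RSA on very small numbers. Two caveats: RSA rests on the difficulty of factoring rather than on discrete logarithms, so it is used here only as a familiar stand-in, and the tiny primes and missing padding make this sketch completely insecure. The function names are invented for this example, and `pow(e, -1, phi)` requires Python 3.8 or newer.

```python
def make_keypair(p=61, q=53, e=17):
    """Build a toy RSA key pair from two small primes (illustration only)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)          # private exponent: inverse of e modulo phi
    return (n, e), (n, d)        # (public key, private key)


def encrypt(public_key, message):
    """Anyone holding the public key can encrypt an integer smaller than n."""
    n, e = public_key
    return pow(message, e, n)


def decrypt(private_key, ciphertext):
    """Only the holder of the private key can decrypt."""
    n, d = private_key
    return pow(ciphertext, d, n)


if __name__ == "__main__":
    public_key, private_key = make_keypair()
    secret = 65
    ciphertext = encrypt(public_key, secret)
    print(ciphertext, decrypt(private_key, ciphertext))
```

Only `public_key` ever needs to leave the server; `private_key` stays where it was generated.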

        

But asymmetric key encryption also has the following shortcomings:

The first: when the receiving end sends its public key to the sender, how can the sender be sure that the key it receives is the one that was actually sent and has not been hijacked and replaced on the way? Whenever a key is sent, there is a risk of hijacking.

The second: asymmetric encryption is relatively inefficient and more complex to process, which slows down communication.

4. HTTPS certificate mechanism

We discussed the shortcomings of asymmetric encryption above. The first is that the public key may be hijacked, so there is no guarantee that the public key received by the client is the one issued by the server. This is where the public key certificate mechanism comes in. A digital certificate certification authority is a third-party organization trusted by both the client and the server. The certificate is passed along as follows:

1: The developer of the server takes the public key to a digital certificate certification authority and applies for a certificate. Once the authority has verified the applicant's identity and the application has passed review, it digitally signs the public key the developer submitted, then issues the signed public key, placed in a certificate and bound to it.

2: The server sends this digital certificate to the client. Because the client also trusts the certificate authority, it can use the digital signature in the certificate to verify that the public key really is the one passed by the server. In general, a certificate's digital signature is difficult to forge, and how far it can be trusted depends on the credibility of the certification authority. Once the information is confirmed to be correct, the client encrypts its message with the public key and sends it, and the server decrypts it with its own private key after receiving it.
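On the crawler side, this certificate check is what Python's standard `ssl` module performs by default. The sketch below only builds the context and inspects its settings, so it runs offline; the function name `make_verified_context` and the example URL in the comment are placeholders for this illustration.

```python
import ssl


def make_verified_context():
    """Create a TLS context that checks the server certificate and host name."""
    # create_default_context() loads the system's trusted certificate
    # authorities and turns on certificate and host name verification.
    return ssl.create_default_context()


if __name__ == "__main__":
    context = make_verified_context()
    print(context.verify_mode == ssl.CERT_REQUIRED)   # certificate must validate
    print(context.check_hostname)                     # name must match the certificate
    # To use it for a real request (not executed here):
    #   import urllib.request
    #   urllib.request.urlopen("https://www.example.com/", context=context)
```

Crawler tutorials sometimes switch verification off to get past certificate errors; doing so removes the protection described in this section.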
