Home >Technology peripherals >AI >Understand the Hash algorithm and application scenarios in one article
Both hash and hash come from the word hash, the former is a transliteration and the latter is a free translation. It is an algorithm that can map a binary value of any length into a fixed-length binary value. The mapped fixed-length binary value is called a hash value. An excellent hash algorithm needs to meet the following requirements:
cannot reversely deduce the original data from the hash value;
is very sensitive to the input data, and a different bit will cause hashing. The hash values are very different;
The probability of hash conflict must be very small;
The calculation process of the hash algorithm must be simple and efficient enough, so that even if the original data is very long, the hash value can be obtained quickly Hash value;
The more common hash encryption algorithms are MD5 (MD5 Message-Digest Algorithm, MD5 message Digest algorithm) and SHA (Secure Hash Algorithm, secure hash algorithm).
The plaintext password cannot be deduced from the hash value ciphertext, and the probability of hash conflict is relatively small. These two points ensure the reliability of the hash algorithm as a secure encryption method.
Why hash algorithms cannot completely avoid hash conflicts, but can only minimize them?
The pigeon nest principle tells us that if 11 pigeons fly into 10 pigeon cages, then there must be 2 or more pigeons in one pigeon cage. Then the hash value is of fixed length, which determines that the hash value can be exhausted, but in theory the original data is infinite, so it is possible to cause a hash conflict.
This application scenario uses the characteristics 1 and 3 of the hash algorithm. Among them, 3 ensures that the password is very difficult to be cracked in the forward direction (taking MD5 as an example, the hash value length is 128 bits, and there are 2 ^128 different hashes, very difficult to crack).
There is no absolute security in the security field. Although MD5 is difficult to crack, there are still ways to crack it. For example, using rainbow table matching can easily crack common passwords.
So generally we will use a salted hash algorithm for secure encryption. The salting method needs to be kept strictly confidential, which greatly increases the difficulty and cost of cracking.
When we verify whether two files are the same, we cannot simply judge by the file name. Because the existence of files with the same name is too common.
We can take some binary data from a large file according to specific rules, and use the hash algorithm to obtain the hash value as the unique identifier of the file. In this way, the same files must have the same hash value, that is, the same unique identifier; different files have a high probability of having different hash value unique identifiers;
Even if you really encounter a scattered If there is a column conflict, we can compare all the binary data of the two files in detail to further determine whether they are the same file. The probability of this event happening is too small. However, this solution ensures both efficiency and reliability.
This application scenario uses features 2 and 3 of the hash algorithm.
In the P2P download protocol, we will download different parts of the same movie from different machines, and then assemble the movie on our own machine. If there is an error in the downloading process of some part of the movie or the content is tampered with, it may cause downloading errors or viruses.
Therefore, we first perform hash calculation on all parts and save them in the seed file. After all parts are downloaded, we hash all parts to obtain the hash value, and then compare it with the one in the seed file to verify whether the file is complete.
This application scenario uses features 2 and 4 of the hash algorithm.
This scenario has been introduced before when talking about hash tables. In this scenario, the requirements for Feature 1 are not very high. The requirement for Feature 2 is that the hash values should be distributed as evenly as possible. Feature 3 can also accept conflicts to a certain extent, which can be solved by using the open addressing method and the zipper method. Feature 4 is more demanding and needs to pursue performance.
There are many load balancing algorithms, such as polling, random, weighted polling, etc., but the goal is to implement a session sticky load balancing algorithm, that is, the same All client requests during a session are routed to the same server.
We can hash the client's IP or session ID, and perform a modulo operation on the hash value and the number of servers. The final value is the server that needs routing, so that session stickiness can be achieved. purpose of stagnation.
When we need to process massive data, a single server cannot load and calculate such massive data, then we need to evenly distribute the massive data to N servers The server performs parallel computing. How to distribute the data evenly to N servers?
We perform a hash calculation on the data, and use the obtained hash value modulo the number of servers N. Data with the same result will be assigned to the same server and handed over to this server for processing. N servers process massive data in parallel and finally merge the results.
Store massive data in a distributed cache or distributed database. The idea of borrowing is similar to the above data sharding. However, what should be done when the number of originally set servers is not enough?
It cannot be solved by simply adding a few machines. This will destroy the modulo operation of the hash value, lead to cache penetration, and cause an avalanche effect. Likewise, the same problem can result when a machine fault is removed. At this time, we need to use consistent hashing algorithm to solve this problem.
The consistent hash algorithm is simply to construct a hash ring with 2^32 nodes on the ring, and hash the server IP and files to the corresponding nodes. The first server that all files encounter clockwise is the server where they are stored. In this way, when a server is added or deleted, the number of files affected can be controlled and will not cause a global avalanche.
hash ring
However, with a certain probability, when the server IP is mapped to the hash ring, the problem of hash ring skew will occur. This will lead to extremely uneven distribution of files on the server, degenerating into a scenario that easily causes an avalanche effect when adding or deleting servers at the beginning.
The skewness of the hash ring
We can artificially add a number of virtual nodes to these servers so that all server nodes are evenly distributed on the hash ring.
Hash ring with virtual nodes
The usage scenarios of Hash algorithm are far more than the above, there are also such as CRC check.
The above is the detailed content of Understand the Hash algorithm and application scenarios in one article. For more information, please follow other related articles on the PHP Chinese website!