Home  >  Article  >  Java  >  Use it every day! Do you know what HASH is?

Use it every day! Do you know what HASH is?

Java学习指南
Java学习指南forward
2023-07-26 14:47:45633browse

Have you ever used HashMap? So have you seriously understood what Hash is? This article will take you through an in-depth study of the Hash algorithm from a principle level.

1. The concept of hashing

The main idea of ​​the hashing method is to determine the storage address of the node based on the key code value: based on the key code The value K is an independent variable. Through a certain functional relationship h(K) (called a hash function), the corresponding function value is calculated. This value is interpreted as the storage address of the node, and the node is stored in this storage. in the unit. When retrieving, use the same method to calculate the address, and then go to the corresponding unit to get the node you are looking for. Nodes can be quickly retrieved through hashing. Hash (also called "hash") is an important storage method and a common retrieval method.

The storage structure constructed according to the hash storage method is called a hash table (hash table). A position in the hash table is called a slot. The core of hashing technology is the hash function. For any given dynamic lookup table DL, if an "ideal" hash function h and the corresponding hash table HT are selected, then for each data element X in DL. The function value h (X.key) is the storage location of X in the hash table HT. Data element X will be placed at this position when inserting (or creating a table), and will also be searched at this position when retrieving X. The storage location determined by the hash function is called the hash address. Therefore, the core of hashing is: the hash function determines the corresponding relationship between the key code value (X.key) and the hash address h (X.key). Through this relationship, organizational storage and retrieval are realized.

Generally, the storage space of the hash table is a one-dimensional array HT[M], and the hash address is the subscript of the array. The goal of designing a hashing method is to design a certain hash function h, 0<=h(K)

2. Hash function

In the following discussion, we assume that we are dealing with key codes with integer values. Otherwise, we can always create a key code with a normal value. One-to-one correspondence between integers, thereby converting the retrieval of the key into a retrieval of the corresponding positive integer; at the same time, it is further assumed that the value of the hash function falls between 0 and M-1. The principles for selecting a hash function are: the operation is as simple as possible; the value range of the function must be within the range of the hash table; the nodes should be distributed evenly as much as possible, that is, try to make different keys have different hash function values. Various factors need to be considered: key length, hash table size, key distribution, record retrieval frequency, etc. Below we introduce several commonly used hash functions.

1. Division by remainder method

#As the name suggests, division by remainder method is to divide the key code x by M (often taking the length of the hash table), And take the remainder as the hash address. The remainder method is almost the simplest hashing method, and the hash function is: h(x) = x mod M.

2. Multiply remainder method

When using this method, first multiply the key code by a constant A (0< A < 1), extract the decimal part of the product. Then, multiply this value by the integer n, round down the result, and use it as the hash address. The hash function is: hash (key) = _LOW(n × (A × key % 1)). Among them, "A × key % 1" means taking the decimal part of A × key, that is: A × key % 1 = A × key - _LOW(A × key), and _LOW(X) means taking the integer part of X.

##3. Square method

Since integer division usually runs slower than multiplication, consciously avoid using the remainder Legal operations can improve the running time of hashing algorithms. The specific implementation of the square-middle method is to first find the square value of the key code to expand the difference between similar numbers, and then take the middle few digits (often binary bits) according to the length of the table as the hash function value. Because the middle digits of a product are related to each digit of the multiplier, the resulting hash address is more uniform.

Use it every day! Do you know what HASH is?

4. Numerical analysis method

Assume that each keyword in the keyword set is It consists of s digits (u1, u2, …, us), analyzes the entire keyword set, and extracts several evenly distributed bits or their combination as addresses. The digital analysis method is a method of taking some digital bits with relatively uniform values ​​in the data element keywords as the hash address. That is, when the keyword has a lot of digits, you can analyze the bits of the keyword and discard the unevenly distributed bits as a hash value. It is only suitable when all keyword values ​​are known. By analyzing the distribution, the keyword value range is converted into a smaller keyword value range.

For example: To construct a hash table with the number of data elements n=80 and the hash length m=100. Without loss of generality, we only give 8 keywords for analysis here. The 8 keywords are as follows:

K1=61317602 K2=61326875 K3=62739628 K4=61343634

K5=62706815 K6=62774638 K7=61381262 K8=61394220

Analysis of the above 8 keywords shows that the values ​​​​of the 1st, 2nd, 3rd, and 6th digits from left to right of the keywords are relatively concentrated and should not be used as Haha. Hash address, the remaining 4th, 5th, 7th, and 8th bits have even values, and two of them can be selected as the hash address. Suppose the last two digits are selected as the hash address, then the hash addresses of these eight keywords are: 2, 75, 28, 34, 15, 38, 62, 20.

This method is suitable for: being able to estimate in advance the frequency of occurrence of various numbers on each digit of all keywords.

5. Base conversion method

Consider the key value as a number in another base system and then convert it into the original base number , and then select several of them as hash addresses.

Example Hash(80127429)=(80127429)13=8137 0136 1135 2134 7133 4132 2*131 9=(502432641)10 If you take the middle three digits as the hash value, you get Hash(80127429)=432 In order to obtain a good hash function, several methods can be used in combination, such as changing the base first, then folding or squaring, etc. As long as the hash is uniform, you can piece it together at will.

6. Folding method

#Sometimes the key code contains many digits, and it is too complicated to use the square-centering method, then you can The key code is divided into several parts with the same number of digits (the number of digits in the last part can be different), and then the superposition sum of these parts (carry out) is taken as the hash address. This method is called the folding method.

is divided into:

  • Shift superposition: Align and add the low bits of the divided parts.
  • Border Overlay: Fold back and forth from one end along the dividing boundary, then align and add.

3. Conflict resolution strategy

Although the goal of the hash function is to minimize conflicts, in fact conflicts are unavoidable. Therefore, we must look into conflict resolution strategies. Conflict resolution techniques can be divided into two categories: open hashing (also known as zipper method, separate chaining) and closed hashing method (closed hashing, also known as open address method, open addressing). The difference between these two methods is that the open hashing method stores the colliding key outside the main hash table, while the closed hashing method stores the colliding key in another slot in the table.

1. Separated linked list method

(1)Zipper method

A kind of hashing method The simple form is to define each slot in the hash table as the head of a linked list. All records that hash to a specific slot are put into the linked list for that slot. Figure 9-5 illustrates an open hash table in which each slot stores a record and a pointer to the rest of the linked list. These 7 numbers are stored in a hash table with 11 slots, and the hash function used is h(K) = K mod 11. The insertion order of numbers is 77, 7, 110, 95, 14, 75 and 62. There are 2 values ​​hashed to slot 0, 1 value hashed to slot 3, 3 values ​​hashed to slot 7, and 1 value hashed to slot 9.

Use it every day! Do you know what HASH is?

2. Closed hashing method (open address method)

The closed hashing method combines all Records are stored directly in the hash table. Each record key key has a base position calculated by a hash function, that is, h(key). If a key is to be inserted and another record already occupies the base position of R (a collision occurs), then R is stored at another address in the table, and the conflict resolution policy determines which address it is.

The basic idea of ​​closed hash table conflict resolution is: when a conflict occurs, use a certain method to generate a hash address sequence d0, d1, d2,...di,...dm for the key K -1. Among them, d0=h (K) is called the base address home position of K; all di (0< i< m) are subsequent hash addresses. When K is inserted, if the node at the base address is already occupied by other data elements, it will be probed in sequence according to the above address sequence, and the first open free position di found will be used as the storage location of K; if all subsequent hashes No addresses are free, indicating that the closed hash table is full and an overflow is reported. Correspondingly, when retrieving K, the sequence of successor addresses with the same value will be searched in sequence, and the position di will be returned when the retrieval is successful; if an open free address is encountered when retrieving along the probing sequence, it means that there is no key to be searched in the table code. When deleting K, we also search sequentially according to the sequence of successor addresses with the same value. If we find that a certain position di has the K value, then delete the data element at the position di (the deletion operation actually just marks the node for deletion); If an open free address is encountered, it means that there is no key to be deleted in the table. Therefore, for closed hash tables, the method of constructing a sequence of subsequent hash addresses is the method of handling conflicts.

Different methods of forming probes lead to different methods of resolving conflicts. Below are several common construction methods.

(1) Linear detection method

Treat the hash table as a ring table. If a conflict occurs at the base address d (i.e. h(K)=d) , then the following address units are explored in sequence: d 1, d 2,..., M-1, 0, 1,..., d-1 until a free address is found or the key code is found to the node of key. Of course, if you return to address d after searching along the probing sequence, it means failure whether you are doing an insertion operation or a retrieval operation. The probe function for simple linear probe is: p(K,i) = i

Example 9.7 It is known that a set of key codes is (26, 36, 41, 38, 44, 15, 68, 12, 06, 51, 25), the length of the hash table M = 15, and the linear exploration method is used to resolve the conflict structure A hash table of this set of keys. Because n=11, use the remainder method to construct a hash function and select the largest prime number P=13 less than M. Then the hash function is: h(key) = key. Insert each node in order: 26: h(26) = 0, 36: h(36) = 10, 41: h(41) = 2, 38: h(38) = 12, 44: h(44) = 5. When 15 is inserted, its hash address is 2. Since 2 is already occupied by the element with key code 41, it needs to be probed. According to the sequential probing method, it is obvious that 3 is an open free address, so it can be placed in unit 3. Similarly, 68 and 12 can be placed in units 4 and 13 respectively.

(2) Secondary exploration method

The basic idea of ​​the secondary exploration method is: The generated subsequent hash addresses are not consecutive, but skipped to leave room for subsequent data elements and thus reduce aggregation. The detection sequence of the secondary detection method is: 12, -12, 22, -22,... etc. That is to say, when a conflict occurs, the synonyms are hashed back and forth at both ends of the first address. The formula for finding the next open address is:

Use it every day! Do you know what HASH is?
Use it every day! Do you know what HASH is?

##(3) Random exploration method

The ideal probing function should randomly select the next position from an unvisited slot in the probing sequence, that is, the probing sequence should be a random permutation of the hash table positions. However, we cannot actually randomly select a position from the probe sequence because the same probe sequence cannot be established when retrieving the key. However, we can do something like pseudo-random probing. In pseudo-random probing, the i-th slot in the probing sequence is (h(K) ri) mod M, where ri is a "random" sequence of numbers between 1 and M - 1. All insertions and retrievals use the same "random" number. The probe function will be p(K,i) = perm[i - 1], where perm is an array of length M - 1 containing a random sequence of values ​​from 1 to M - 1.

Example:

For example, the known hash table length m=11, the hash function is: H (key) = key % 11, then H (47) = 3, H ( 26)=4, H(60)=5, assuming the next keyword is 69, then H(69)=3, which conflicts with 47. If linear detection and then hashing are used to handle conflicts, the next hash address is H1=(3 1)% 11 = 4, and there is still a conflict, and then the next hash address is H2=(3 2)% 11 = 5, still the same If there is a conflict, continue to find the next hash address as H3 = (3 3)% 11 = 6. At this time, there is no conflict. Fill 69 into unit 5, see Figure 8.26 (a). If you use secondary detection and hashing to handle conflicts, the next hash address is H1=(3 12)% 11 = 4. If there is still a conflict, find the next hash address H2=(3 - 12)% 11 = 2 , there is no conflict at this time, fill 69 into unit 2, see Figure 8.26 (b). If pseudo-random detection and then hashing are used to handle conflicts, and the pseudo-random number sequence is: 2, 5, 9,…….., then the next hash address is H1 = (3 2)% 11 = 5, and there is still a conflict. Find the next hash address as H2 = (3 5)% 11 = 8. At this time, there is no conflict. Fill 69 into unit 8, see Figure 8.26 (c).

Use it every day! Do you know what HASH is?

(4)Double hash detection method

Both pseudo-random probing and secondary probing can eliminate the problem of basic aggregation - that is, key codes with different base addresses, and some segments of their probing sequences overlap. However, if two keys hash to the same base address, the same probe sequence will still be obtained using both methods, and aggregation will still occur. This is because the probing sequence generated by pseudo-random probing and secondary probing is only a function of the base address, not a function of the original key value. This problem is called secondary clustering.

In order to avoid secondary aggregation, we need to make the exploration sequence a function of the original key value rather than a function of the base position. The double hash detection method uses the second hash function as a constant, skips the constant term each time, and performs linear detection.

The above is the detailed content of Use it every day! Do you know what HASH is?. For more information, please follow other related articles on the PHP Chinese website!

Statement:
This article is reproduced at:Java学习指南. If there is any infringement, please contact admin@php.cn delete