Home > Article > Backend Development > What is the data structure Hash table (hash table)? What are the specific operations?
1. What is a Hash table
If you want to know what a hash table is, you must first understand the hash function
Hash function:
Compared with the binary sorting tree, the binary balanced tree, the red-black tree B B tree discussed in the previous blog, their search is performed from the root node first, and the data or index is taken out from the node and the search value is performed. Compare. So, is there a function H? Based on this function and the search keyword key, the location of the search value can be directly determined without the need for comparison one by one. In this way, you can "know" the location of the key in advance, find the data directly, and improve efficiency.
That is, address index=H (key)
To put it bluntly, the hash function calculates the location where the address should be stored based on the key, and the hash table is a search based on the hash function. Table
2, construction method of hash function
According to previous experience, The following are the construction methods of commonly used hash functions:
Direct customization method
The hash function is a linear function of the keyword such as H (key) = a* key b
This construction method is relatively simple and uniform, but it has great limitations and is limited to the situation where the address size = keyword set
Usage example:
Assume that it is needed Statistics on the age distribution of China's population, with 10 as the smallest unit. This year is 2018, so those under 10 years old are distributed in 2008-2018, and those under 20 years old are distributed in 1998-2008... Assuming that 2018 represents the direct data of 2018-2008, then the keywords should be 2018, 2008, 1998...
Then the hash function H(key)=(2018-key)/10=201-key/10 can be constructed
Then the hash table is established as follows
index key age number of people (Assumed data)
0 2018 0-10 200W
1 2008 10-20 250W
2 1998 20-30 253W
3 1988 30- 40 300W
……
Number analysis method
Assume that each keyword key in the keyword set is composed of s digits (k1, k2 ,...,knk1,k2,...,kn), analyze the entire data in the key, and extract a number of evenly distributed bits or their combination to form the entire
Usage example
us Knowing that ID numbers are regular, now we want to store the ID numbers of students in a class. Assume that the students in this class were all born in the same area and in the same year, then the first digits of their ID cards are the same, then We can intercept the following different bits to store. Suppose there are 5 different bits, then use these five bits to represent the address.
H(key)=key 0000
This method is usually used when the number of digits is long. There must be certain rules in the numbers, and the distribution of the numbers must be known, such as the above For example, we know in advance that the students in this class were born in the same year and in the same area.
Squaring method
If each digit of a keyword has certain numbers that appear frequently, you can first find the square value of the keyword , expand the difference by squaring, and then take the middle digit as the final storage address.
Usage example
For example, key=1234 1234^2=1522756, take 227 as the hash address
For example, key=4321 4321^2=18671041, take 671 as the hash address
This method is suitable for situations where the data is not known in advance and the data length is small
Folding method
If the number has many digits, the number can be divided into several part, take their superimposed sum as the hash address
Usage example
For example key=123 456 789
We can store it in 61524, take the last three digits, and there is a position of 524
This method is suitable for numbers When there are many digits and the data distribution is not known in advance
The method of leaving the remainder is often used
H (key)=key MOD p (pVery Obviously, how to choose p is a key issue.
Usage example
For example, if we store 3 6 9, then p cannot be 3
Because 3 MOD 3 == 6 MOD 3 == 9 MOD 3
p should be a prime number no larger than m or a composite number without prime factors below 20, which can reduce address duplication (conflict)
For example, key = 7, 39, 18, 24, 33, when 21, take the table length m as 9 and p as 7, then store it as follows
Random number method
H(key)=Random( key)
Take the random function value of the keyword as its hash address
Considerations in the design of hash function
1. The time required to calculate the hash address (that is, the hash function itself should not be too complicated)
2. The length of the keyword
3. Table length
4. Whether the distribution of keywords is even and whether there are rules to follow
5. The hash function is designed to minimize conflicts while meeting the above conditions
3. Hash conflict
That is, different key values generate the same address, H (key1) = H (key2)
For example, when we store 3 6 9 above, p takes 3 as
3 MOD 3 == 6 MOD 3 == 9 MOD 3
At this time, hash conflicts occurred in 3, 6 and 9
Solution to hash conflicts
No matter how cleverly the hash function is designed, there will always be special keys that cause hash conflicts, especially for dynamic lookup tables.
The hash function has the following common methods to resolve conflicts
1. Open customization method
2. Chain address method
3. Public overflow area method
Create a special storage space to store conflicting data. This method is suitable for situations where there are few data and conflicts.
4. Rehash method
Prepare several hash functions. If a conflict occurs using the first hash function, use the second hash function. The two also conflict, use the third one...
Focus on the open customization method and the chain address method
Open customization method
First there is an H (key) Hash function
If H(key1)=H(keyi)
Then keyi storage location Hi=(H(key) di)MODmHi=(H(key) di)MODmm is the table Long
Note
Increment di should have the following characteristics (completeness): the generated Hi (address) are all different, and the generated s (m-1) Hi can cover all addresses in the hash table
* During square detection, the table length m must be a prime number of 4j 3 (square detection table length is limited)
* Random There is no common factor between m and di during detection (there are restrictions on random detection of di)
Examples of conflict resolution solutions for three open addressing methods
There is a set of data
19 01 23 14 55 68 11 86 37 is to be stored in an array with a table length of 11, where H (key) = key MOD 11
Then according to the above three conflict resolution methods, the storage process is as follows:
(Explanation of the table: Insert data from front to back. If the insertion position is already occupied and a conflict occurs, start a new row for the conflict and calculate the address until the address is available. Continue to start a new row for subsequent conflicts. Final result Take the top data (because it is the most "occupying" data))
Linear detection and then hashing
We take di=1, that is, storage after conflict At a position after the conflict, if there is still a conflict continue backward
Square detection and then hashing
Random detection in hashing (double detection and then hashing)
After conflict
H(key)'=(H(key) di)MOD m
In this example
H (key)=key MOD 11
We take di=key MOD 10 1
The result is as follows:
Chain address method
After a hash conflict occurs, add a pointer after the stored data to point to the conflicting data
In the above example, the chain address method is as follows:
4. Searching the hash table
The search process is the same as the table creation process. Assuming that the open addressing method is used to handle conflicts, then The search process is:
For a given key, calculate the hash address index = H (key)
If the value of the array arr [index] is empty, the search will be unsuccessful
If the array arr[index] == key, the search is successful
Otherwise, use the conflict resolution method to find the next address until arr[index] == key or arr[index] == null
It can be seen that no matter which function, the larger the loading factor, the larger the average search length, then the smaller the loading factor α, the better? No, just like a table length of 100 only stores one data, α is small, but the space utilization is not high. This is a trade-off between time and space. Under normal circumstances, α=0.75 is considered to be the situation with the highest comprehensive utilization efficiency of time and space.
The table above is particularly useful. Suppose I now have 10 data, want to use the chain address method to resolve conflicts, and require the average search length
, then there are 1 α/2
α
That is, n/m
m>10/2
m>5 That is, the chain address method is used to make the average search length
My blog has discussed the average search length of various trees before. They are all based on functions of storing data n, but hash tables are different. They are based on loading factors. That is to say, when When data n increases, I can increase the table length m to keep the load factor unchanged and ensure that ASL remains unchanged.
Then the structure of the hash table should be like this:
5. Deletion of the hash table
First The chain address method can directly delete elements, but the open addressing method cannot. Take the previous double detection and hashing as an example. If we delete element 1 and leave its position blank, then 23 will never be found. The correct approach should be to insert data that does not exist before deleting it, such as -1.
For more related questions, please visit the PHP Chinese website: PHP practical video tutorial
The above is the detailed content of What is the data structure Hash table (hash table)? What are the specific operations?. For more information, please follow other related articles on the PHP Chinese website!