c++ - 算法题：怎么区分互不相同的数对，hash还是？

Question

问题：有N个数对&lt;a, b&gt;，其中a的范围是[0, 100]，b的范围是[0, 255]，N &lt; 10000，如何区分这N对数对？ 我的需求最好是利用a和b的值通过某种算法产生一个数，每个数对产生不一样的值，这样就能相互区分了...

阿神 · Answer

101*256 > 5000 So it is impossible not to conflict when M<5000

So how to design a hash algorithm with minimal conflict? This involves the distribution of the subject's data, so let's first consider the distribution to be uniform.

In the case of uniform distribution, this algorithm is actually very easy to design, just take the remainder hash = (a * 256 + b) % M

According to the above hash algorithm, let’s calculate the probability of hash conflict between two random pairs of different numbers a1, b1 and a2, b2

Easy to prove a*256 + bThis value corresponds to the number pair within the ab range of the topic, and it is exactly [0, 25600+255]

Then the probability of this conflict is equivalent to [0, 25855]内两个不同的随机整数对M的同余的概率

There is no formula editor, and the probability theory is almost returned to the teacher. The subsequent calculations and proofs are omitted. In short, the subject of the question directly takes the remainder under the assumption of average distribution, which is the hash algorithm with the least conflict. . More universal hash algorithms such as md5 and sha1 are mainly designed to minimize conflicts when the input length is not fixed and is not necessarily evenly distributed.

According to the distribution of topic updates, I suggest that the value of M be a multiple of 257. This can ensure that or <0-100, y> will not conflict, but there will be a problem that will definitely conflict with . If you want to ensure that "there will be no conflict when walking diagonally", you need to try to map it

PHP中文网 · Answer

Algorithm: two-step hashing, your numbers are very standardized, and the ranges of a and b are given. Counting global hash, a has only 100 hash values, and b has 250, so there are only 100×250=25000 hash values in total, which exceeds your requirement of <5000. So just use two-step hashing.

Specific design: First design a mapping function for a, which can be mapped to a to [0,50), and b to [0,99], so that there are a total of 50*100=5000 need.

In addition, there is a rule in your numbers. When a is the same, there will be continuous values of b. This can be used when designing the hash function of b. Hash(b)=b%100 can be compared. It's good to map consecutive b to different keys. But if you use a hash function that takes the high bit, consecutive b's will be mapped to the same key, and the conflict will be relatively large.

update:

Especially, you can use bitmap to do this. Each is mapped to an integer of a256+b, so the actual total is [0,256101]. You can set a bitset< 256101+1> is enough. For each , set bitset[256a+b] to 1. The total space required is not large. Access is also constant time.

巴扎黑 · Answer

There is an O(n) algorithm and an O(2) algorithm, which can be positioned directly.

Isn’t this just a two-dimensional vector? Find the vector modulus and do a hash. When a conflict occurs, decide whether to take previous or next according to the direction.

For example:
1. Prepare a one-dimensional array to store node. This node has two pointers, one previous and one next, and then store the remaining part
2. For the given <3, 4> (assuming this), take the vector modulus = 5 = hashkey. Calculating the root is time-consuming, so there is no need to calculate the root. Our 'hashkey' will be = 25.
3. You can use any hash method to calculate the index of a one-dimensional array. I prefer modulo, so use 'hashkey% array length' to get an index.
4. If there is an empty node in the corresponding index, then put this node in. If it is not empty, then calculate the vector direction '4-3 > 0'. For nodes with a direction greater than 0, use next to point to the direction. If it is less than 0, use previous to refer to it.
5. Then continue to access previous or next according to the last calculated direction until you find it.

Because there is a direction, it is actually very fast to find the node, much faster than the general hash algorithm. Because after a collision occurs, the node list in the other direction does not need to be traversed directly.

If the module and direction are the same, it is easy to find the same node.

The O(2) algorithm is based on the above and further optimized. You can do another round of hashing based on the slope of the vector, then the second step will directly locate the unique vector, and there will be no conflict (vectors with the same module length and the same slope are the same vector, and the question has an implicit condition, all vectors The starting point is always the origin <0,0>). If a conflict does occur, congratulations, you have found the duplicate node. But I think one round of hashing is almost enough. After one traversal, all duplicate nodes can be found.

In the end it will probably look like this:
--------| <4,3> |
<1,1> | <3,4> | <1,3> |
----|

Horizontal lines represent spaces and have no meaning

For example, given a <4,3>, you will find that the position corresponding to 5 has <3,4>, and then you look at the direction and it is less than 0, then take the previous of that node of <3,4> The pointer looks up, eh, it instantly finds <4,3>~ Because there is a direction, does not need to be compared directly.

黄舟 · Answer

A number from 0 to 25855 can be used to express a (25855=256100+255, which is the maximum number of combinations). Suppose this number is x, then x=a256+b. Use a bool array mark[i] to indicate whether the number i has appeared. So we separated. I think this method has the least programming complexity.

If you need the highest efficiency, you should still use open hashing

高洛峰 · Answer

It is indeed what is said above hashed = a*256+b But the two people above and the questioner seem to have overlooked an issue. There are 10,000 numbers, but the hash range is only 5,000. Isn’t the conflict rate 200%?

c++ - 算法题：怎么区分互不相同的<a, b>数对，hash还是？

reply all(5)I'll reply