问题:有N个数对<a, b>,其中a的范围是[0, 100],b的范围是[0, 255],N < 10000,如何区分这N对数对?
我的需求最好是利用a和b的值通过某种算法产生一个数,每个数对产生不一样的值,这样就能相互区分了。
比如有下列这些数对:
<1, 2>
<2, 3>
<2, 1>
<7, 76>
<4, 2>
<45, 123>
...
等等,怎么设计一个利用a和b设计一个算法,来区分这些数对,比如a + b,当然我只是举个例子,简单的加减乘除都不行,因为有些数对比如<2, 1>与<1, 2>就不能区分了。
如果区分不了,那么能不能设计一个hash算法,使得这些数对全部映射到[0, M]之间,M最好<5000,这样就可能有不能区分的了,没关系,只要使得冲突最小也行。
重点:这样的需求不好解决问题,因为<a, b>是随机的,没什么分布规律。但<a, b>有一些分布特点:如果同一个a,b非常有可能是分段连续的,比如像下面这样:
<2, 1>
<2, 2>
<2, 3>
<2, 4>
<2, 5>
<2, 124>
<2, 125>
<2, 126>
<2, 127>
<2, 128>
<2, 129>
<2, 130>
...
我的要求:尽可能时间高效,比如用hash,一下就能判断。
当然有人会想到用最简单的map<int, int> m,m的大小就是数对的个数,但显然这样的方向时空性能都不太高,而且没有利用a和b的某些规律。
请大家发挥自己的想象力吧~~~
阿神2017-04-17 11:25:00
101*256 > 5000 So it is impossible not to conflict when M<5000
So how to design a hash algorithm with minimal conflict? This involves the distribution of the subject's data, so let's first consider the distribution to be uniform.
In the case of uniform distribution, this algorithm is actually very easy to design, just take the remainder hash = (a * 256 + b) % M
According to the above hash algorithm, let’s calculate the probability of hash conflict between two random pairs of different numbers a1, b1
and a2, b2
Easy to prove a*256 + b
This value corresponds to the number pair within the ab range of the topic, and it is exactly [0, 25600+255]
Then the probability of this conflict is equivalent to [0, 25855]内两个不同的随机整数对M的同余的概率
There is no formula editor, and the probability theory is almost returned to the teacher. The subsequent calculations and proofs are omitted. In short, the subject of the question directly takes the remainder under the assumption of average distribution, which is the hash algorithm with the least conflict. . More universal hash algorithms such as md5
and sha1
are mainly designed to minimize conflicts when the input length is not fixed and is not necessarily evenly distributed.
According to the distribution of topic updates, I suggest that the value of M be a multiple of 257. This can ensure that <x, 0-255>
or <0-100, y>
will not conflict, but there will be a problem that <x+1, y+1>
will definitely conflict with <x,y>
. If you want to ensure that "there will be no conflict when walking diagonally", you need to try to map it
PHP中文网2017-04-17 11:25:00
Algorithm: two-step hashing, your numbers are very standardized, and the ranges of a and b are given. Counting global hash, a has only 100 hash values, and b has 250, so there are only 100×250=25000 hash values in total, which exceeds your requirement of <5000. So just use two-step hashing.
Specific design: <a,b> First design a mapping function for a, which can be mapped to a to [0,50), and b to [0,99], so that there are a total of 50*100=5000 need.
In addition, there is a rule in your numbers. When a is the same, there will be continuous values of b. This can be used when designing the hash function of b. Hash(b)=b%100 can be compared. It's good to map consecutive b to different keys. But if you use a hash function that takes the high bit, consecutive b's will be mapped to the same key, and the conflict will be relatively large.
update:
Especially, you can use bitmap to do this. Each <a,b> is mapped to an integer of a256+b, so the actual total is [0,256101]. You can set a bitset< 256101+1> is enough. For each <a,b>, set bitset[256a+b] to 1. The total space required is not large. Access is also constant time.
巴扎黑2017-04-17 11:25:00
There is an O(n) algorithm and an O(2) algorithm, which can be positioned directly.
Isn’t this just a two-dimensional vector? Find the vector modulus and do a hash. When a conflict occurs, decide whether to take previous or next according to the direction.
For example:
1. Prepare a one-dimensional array to store node. This node has two pointers, one previous and one next, and then store the remaining part <a,b>
2. For the given <3, 4> (assuming this), take the vector modulus = 5 = hashkey. Calculating the root is time-consuming, so there is no need to calculate the root. Our 'hashkey' will be = 25.
3. You can use any hash method to calculate the index of a one-dimensional array. I prefer modulo, so use 'hashkey% array length' to get an index.
4. If there is an empty node in the corresponding index, then put this node in. If it is not empty, then calculate the vector direction '4-3 > 0'. For nodes with a direction greater than 0, use next to point to the direction. If it is less than 0, use previous to refer to it.
5. Then continue to access previous or next according to the last calculated direction until you find it.
Because there is a direction, it is actually very fast to find the node, much faster than the general hash algorithm. Because after a collision occurs, the node list in the other direction does not need to be traversed directly.
If the module and direction are the same, it is easy to find the same node.
The O(2) algorithm is based on the above and further optimized. You can do another round of hashing based on the slope of the vector, then the second step will directly locate the unique vector, and there will be no conflict (vectors with the same module length and the same slope are the same vector, and the question has an implicit condition, all vectors The starting point is always the origin <0,0>). If a conflict does occur, congratulations, you have found the duplicate node. But I think one round of hashing is almost enough. After one traversal, all duplicate nodes can be found.
In the end it will probably look like this:
--------| <4,3> |
<1,1> | <3,4> | <1,3> |
----|
For example, given a <4,3>, you will find that the position corresponding to 5 has <3,4>, and then you look at the direction and it is less than 0, then take the previous of that node of <3,4> The pointer looks up, eh, it instantly finds <4,3>~ Because there is a direction, <square root 5, square root 5> does not need to be compared directly.
黄舟2017-04-17 11:25:00
A number from 0 to 25855 can be used to express a <a,b> (25855=256100+255, which is the maximum number of combinations). Suppose this number is x, then x=a256+b. Use a bool array mark[i] to indicate whether the number i has appeared. So we separated. I think this method has the least programming complexity.
If you need the highest efficiency, you should still use open hashing
高洛峰2017-04-17 11:25:00
It is indeed what is said above hashed = a*256+b
But the two people above and the questioner seem to have overlooked an issue. There are 10,000 numbers, but the hash range is only 5,000. Isn’t the conflict rate 200%?