search
HomeBackend DevelopmentPHP TutorialProgrammer's Advanced Chapter: The Temperament of Hash Table

Mr. Zhang Yanpo joined Baishan Cloud Technology in 2016 and is mainly responsible for object storage research and development, data distribution across computer rooms, and repair and problem solving. With the goal of achieving 100PB-level data storage, he led the team to complete the design, implementation and deployment of a network-wide distributed storage system, separating "cold" and "hot" data, reducing the cost of cold data to 1.2 times the redundancy.

Mr. Zhang Yanpo worked for Sina from 2006 to 2015 and was responsible for the architecture design, collaboration process formulation, code specification and implementation standards formulation and most function implementation of Cross-IDC PB-level cloud storage service. Worked to support Sina's internal storage services such as Weibo, microdisk, video, SAE, music, software downloads, etc. From 2015 to 2016, he served as a senior technical expert at Meituan and designed a cross-machine room 100 PB object storage solution: Design and implement a high-concurrency and highly-reliable multi-copy replication strategy, and optimize erasure code to reduce 90% of IO overhead.

In software development, a hash table is equivalent to randomly placing n keys into b buckets to achieve storage of n data in b unit spaces.

We found some interesting phenomena in the hash table:

The distribution pattern of keys in the hash table

When the hash table When the number of keys and buckets is the same (n/b=1):

  • 37% of the buckets are empty

  • 37% of the buckets There is only 1 key in it

  • 26% of the buckets have more than 1 key (hash conflict)

The following figure shows it intuitively When n=b=20, the number of keys in each bucket in the hash table (sort the buckets according to the number of keys):

Programmers Advanced Chapter: The Temperament of Hash Table

Often we use the hash table The first impression is: if keys are randomly put into all buckets, the number of keys in the bucket will be relatively uniform, and the expected number of keys in each bucket is 1.

In fact, the distribution of keys in the bucket is very uneven when n is small; when n increases, it will gradually become even.

The impact of the number of keys on the number of buckets in category 3

The following table shows how the value of n/b affects category 3 when b remains unchanged and n increases. The proportion of the number of buckets (the conflict rate is the proportion of buckets containing more than 1 key):

Programmers Advanced Chapter: The Temperament of Hash Table

To be more intuitive, we use the following figure to show the empty bucket rate and The changing trend of conflict rate with n/b value:

Programmers Advanced Chapter: The Temperament of Hash Table

The impact of the number of keys on the uniformity of the bucket

The above sets of numbers It is a meaningful reference value when n/b is small, but as n/b gradually increases, the number of empty buckets and buckets with one key is almost 0, and most buckets contain multiple keys.

When n/b exceeds 1 (one bucket allows multiple keys to be stored), our main observation object becomes the distribution pattern of the number of keys in the bucket.

The following table shows the degree of unevenness when n/b is large and the number of keys in each bucket tends to be uniform.
In order to describe the degree of unevenness, we use the ratio between the maximum and minimum number of keys in the bucket ((most-fewest)/most) to express it.

The following table lists the distribution of keys as n increases when b=100.

Programmers Advanced Chapter: The Temperament of Hash Table

It can be seen that as the average number of keys in the bucket increases, the unevenness of its distribution gradually decreases.

和空bucket或1个key的bucket的占比不同n/b,均匀程度不仅取决于n/b的值,也受b值的影响,后面会提到。

未使用统计中常用的均方差法去描述key分布的不均匀程度,是因为软件开发过程中,更多时候要考虑最坏情况下所需准备的内存等资源。

Load Factor: n/b

A concept commonly used in hash tables, load factor α=n/b, is used to describe the characteristics of hash tables.

Usually, based on memory storage hash table, its n/b ≤0.75. This setting can not only save space, but also keep the key conflict rate relatively low. A low conflict rate means low frequency hash relocation, and the insertion of the hash table will be faster.

线性探测是一个经常被使用的解决插入时hash冲突的算法,它在1个bucket出现冲突时,按照逐步增加的步长顺序向后查看这个bucket后面的bucket,直到找到1个空bucket。因此它对hash的冲突非常敏感。

In this scenario of n/b=0.75, if linear detection is not used (such as using a linked list in the bucket to save multiple keys), about 47% of the buckets are empty; if linear detection is used , among these 47% buckets, about half of the buckets will be filled by linear detection.

在很多内存hash表的实现中,选择n/b=<p><strong>Hash table feature tips: </strong></p>
  • The hash table itself is an algorithm that exchanges a certain amount of space for efficiency. Low time overhead (O(1)), low space waste, and low conflict rate cannot be achieved at the same time;

  • hash tables are only suitable for the storage of pure memory data structures:

    • The hash table exchanges space for an increase in access speed; the waste of disk space is intolerable, but a small waste of memory is acceptable;

    • Hash tables are only suitable for storage media with fast random access. Data storage on the hard disk mostly uses btrees or other ordered data structures.

  • Most high-level languages ​​(built-in hash table, hash set, etc.) maintain n/b≤

  • The hash table will not distribute keys evenly when n/b is small.

Load Factor:n/b>1

另外一种hash表的实现,专门用来存储比较多的key,当 n/b>1n/b1.0时,线性探测失效(没有足够的bucket存储每个key)。这时1个bucket里不仅存储1个key,一般在一个bucket内用chaining,将所有落在这个bucket的key用链表连接起来,来解决冲突时多个key的存储。

链表只在n/b不是很大时适用。因为链表的查找需要O(n)的时间开销,对于非常大的n/b,有时会用tree替代链表来管理bucket内的key。

n/b值较大的使用场景之一是:将一个网站的用户随机分配到多个不同的web-server上,这时每个web-server可以服务多个用户。多数情况下,我们都希望这种分配能尽可能均匀,从而有效利用每个web-server资源。

这就要求我们关注hash的均匀程度。因此,接下来要讨论的是,假定hash函数完全随机的,均匀程度根据n和b如何变化。

n/b 越大,key的分布越均匀

当 n/b 足够大时,空bucket率趋近于0,且每个bucket中key的数量趋于平均。每个bucket中key数量的期望是:

avg=n/b

定义一个bucket平均key的数量是100%:bucket中key的数量刚好是n/b,下图分别模拟了 b=20,n/b分别为 10、100、1000时,bucket中key的数量分布。

Programmers Advanced Chapter: The Temperament of Hash Table
Programmers Advanced Chapter: The Temperament of Hash Table
Programmers Advanced Chapter: The Temperament of Hash Table

可以看出,当 n/b 增大时,bucket中key数量的最大值与最小值差距在逐渐缩小。下表列出了随b和n/b增大,key分布的均匀程度的变化:

Programmers Advanced Chapter: The Temperament of Hash Table

结论:

Programmers Advanced Chapter: The Temperament of Hash Table

计算

上述大部分结果来自于程序模拟,现在我们来解决从数学上如何计算这些数值。

每类bucket的数量

Programmers Advanced Chapter: The Temperament of Hash Table

空bucket数量

对于1个key, 它不在某个特定的bucket的概率是 (b−1)/b
所有key都不在某个特定的bucket的概率是( (b−1)/b)n

已知:

Programmers Advanced Chapter: The Temperament of Hash Table

空bucket率是:

Programmers Advanced Chapter: The Temperament of Hash Table

空bucket数量为:

Programmers Advanced Chapter: The Temperament of Hash Table

有1个key的bucket数量

n个key中,每个key有1/b的概率落到某个特定的bucket里,其他key以1-(1/b)的概率不落在这个bucket里,因此,对某个特定的bucket,刚好有1个key的概率是:

Programmers Advanced Chapter: The Temperament of Hash Table

刚好有1个key的bucket数量为:

Programmers Advanced Chapter: The Temperament of Hash Table

多个key的bucket

剩下即为含多个key的bucket数量:

Programmers Advanced Chapter: The Temperament of Hash Table

key在bucket中分布的均匀程度

类似的,1个bucket中刚好有i个key的概率是:n个key中任选i个,并都以1/b的概率落在这个bucket里,其他n-i个key都以1-1/b的概率不落在这个bucket里,即:

Programmers Advanced Chapter: The Temperament of Hash Table

这就是著名的二项式分布。

我们可通过二项式分布估计bucket中key数量的最大值与最小值。

通过正态分布来近似

当 n, b 都很大时,二项式分布可以用正态分布来近似估计key分布的均匀性:

p=1/b,1个bucket中刚好有i个key的概率为:

Programmers Advanced Chapter: The Temperament of Hash Table

1个bucket中key数量不多于x的概率是:

Programmers Advanced Chapter: The Temperament of Hash Table

所以,所有不多于x个key的bucket数量是:

Programmers Advanced Chapter: The Temperament of Hash Table

bucket中key数量的最小值,可以这样估算: 如果不多于x个key的bucket数量是1,那么这唯一1个bucket就是最少key的bucket。我们只要找到1个最小的x,让包含不多于x个key的bucket总数为1, 这个x就是bucket中key数量的最小值。

计算key数量的最小值x

一个bucket里包含不多于x个key的概率是:

Programmers Advanced Chapter: The Temperament of Hash Table

Φ(x) 是正态分布的累计分布函数,当x-μ趋近于0时,可以使用以下方式来近似:

Programmers Advanced Chapter: The Temperament of Hash Table

这个函数的计算较难,但只是要找到x,我们可以在[0~μ]的范围内逆向遍历x,以找到一个x 使得包含不多于x个key的bucket期望数量是1。

Programmers Advanced Chapter: The Temperament of Hash Table

x可以认为这个x就是bucket里key数量的最小值,而这个hash表中,不均匀的程度可以用key数量最大值与最小值的差异来描述: 因为正态分布是对称的,所以key数量的最大值可以用 μ + (μ-x) 来表示。最终,bucket中key数量最大值与最小值的比例就是:

Programmers Advanced Chapter: The Temperament of Hash Table

(μ是均值n/b)

程序模拟

以下python脚本模拟了key在bucket中分布的情况,同时可以作为对比,验证上述计算结果。

import sysimport mathimport timeimport hashlibdef normal_pdf(x, mu, sigma):
    x = float(x)
    mu = float(mu)

    m = 1.0 / math.sqrt( 2 * math.pi ) / sigma
    n = math.exp(-(x-mu)**2 / (2*sigma*sigma))return m * ndef normal_cdf(x, mu, sigma):
    # integral(-oo,x)

    x = float(x)
    mu = float(mu)
    sigma = float(sigma)    # to standard form
    x = (x - mu) / sigma

    s = x
    v = x    for i in range(1, 100):
        v = v * x * x / (2*i+1)
        s += v    return 0.5 + s/(2*math.pi)**0.5 * math.e ** (-x*x/2)def difference(nbucket, nkey):

    nbucket, nkey= int(nbucket), int(nkey)    # binomial distribution approximation by normal distribution
    # find the bucket with minimal keys.
    #
    # the probability that a bucket has exactly i keys is:
    #   # probability density function
    #   normal_pdf(i, mu, sigma)
    #
    # the probability that a bucket has 0 ~ i keys is:
    #   # cumulative distribution function
    #   normal_cdf(i, mu, sigma)
    #
    # if the probability that a bucket has 0 ~ i keys is greater than 1/nbucket, we
    # say there will be a bucket in hash table has:
    # (i_0*p_0 + i_1*p_1 + ...)/(p_0 + p_1 + ..) keys.
    p = 1.0 / nbucket
    mu = nkey * p
    sigma = math.sqrt(nkey * p * (1-p))

    target = 1.0 / nbucket
    minimal = mu    while True:

        xx = normal_cdf(minimal, mu, sigma)        if abs(xx-target) < target/10:            break

        minimal -= 1

    return minimal, (mu-minimal) * 2 / (mu + (mu - minimal))def difference_simulation(nbucket, nkey):

    t = str(time.time())
    nbucket, nkey= int(nbucket), int(nkey)

    buckets = [0] * nbucket    for i in range(nkey):
        hsh = hashlib.sha1(t + str(i)).digest()
        buckets[hash(hsh) % nbucket] += 1

    buckets.sort()
    nmin, mmax = buckets[0], buckets[-1]    return nmin, float(mmax - nmin) / mmaxif __name__ == "__main__":

    nbucket, nkey= sys.argv[1:]

    minimal, rate = difference(nbucket, nkey)    print &#39;by normal distribution:&#39;
    print &#39;     min_bucket:&#39;, minimal    print &#39;     difference:&#39;, rate

    minimal, rate = difference_simulation(nbucket, nkey)    print &#39;by simulation:&#39;
    print &#39;     min_bucket:&#39;, minimal    print &#39;     difference:&#39;, rate

The above is the detailed content of Programmer's Advanced Chapter: The Temperament of Hash Table. For more information, please follow other related articles on the PHP Chinese website!

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
php如何实现Redis的Hash操作php如何实现Redis的Hash操作May 30, 2023 am 08:58 AM

Hash操作//为hash表中的字段赋值。成功返回1,失败返回0。若hash表不存在会先创建表再赋值,若字段已存在会覆盖旧值。$ret=$redis->hSet(&#39;user&#39;,&#39;realname&#39;,&#39;jetwu&#39;);//获取hash表中指定字段的值。若hash表不存在则返回false。$ret=$redis->hGet(&#39;user&#39;,&#39;rea

2023过年,又限制放烟花?程序猿有办法!2023过年,又限制放烟花?程序猿有办法!Jan 20, 2023 pm 02:57 PM

本篇文章给大家介绍如何用前端代码实现一个烟花绽放的绚烂效果,其实主要就是用前端三剑客来实现,也就是HTML+CSS+JS,下面一起来看一下,作者会解说相应的代码,希望对需要的朋友有所帮助。

接私活挣钱!2023程序员接单平台大全!接私活挣钱!2023程序员接单平台大全!Jan 09, 2023 am 09:50 AM

上周我们做了一次关于《2023PHP创业》的公益直播,很多同学咨询具体有哪些接单平台,下面php中文网整理了22个还算靠谱的平台,以供参考!

程序员是做什么的程序员是做什么的Aug 03, 2019 pm 01:40 PM

程序员的工作职责:1、负责软件项目的详细设计、编码和内部测试的组织实施;2、协助项目经理和相关人员同客户进行沟通,保持良好的客户关系;3、参与需求调研、项目可行性分析、技术可行性分析和需求分析;4、熟悉并熟练掌握交付软件部开发的软件项目的相关软件技术;5、负责向项目经理及时反馈软件开发中的情况;6、参与软件开发和维护过程中重大技术问题的解决;7、负责相关技术文档的拟订等等。

Laravel开发:如何使用Laravel Hash生成密码散列?Laravel开发:如何使用Laravel Hash生成密码散列?Jun 17, 2023 am 10:59 AM

Laravel是目前最为流行的PHPweb框架之一,为开发人员提供了许多强大的功能和组件,其中LaravelHash也是其中之一。LaravelHash是一个用于密码散列的PHP库,其可以用于保护密码的安全,并使应用程序的用户数据更加安全。在本文中,我们将了解LaravelHash的工作原理以及如何使用它来对密码进行散列和验证。前置知识在学习Lara

520程序员专属浪漫表白方式!无法拒绝!520程序员专属浪漫表白方式!无法拒绝!May 19, 2022 pm 03:07 PM

520将至,年度虐汪大戏他又双叒叕来啦!想看看最理性的代码和最浪漫的告白究竟能碰撞出怎样的火花?下面带你逐一领略最全最完整的告白代码,看看程序员们的浪漫是否能够掳获各位心目中女神的芳心呢?

浅析怎么下载安装VSCode历史版本浅析怎么下载安装VSCode历史版本Apr 17, 2023 pm 07:18 PM

VSCode历史版本的下载安装 VSCode安装 下载 安装 参考资料 VSCode安装 Windows版本:Windows10 VSCode版本:VScode1.65.0(64位User版本) 本文

2022年最佳的Windows 11终端仿真器列表:Top 15款推荐2022年最佳的Windows 11终端仿真器列表:Top 15款推荐Apr 24, 2023 pm 04:31 PM

终端仿真器允许您模仿标准计算机终端的功能。有了它,您可以执行数据传输并远程访问另一台计算机。当与Windows11等高级操作系统结合使用时,这些工具的创造性可能性是无穷无尽的。但是,有很多第三方终端仿真器可用。因此,很难选择合适的。但是,正如我们对必备的Windows11应用所做的那样,我们选择了您可以使用的最佳终端并提高您的工作效率。我们如何选择最好的Windows11终端模拟器?在选择此列表中的工具之前,我们的专家团队首先测试了它们与Windows11的兼容性。我们还检查了他们

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
2 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
Repo: How To Revive Teammates
4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
Hello Kitty Island Adventure: How To Get Giant Seeds
3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

SublimeText3 Linux new version

SublimeText3 Linux new version

SublimeText3 Linux latest version

EditPlus Chinese cracked version

EditPlus Chinese cracked version

Small size, syntax highlighting, does not support code prompt function

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

Dreamweaver Mac version

Dreamweaver Mac version

Visual web development tools