What is the principle of Dict implementation in Python?-Python Tutorial-php.cn

Home

Backend Development

Python Tutorial

What is the principle of Dict implementation in Python?

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

May 19, 2023 pm 10:37 PM

pythondict

1. Implementation of unordered Dict

Dict can quickly find keys, thanks to its space-for-time strategy and hash table implementation. When reading and writing the Key, the Key will be hashed (so the Key is required to be an immutable type. If it is a variable type, its hash value cannot be calculated), and then based on the calculated value , and the current array space length is calculated moduloly, and the obtained value is the subscript of the current Key in the array. Finally, through the subscript, the value can be read with a time complexity of O(1). This implementation is great, and it is A common approach to distribution, but there are also problems. What should I do if the array is full or have different keys, but the hash results are the same?

The solution to the first problem is in the appropriate When the capacity is expanded, in Python, when the number placed in the Dict accounts for 2/3 of the capacity, the Dict will begin to expand. The total capacity after expansion is doubled before the expansion. This is to reduce Frequent expansion leads to more key migrations;

There are two solutions to the second problem:

Link method: The original array contains The value corresponding to the Key, and the array of the chaining method stores an array. This array stores an array containing the key and the corresponding value, as shown below. Assuming that the hash results of key1 and key2 are both 0, it will be inserted into In the 0 subscript of the array, key1 is at the first position of the 0 subscript array, and when key2 is inserted, it is found that key1 already exists. Then compare key2 with key1 and find that their keys are actually different, then Append at 0 subscript.

array = [
	[
    # 分别为key, hash值， 数值
		(&#39;key1&#39;, 123, 123),
		(&#39;key2&#39;, 123, 123)
	],
	[
		(&#39;key3&#39;, 123, 123)
	]
]

Development of addressing method: The development of addressing method adopts another idea, adopting the idea of borrowing, when inserting data, if a conflict is encountered Then use the next digit of the current subscript. If the next digit still conflicts, continue to use the next digit. When looking for data, the keys corresponding to the hash values will be compared. If there is a value and it does not match, find it. Next bit, until or until a vacancy is found.

The implementation of the above two solutions is very simple, and it is easy to know their advantages and disadvantages by comparison:

Advantages of the linked list method:

It is convenient to delete records. You can directly process the subarray corresponding to the subscript of the array.
The average search speed is fast. If there is a conflict, you only need to query the subarray

Disadvantages of the linked list method:

uses pointers, which results in slower query speed, higher memory usage, and is not suitable for serialization. The advantages and disadvantages of the open addressing method are the opposite of the linked list method. Since everything in Python is based on Dict and needs to be serialized, the open addressing method was chosen.

By comparison Both the linked list method and the open search method can be found. They are both solutions to hash conflicts. If the array storing data is large enough, then the possibility of hash conflicts will be very small, and there is no need to frequently expand and migrate data, but it will occupy more space. The space will be very large. Therefore, the initial value of a good hash table implementation cannot be too large. The initial value of Dict in Python is 8. In addition, the hash table also needs to allow the array of data to be stored. The unused vacancies remain fluctuating within a range, so that space usage and the probability of hash collisions are maintained at an optimal situation. However, since each expansion consumes a lot of performance, it cannot be changed every time. Expansion, so a value needs to be determined. When the unused/used ratio reaches this value, the capacity will be automatically expanded. In the Dict of Python, this value is 2/3. That is, when the Dict is used After 2/3 of the space is occupied, it will automatically expand to reach a new optimal balance. At the same time, in order to reduce the number of key migrations during each expansion, the total capacity after expansion must be one-third of the total capacity before expansion. times, in this case, only half the number of keys need to be migrated.

The reason why doubling the hash table will only migrate half of the keys is that the subscript of the key in the array is obtained by taking the hash value Modulo implementation, for example, a hash table capacity is 8, a key with a hash value of 20 takes a modulo value of 4, and after the hash table is expanded, the length becomes 16. At this time, the modulo result is still 4. A key with a hash value of 11 has a modulo value of 3, and after expansion, the modulo value is 11. It can be clearly seen that after the hash table is expanded, the keys whose hash value is greater than the original capacity do not need to be migrated, while the keys whose hash value is less than the capacity need to be migrated.

But if you follow the above statement, there is still a problem with the open addressing method, such as the following array:

arrray = [None, None, None, None, True, True, True, True]

以True代表当前数组的位置已经被占用, None代表未被占用, 可以看出当前并未满足使用了数组的2/3空间, 数组还未到扩容阶段。此时假设要插入一个Key, 刚好落在数组下标4, 那么插入的时候就要一直查询下去, 最后发现只有数组下标0的位置的空的, 才可以真正的插入数据. 对于一个长度只有8的数组, 需要执行5次才能插入数据, 那也太浪费性能了, 所以Python要实现一个算法, 尽量让冲突的数据插在别的地方. 在源码中, Python用到了公式x = ((5*y) + 1) % 2**i来实现跳跃插入冲突数据. 式子中x为数组的下一个坐标, y为当前发生冲突的坐标,i为容量系数, 比如初始化时, i为3, 那么容量就是8了, 第一次插入数据到坐标0冲突时, y = 0, 带入公式后, 求得x 等于1, 第二次插入数据到坐标0时, y = 1, 求得x 等于 6, 通过这样算下去, 可以发现数字生成轨迹是0>1>6>7>4>5>2>3>0一直循环着, 这样跳着插数据就能完美解决上面场景的问题.

2.有序Dict的原理

Python3.6之前说的差不多, 它的数组大概是长这样的, 数组中存了子数组, 第一项为hash值, 第二项为key, 第三项为值.

array = [
	[],
	[123456, "key1", 1],
	[],
	[],
	[],
	[234567, "key2", 2],
	[],
	[]
]

这种实现的空间大小在初始化时就固定了, 直到下次扩容再发送变化, 在遍历字典时, 实际上就是遍历数组, 只是把没有占用的空间进行跳过.可以看出这种遍历的生成的顺序只跟哈希结果相关, 无法跟插入顺序相关, 所以这种方法的实现是无序的(同时由于每次启动程序时，他们的哈希计算是不一样的，所以每次遍历的顺序也就各不相同了).

而在Python3.7之后, Python的Dict使用了新的数据结构, 使Python新Dict的内存占用也比老的Dict少了, 同时新的Dict在遍历时是跟插入顺序是一致的, 具体的实现是, 初始化时会生成两个数组, 插入值时, 在数组二追加当前的数据, 并获得当前追加数据所在的下标A, 然后对key进行哈希取模算出下标B, 最后对数组下标B的值更新为A, 简单的演示如下:

# 初始的结构
# -1代表还未插入数据
array_1 = [-1, -1, -1, -1, -1, -1, -1, -1]
array_2 = []


# 插入值后, 他就会变为:
array_1 = [-1, 0, -1, -1, -1, 1, -1, -1]
array_2 = [
	[123456, "key1", 1],
	[234567, "key2", 2],
]

可以看出, 数组2的容量跟当前放入的值相等的, 数组1虽然还会保持1/3的空闲标记位, 但他只保存数组二的下标, 占用空间也不多, 相比之前的方案会节省一些空间, 同时在遍历的时候可以直接遍历数组2, 这样Python的Dict就变为有序的了. 注: 为了保持操作高效, 在删除的时候, 只是把数组1的值设置为-2, 并把数组2对应的值设置为None, 而不去删除它, 在查找时会忽略掉数组1中值为-2的元素，在遍历时会忽略掉值为None的元素.

3.有序字典的实现

通过上面的了解后, 可以使用Python来写一个新版Dict的实现, 具体说明见注释:

from typing import Any, Iterable, List, Optional, Tuple


class CustomerDict(object):

    def __init__(self):
        self._init_seed: int = 3  # 容量因子
        self._init_length: int = 2 ** self._init_seed  # 初始化数组大小
        self._load_factor: float = 2 / 3  # 扩容因子
        self._index_array: List[int] = [-1 for _ in range(self._init_length)]  # 存放下标的数组
        self._data_array: List[Optional[Tuple[int, Any, Any]]] = []  # 存放数据的数组
        self._used_count: int = 0  # 目前用的量
        self._delete_count: int = 0  # 被标记删除的量

    def _create_new(self):
        """扩容函数"""
        self._init_seed += 1  # 增加容量因子
        self._init_length = 2 ** self._init_seed
        old_data_array: List[Tuple[int, Any, Any]] = self._data_array
        self._index_array: List[int] = [-1 for _ in range(self._init_length)]
        self._data_array: List[Tuple[int, Any, Any]] = []
        self._used_count = 0
        self._delete_count = 0

        # 这里只是简单实现, 实际上只需要搬运一半的数据
        for item in old_data_array:
            if item is not None:
                self.__setitem__(item[1], item[2])

    def _get_next(self, index: int):
        """计算冲突的下一跳，如果下标对应的值冲突了, 需要计算下一跳的下标"""
        return ((5*index) + 1) % self._init_length

    def _core(self, key: Any, default_value: Optional[Any] = None) -> Tuple[int, Any, int]:
        """获取数据或者得到可以放新数据的方法, 返回值是index_array的索引, 数据, data_array的索引"""
        index: int = hash(key) % (self._init_length - 1)
        while True:
            data_index: int = self._index_array[index]
            # 如果是-1则代表没有数据
            if data_index == -1:
                break
            # 如果是-2则代表之前有数据则不过被删除了
            elif data_index == -2:
                index = self._get_next(index)
                continue

            _, new_key, default_value = self._data_array[data_index]
            # 判断是不是对应的key
            if key != new_key:
                index = self._get_next(index)
            else:
                break
        return index, default_value, data_index

    def __getitem__(self, key: Any) -> Any:
        _, value, data_index = self._core(key)
        if data_index == -1:
            raise KeyError(key)
        return value

    def __setitem__(self, key: Any, value: Any) -> None:
        if (self._used_count / self._init_length) > self._load_factor:
            self._create_new()
        index, _, _ = self._core(key)

        self._index_array[index] = self._used_count
        self._data_array.append((hash(key), key, value))
        self._used_count += 1

    def __delitem__(self, key: Any) -> None:
        index, _, data_index = self._core(key)
        if data_index == -1:
            raise KeyError(key)
        self._index_array[index] = -2
        self._data_array[data_index] = None
        self._delete_count += 1

    def __len__(self) -> int:
        return self._used_count - self._delete_count

    def __iter__(self) -> Iterable:
        return iter(self._data_array)
    
    def __str__(self) -> str:
        return str({item[1]: item[2] for item in self._data_array if item is not None})

    def keys(self) -> List[Any]:
        """模拟实现keys方法"""
        return [item[1] for item in self._data_array if item is not None]

    def values(self) -> List[Any]:
        """模拟实现values方法"""
        return [item[2] for item in self._data_array if item is not None]

    def items(self) -> List[Tuple[Any, Any]]:
        """模拟实现items方法"""
        return [(item[1], item[2]) for item in self._data_array if item is not None]


if __name__ == &#39;__main__&#39;:
    customer_dict: CustomerDict = CustomerDict()
    customer_dict["demo_1"] = "a"
    customer_dict["demo_2"] = "b"
    assert len(customer_dict) == 2

    del customer_dict["demo_1"]
    del customer_dict["demo_2"]
    assert len(customer_dict) == 0

    for i in range(30):
        customer_dict[i] = i
    assert len(customer_dict) == 30

    customer_dict_value_list: List[Any] = customer_dict.values()
    for i in range(30):
        assert i == customer_dict[i]

    for i in range(30):
        assert customer_dict[i] == i
        del customer_dict[i]
    assert len(customer_dict) == 0

The above is the detailed content of What is the principle of Dict implementation in Python?. For more information, please follow other related articles on the PHP Chinese website!

Statement

This article is reproduced at:亿速云. If there is any infringement, please contact admin@php.cn delete

Python vs. C : Understanding the Key DifferencesApr 21, 2025 am 12:18 AM

Python and C each have their own advantages, and the choice should be based on project requirements. 1) Python is suitable for rapid development and data processing due to its concise syntax and dynamic typing. 2)C is suitable for high performance and system programming due to its static typing and manual memory management.

Python vs. C : Which Language to Choose for Your Project?Apr 21, 2025 am 12:17 AM

Choosing Python or C depends on project requirements: 1) If you need rapid development, data processing and prototype design, choose Python; 2) If you need high performance, low latency and close hardware control, choose C.

Reaching Your Python Goals: The Power of 2 Hours DailyApr 20, 2025 am 12:21 AM

By investing 2 hours of Python learning every day, you can effectively improve your programming skills. 1. Learn new knowledge: read documents or watch tutorials. 2. Practice: Write code and complete exercises. 3. Review: Consolidate the content you have learned. 4. Project practice: Apply what you have learned in actual projects. Such a structured learning plan can help you systematically master Python and achieve career goals.

Maximizing 2 Hours: Effective Python Learning StrategiesApr 20, 2025 am 12:20 AM

Methods to learn Python efficiently within two hours include: 1. Review the basic knowledge and ensure that you are familiar with Python installation and basic syntax; 2. Understand the core concepts of Python, such as variables, lists, functions, etc.; 3. Master basic and advanced usage by using examples; 4. Learn common errors and debugging techniques; 5. Apply performance optimization and best practices, such as using list comprehensions and following the PEP8 style guide.

Choosing Between Python and C : The Right Language for YouApr 20, 2025 am 12:20 AM

Python is suitable for beginners and data science, and C is suitable for system programming and game development. 1. Python is simple and easy to use, suitable for data science and web development. 2.C provides high performance and control, suitable for game development and system programming. The choice should be based on project needs and personal interests.

Python vs. C : A Comparative Analysis of Programming LanguagesApr 20, 2025 am 12:14 AM

Python is more suitable for data science and rapid development, while C is more suitable for high performance and system programming. 1. Python syntax is concise and easy to learn, suitable for data processing and scientific computing. 2.C has complex syntax but excellent performance and is often used in game development and system programming.

2 Hours a Day: The Potential of Python LearningApr 20, 2025 am 12:14 AM

It is feasible to invest two hours a day to learn Python. 1. Learn new knowledge: Learn new concepts in one hour, such as lists and dictionaries. 2. Practice and exercises: Use one hour to perform programming exercises, such as writing small programs. Through reasonable planning and perseverance, you can master the core concepts of Python in a short time.

Python vs. C : Learning Curves and Ease of UseApr 19, 2025 am 12:20 AM

Python is easier to learn and use, while C is more powerful but complex. 1. Python syntax is concise and suitable for beginners. Dynamic typing and automatic memory management make it easy to use, but may cause runtime errors. 2.C provides low-level control and advanced features, suitable for high-performance applications, but has a high learning threshold and requires manual memory and type safety management.

See all articles

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Assassin's Creed Shadows: Seashell Riddle Solution

3 weeks agoByDDD

What's New in Windows 11 KB5054979 & How to Fix Update Issues

2 weeks agoByDDD

Where to find the Crane Control Keycard in Atomfall

3 weeks agoByDDD

Roblox: Dead Rails - How To Complete Every Challenge

4 weeks agoByDDD

Atomfall guide: item locations, quest guides, and tips

4 weeks agoByDDD

Hot Tools

SAP NetWeaver Server Adapter for Eclipse

Integrate Eclipse with SAP NetWeaver application server.

VSCode Windows 64-bit Download

A free and powerful IDE editor launched by Microsoft

SecLists

SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.

Notepad++7.3.1

Easy-to-use and free code editor

Safe Exam Browser

Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.

Hot Topics

Where is the login entrance for gmail email?

7652

CakePHP Tutorial

1393

What is the format of the account name of steam

win11 activation key permanent

nyt mini crossword answers

110