
DDD

Oct 05, 2024 pm 10:10 PM

[Python] How do we lazyload a Python module? - analyzing LazyLoader from MLflow

(image source: https://www.irasutoya.com/2019/03/blog-post_72.html)

Intro

One day I was browsing a few popular ML libraries in Python, including MLflow. While glancing at its source code, one class caught my interest: LazyLoader in __init__.py (this is actually mirrored from the wandb project, but the original code has since changed from what MLflow is using now, as you can see).

You have probably heard of lazyloading in many contexts, such as image loading in web frontends, caching strategies, and so on. I think the essence of all those lazyloading concepts is "I am too lazy to load RIGHT NOW" (yes, note the hidden words "right now"). Namely, the application loads and uses a resource only when it is needed. So here in MLflow, a module is loaded only when something in it (a variable, function, or class) is accessed.

But HOW? This was my main interest. So I read the source code, which looked very simple at first glance. However, surprisingly, it took a bit of time to understand how it works, and I learned a lot from reading the code. This article analyzes this MLflow source code so that we understand how such lazyloading works using several features of the Python language.

Playing around with LazyLoader

For the purpose of our analysis, I created a simple package called lazyloading on my local machine, and placed modules as follows:


lazyloading/
├─ __init__.py
├─ __main__.py
├─ lazy_load.py
├─ heavy_module.py

__init__.py: This file makes the entire directory into a package.
__main__.py: This file is the entry point when we want to run the entire package as follows: python -m lazyloading.
lazy_load.py: LazyLoader is in this file.
heavy_module.py: This represents a module with heavy packages to be loaded (such as PyTorch) for a simulation:


import time

for i in range(5):
    time.sleep(1)
    print(5 - i, " seconds left before loading")

print("I am heavier than Pytorch!")

HEAVY_ATTRIBUTE = "heavy"

Next, we import this heavy_module inside __main__.py:


if __name__ == "__main__":
    from lazyloading import heavy_module

Let’s run this package and see the result:


python -m lazyloading
5  seconds left before loading
4  seconds left before loading
3  seconds left before loading
2  seconds left before loading
1  seconds left before loading
I am heavier than Pytorch!

Here we can clearly see that simply importing a heavy package such as PyTorch adds startup overhead to the entire application, even if the package is never used. That’s why we need lazyloading here. Let’s change __main__.py to look like this:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")
    print("nothing happens yet")
    print(heavy_module.HEAVY_ATTRIBUTE)

And the result should be:


python -m lazyloading
nothing happens yet
5  seconds left before loading
4  seconds left before loading
3  seconds left before loading
2  seconds left before loading
1  seconds left before loading
I am heavier than Pytorch!
heavy

Yes, a module wrapped by LazyLoader is not executed, and the packages it imports are not loaded, when the proxy is created. That happens only when an attribute of the module is first accessed. This is the power of lazyloading!
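The behavior above can be reproduced in a single file. The following is a minimal sketch, not MLflow's implementation: TinyLazyModule and demo_heavy_module are hypothetical names, and the "heavy" module is a throwaway file written to a temporary directory so that the example does not depend on the lazyloading package.

```python
import importlib
import sys
import tempfile
import types
from pathlib import Path

# Create a throwaway module on disk; "demo_heavy_module" is a made-up name.
tmp_dir = tempfile.mkdtemp()
Path(tmp_dir, "demo_heavy_module.py").write_text('HEAVY_ATTRIBUTE = "heavy"\n')
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()


class TinyLazyModule(types.ModuleType):
    """Simplified stand-in for LazyLoader: imports the target on the first attribute miss."""

    def __getattr__(self, item):
        # Only reached when normal lookup fails, i.e. before the real module is loaded.
        module = importlib.import_module(self.__name__)
        self.__dict__.update(module.__dict__)
        return getattr(module, item)


proxy = TinyLazyModule("demo_heavy_module")
loaded_before = "demo_heavy_module" in sys.modules  # creating the proxy imports nothing
value = proxy.HEAVY_ATTRIBUTE  # the first attribute access triggers the import
loaded_after = "demo_heavy_module" in sys.modules

print(loaded_before, value, loaded_after)
```

Creating the proxy costs almost nothing; the real import is paid for only at the first attribute access.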

How does LazyLoader work in MLflow? - source code analysis

The code itself is short and simple. I added type annotations and a few comments (lines enclosed in <>) for explanation. All the other comments are the ones in the original source code.


"""Utility to lazy load modules."""
import importlib
import sys
import types

from typing import Any, TypeVar

T = TypeVar("T") # <this is added by me>

class LazyLoader(types.ModuleType):
    """Class for module lazy loading.

    This class helps lazily load modules at package level, which avoids pulling in large
    dependencies like `tensorflow` or `torch`. This class is mirrored from wandb's LazyLoader:
    https://github.com/wandb/wandb/blob/79b2d4b73e3a9e4488e503c3131ff74d151df689/wandb/sdk/lib/lazyloader.py#L9
    """

    _local_name: str # <the name of the package that is used inside the code>
    _parent_module_globals: dict[str, types.ModuleType] # <namespace of the importing module, accessible by calling globals()>
    _module: types.ModuleType | None # <the actual module>

    def __init__(
        self,
        local_name: str,
        parent_module_globals: dict[str, types.ModuleType],
        name: Any # <to be used in types.ModuleType; the full package name, such as pkg.subpkg.subsubpkg>
    ):
    ):
        self._local_name = local_name
        self._parent_module_globals = parent_module_globals
        self._module = None

        super().__init__(str(name)) 

    def _load(self) -> types.ModuleType:
        """Load the module and insert it into the parent's globals."""
        if self._module:
            # If already loaded, return the loaded module.
            return self._module

        # Import the target module and insert it into the parent's namespace

        # <see https:>
        # <absolute import: imports the module itself from a package, rather than only the top-level package as __import__ does>
        # <here self.__name__ is the name variable in __init__>
        # <this is why name in __init__ must be the full module path>
        module = importlib.import_module(self.__name__) # this automatically updates sys.modules

        # <add the name of the module to the importing module's namespace>
        # <so that you can use this module name as a variable inside the importing module, even if it is called inside a function defined in the importing module>
        self._parent_module_globals[self._local_name] = module

        # <add the module to the list of loaded modules for caching>
        # <see https:>
        # <this makes it possible to import the cached module with the variable _local_name>
        sys.modules[self._local_name] = module

        # Update this object's dict so that if someone keeps a reference to the `LazyLoader`,
        # lookups are efficient (`__getattr__` is only called on lookups that fail).
        self.__dict__.update(module.__dict__)

        return module

    def __getattr__(self, item: T) -> T:
        module = self._load()
        return getattr(module, item)

    def __dir__(self):
        module = self._load()
        return dir(module)

    def __repr__(self):
        if not self._module:
            return f"<module '{self.__name__} (Not loaded yet)'>"
        return repr(self._module)

Now, let’s investigate the code while lazyloading our heavy_module. Since we don’t need to simulate the heaviness of the module anymore, let’s get rid of the time.sleep(1) loop part.

1. Creating an instance of LazyLoader, proxying the original module

Let’s look at __init__() of LazyLoader.


class LazyLoader(types.ModuleType):
    # …
    # code omitted
    # …

    def __init__(
        self, 
        local_name: str, 
        parent_module_globals: dict[str, types.ModuleType], 
        name: Any # <to be used in types.ModuleType; the full package name, such as pkg.subpkg.subsubpkg>
    ):
        self._local_name = local_name
        self._parent_module_globals = parent_module_globals
        self._module = None

        super().__init__(str(name))

We provide local_name, parent_module_globals, and name to the constructor __init__(). At the moment we are not sure what all of these mean, but the last line, super().__init__(str(name)), at least indicates that we are actually creating a module, since LazyLoader inherits from types.ModuleType. Because we pass name to it, the object created by LazyLoader is recognized as a module named name (which is the same as heavy_module.__name__).

Printing out the module itself proves this:


# __main__.py
# run python -m lazyloading

if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")

    print(heavy_module.__name__)

which gives on our terminal:


lazyloading.heavy_module
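As a small aside, the same thing can be checked without LazyLoader at all. This is a minimal sketch using only types.ModuleType from the standard library; the name pkg.subpkg.demo and the Proxy class are made up for illustration, and no such package has to exist.

```python
import inspect
import types

# "pkg.subpkg.demo" is a made-up name; nothing is imported or looked up on disk.
empty_module = types.ModuleType("pkg.subpkg.demo")

is_module = inspect.ismodule(empty_module)
module_name = empty_module.__name__


class Proxy(types.ModuleType):
    """An instance of a ModuleType subclass is still a module object."""


proxy = Proxy("pkg.subpkg.demo")
proxy_is_module = isinstance(proxy, types.ModuleType)
has_heavy_attribute = hasattr(proxy, "HEAVY_ATTRIBUTE")  # the namespace is still empty

print(is_module, module_name, proxy_is_module, has_heavy_attribute)
```

So the constructor gives us a module object with the right name and an otherwise empty namespace.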

However, in the constructor we only assigned values to the instance variables and gave the name of the module to this proxy module. Now, what happens when we try to access an attribute of the module?

2. Accessing an attribute - __getattribute__, __getattr__, and getattr

This is one of the fun parts of this class. What happens when we access an attribute of a Python object in general? Say we access HEAVY_ATTRIBUTE of heavy_module by calling heavy_module.HEAVY_ATTRIBUTE. From the code here, or from your own experience in several Python projects, you might guess that __getattr__() is called, and that’s partially correct. Look at the official docs:

Called when the default attribute access fails with an AttributeError (either __getattribute__() raises an AttributeError because name is not an instance attribute or an attribute in the class tree for self; or __get__() of a name property raises AttributeError).

(Please ignore __get__ because it is out of scope of this post, and our LazyLoader doesn’t implement __get__ either).

So the key method here is __getattribute__(). According to the docs, when we try to access an attribute, __getattribute__ is called first, and if the attribute we’re looking for cannot be found by __getattribute__, AttributeError is raised, which in turn invokes our __getattr__ in the code. To verify this, let’s override __getattribute__ of the LazyLoader class, and change __getattr__() a little bit as follows:


def __getattribute__(self, name: str) -> Any:
    try:
        print(f"__getattribute__ is called when accessing attribute '{name}'")
        return super().__getattribute__(name)

    except Exception as error:
        print(f"an error has occurred when __getattribute__() is invoked as accessing '{name}': {error}")
        raise

def __getattr__(self, item: T) -> T:
    print(f"__getattr__ is called when accessing attribute '{item}'")
    module = self._load()
    return getattr(module, item)

When we access HEAVY_ATTRIBUTE that exists in heavy_module, the result is:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")

    print(heavy_module.HEAVY_ATTRIBUTE)


python -m lazyloading
__getattribute__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
an error has occurred when __getattribute__() is invoked as accessing 'HEAVY_ATTRIBUTE': module 'lazyloading.heavy_module' has no attribute 'HEAVY_ATTRIBUTE'
__getattr__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
__getattribute__ is called when accessing attribute '_load'
__getattribute__ is called when accessing attribute '_module'
__getattribute__ is called when accessing attribute '__name__'
I am heavier than Pytorch!
__getattribute__ is called when accessing attribute '_parent_module_globals'
__getattribute__ is called when accessing attribute '_local_name'
__getattribute__ is called when accessing attribute '__dict__'
heavy

So __getattr__ is not the first method called. __getattribute__ is called first, and it raises AttributeError because our LazyLoader instance doesn’t have the attribute HEAVY_ATTRIBUTE. __getattr__() is then called as a fallback. Inside it we meet getattr(): the line getattr(module, item) is equivalent to module.HEAVY_ATTRIBUTE when item holds the string "HEAVY_ATTRIBUTE". So eventually we access HEAVY_ATTRIBUTE in the actual module heavy_module, provided that the module variable in __getattr__() is correctly imported and returned by self._load().

But before we move on to investigating _load() method, let’s call HEAVY_ATTRIBUTE once again in __main__.py and run the package:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")

    print(heavy_module.HEAVY_ATTRIBUTE)
    print(heavy_module.HEAVY_ATTRIBUTE)

Now we see the additional logs on the terminal:


# … the same log as above
__getattribute__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
heavy

It seems that __getattribute__ can now find HEAVY_ATTRIBUTE inside the proxy module (our LazyLoader instance). This is because (!!!spoiler alert!!!) _load copies the attributes of the actual module into the __dict__ attribute of the LazyLoader instance. We’ll get back to this in the next section.

3. Loading and caching the actual module

This section covers the core part of the post: loading the actual module in the function _load().

3-1. Module caching at the level of LazyLoader class

First, it checks whether our LazyLoader instance has already imported the module before (which reminds us of the Singleton pattern).


if self._module:
    # If already loaded, return the loaded module.
    return self._module

3-2. Importing the actual module with importlib.import_module

Otherwise, the method tries to import the module named self.__name__, which was set from the name argument in the __init__ constructor:


# <see https:>
# <absolute import: imports the module itself from a package, rather than only the top-level package as __import__ does>
# <here self.__name__ is the name variable in __init__>
# <this is why name in __init__ must be the full module path>
module = importlib.import_module(self.__name__) # this automatically updates sys.modules

According to the docs of importlib.import_module, when we don’t provide the package argument and pass only the module path string, the function performs an absolute import. Therefore, when we create a LazyLoader instance, the name argument should be the absolute module path. You can run your own experiment to see that a non-absolute name raises ModuleNotFoundError:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("heavy_module", globals(), "heavy_module")

    print(heavy_module.HEAVY_ATTRIBUTE)


# logs omitted
ModuleNotFoundError: No module named 'heavy_module'
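To see the difference between absolute and relative names without our package, here is a small sketch that uses the standard library's json package as a stand-in. The name decoder_that_does_not_exist_demo is invented on purpose so that the lookup fails.

```python
import importlib

# Absolute import: the full dotted path is enough.
absolute = importlib.import_module("json.decoder")

# Relative import: the leading dot is resolved against the `package` anchor.
relative = importlib.import_module(".decoder", package="json")
same_module = absolute is relative

# A relative name without `package` cannot be resolved at all.
try:
    importlib.import_module(".decoder")
    relative_error = None
except TypeError as error:
    relative_error = type(error).__name__

# A bare name is treated as a top-level module; this made-up one does not exist.
try:
    importlib.import_module("decoder_that_does_not_exist_demo")
    missing_name = None
except ModuleNotFoundError as error:
    missing_name = error.name

print(same_module, relative_error, missing_name)
```

LazyLoader never passes a package anchor, which is why only the full dotted path works as its name argument.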

Notably, invoking importlib.import_module(self.__name__) caches the module under the name self.__name__ in sys.modules, the interpreter-wide module cache. If you run the following lines in __main__.py


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("heavy_module", globals(), "lazyloading.heavy_module")

    # check whether the module is cached at the global scope
    import sys
    print("lazyloading.heavy_module" in sys.modules)

    # accessing any attribute to load the module
    heavy_module.HEAVY_ATTRIBUTE

    print("lazyloading.heavy_module" in sys.modules)

and run the package, then the logs should be:


python -m lazyloading
False
I am heavier than Pytorch!
True
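A consequence of this cache is that a second import of the same name returns the very same module object instead of executing the file again. The sketch below illustrates this with a throwaway module written to a temporary directory; demo_cached_module is a hypothetical name.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Throwaway module on disk; "demo_cached_module" is a made-up name.
tmp_dir = tempfile.mkdtemp()
Path(tmp_dir, "demo_cached_module.py").write_text("VALUE = 1\n")
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()

cached_before = "demo_cached_module" in sys.modules

first = importlib.import_module("demo_cached_module")
first.VALUE = 2  # mutate the loaded module

second = importlib.import_module("demo_cached_module")

same_object = first is second                        # served from sys.modules
cached_object = sys.modules["demo_cached_module"] is first
value_after_second_import = second.VALUE             # the file was not executed again

print(cached_before, same_object, cached_object, value_after_second_import)
```

If the file had been executed a second time, VALUE would have been reset to 1.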

This way of caching using sys.modules is related to the next two lines that also cache the module in different ways.

3-3. Caching the module with the given local_name


# <add the name of the module to the importing module's namespace>
# <so that you can use this module name as a variable inside the importing module, even if it is called inside a function defined in the importing module>
self._parent_module_globals[self._local_name] = module

# <add the module to the list of loaded modules for caching>
# <see https:>
# <this makes it possible to import the cached module with the variable _local_name>
sys.modules[self._local_name] = module

Both lines cache the module, in the dictionaries self._parent_module_globals and sys.modules respectively, but under the key self._local_name (not self.__name__). This is the value we provided as local_name when creating this proxy module instance with __init__(). But what does this caching accomplish?

First, we can use the module under the given _local_name in the globals of the "parent module" (from the parameter’s name and from how MLflow uses it in its uppermost __init__.py, we can infer that globals here means the result of globals()). This means that creating the proxy inside a function doesn’t prevent the module from being used outside the function’s scope:

if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader

    def load_heavy_module() -> None:
        # import the module inside a function
        heavy_module = LazyLoader("heavy_module", globals(), "lazyloading.heavy_module")
        print(heavy_module.HEAVY_ATTRIBUTE)

    # loads the heavy_module inside the function's scope
    load_heavy_module()

    # the module is now in the scope of this module
    print(heavy_module)

Running the package gives:


python -m lazyloading
I am heavier than Pytorch!
heavy
<module 'lazyloading.heavy_module' from '...'> # the path of the heavy_module (a Python file)

Of course, if you pass locals() as the second argument instead, you’ll get a NameError (give it a try!).
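The difference between the two dictionaries can be shown without any module loading. In this minimal sketch, the function names and the injected keys are made up; only the built-ins globals() and locals() are real.

```python
def define_in_globals() -> None:
    # globals() is the namespace of the module that defines this function,
    # so this assignment creates a module-level name.
    globals()["injected_global"] = "visible at module level"


def define_in_locals() -> None:
    # Writing to the dict returned by locals() inside a function
    # does not create a module-level name.
    locals()["injected_local"] = "gone after the call"


define_in_globals()
define_in_locals()

has_global = "injected_global" in globals()
has_local = "injected_local" in globals()

print(has_global, has_local)
```

This is the same mechanism that lets a LazyLoader created inside a function publish the real module under local_name in the enclosing module.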

Second, we can also import the module in any other place inside the whole package with the given local name. Let’s create another module heavy_module_loader.py inside the current package lazyloading :


lazyloading/
├─ __init__.py
├─ __main__.py
├─ lazy_load.py
├─ heavy_module.py
├─ heavy_module_loader.py

Note that I used a custom name heavy_module_local for the local variable name of the proxy module.


# heavy_module_loader.py

from lazyloading.lazy_load import LazyLoader

heavy_module = LazyLoader("heavy_module_local", globals(), "lazyloading.heavy_module")
heavy_module.HEAVY_ATTRIBUTE

Now let __main__.py be simpler:


from lazyloading import heavy_module_loader

if __name__ == "__main__":
    import heavy_module_local
    print(heavy_module_local)

Your IDE will probably flag this import as unresolved, but actually running it gives us the expected result:


python -m lazyloading
I am heavier than Pytorch!
<module 'lazyloading.heavy_module' from '...'> # the path of the heavy_module (a Python file)

Although MLflow seems to use the same string value for both local_name and name when creating LazyLoader instances, we can use the local_name as an alias for the actual package name, thanks to this caching mechanism.
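The alias trick relies on the import statement consulting sys.modules before searching for files. Here is a minimal sketch with the standard library's json module; json_alias_demo is a made-up key, and no file with that name exists.

```python
import json
import sys

# Register the already-imported json module under a second, made-up key.
sys.modules["json_alias_demo"] = json

# The import statement checks sys.modules first, so the alias resolves
# even though there is no json_alias_demo.py anywhere on sys.path.
import json_alias_demo

alias_is_json = json_alias_demo is json
round_trip = json_alias_demo.loads("[1, 2]")

print(alias_is_json, round_trip)
```

This is roughly what sys.modules[self._local_name] = module achieves in _load(), under the assumption that local_name differs from the real module path.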

3-4. Caching the attributes of the actual module in __dict__


# Update this object's dict so that if someone keeps a reference to the `LazyLoader`,
# lookups are efficient (`__getattr__` is only called on lookups that fail).
self.__dict__.update(module.__dict__)

In Python, the __dict__ attribute holds the dictionary of an object’s attributes. Updating the proxy module’s __dict__ with the actual module’s makes it easier to access the attributes of the real module. As we discussed in section 2 (Accessing an attribute - __getattribute__, __getattr__, and getattr) and as noted in the comments of the original source code, this lets __getattribute__ find the target attributes directly, so __getattr__ is no longer needed for them.

In my view, this part is somewhat unnecessary, as we already cache modules and use them whenever their attributes are accessed. However, this could be useful when we need to debug and inspect __dict__.

4. __dir__ and __repr__

Similar to the __dict__ update, these two dunder methods might not be strictly necessary when using LazyLoader modules. However, they could be useful for debugging. __repr__ is particularly helpful as it indicates whether the module has been loaded.


if not self._module:
    return f"<module '{self.__name__} (Not loaded yet)'>"
return repr(self._module)
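To try both methods in isolation, here is a minimal sketch with a simplified proxy. ReprProxy and demo_repr are hypothetical names, the "real" module is built in memory rather than imported, and storing the result in _module inside _load() is an assumption of this sketch.

```python
import types


class ReprProxy(types.ModuleType):
    def __init__(self, name: str):
        super().__init__(name)
        self._module = None

    def _load(self) -> types.ModuleType:
        if self._module is None:
            # Stand-in for importlib.import_module: build a small module in memory.
            loaded = types.ModuleType(self.__name__)
            loaded.HEAVY_ATTRIBUTE = "heavy"
            self._module = loaded
        return self._module

    def __dir__(self):
        # dir(proxy) lists the attributes of the real module, loading it if needed.
        return dir(self._load())

    def __repr__(self):
        if self._module is None:
            return f"<module '{self.__name__} (Not loaded yet)'>"
        return repr(self._module)


proxy = ReprProxy("demo_repr")
repr_before = repr(proxy)
names = dir(proxy)  # triggers the load
repr_after = repr(proxy)

print(repr_before)
print("HEAVY_ATTRIBUTE" in names)
print(repr_after)
```

Calling dir() is enough to trigger the load, and the repr changes once the real module is available.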

Conclusion

Although the source code itself is quite short, we covered several advanced topics, including importing modules, module scopes, and accessing object attributes in Python. Also, the concept of lazyloading is very common in computer science, but we rarely get the chance to examine how it is implemented in detail. By investigating how LazyLoader works, we learned more than we expected. Our biggest takeaway is that short code doesn’t necessarily mean easy code to analyze!
