[Python] 如何延遲載入 Python 模組？ - 從 MLflow 分析 LazyLoader-Python教學-PHP中文網

首頁

後端開發

Python教學

[Python] 如何延遲載入 Python 模組？ - 從 MLflow 分析 LazyLoader

DDD

Oct 05, 2024 pm 10:10 PM

[Python] How do we lazyload a Python module? - analyzing LazyLoader from MLflow

(image source: https://www.irasutoya.com/2019/03/blog-post_72.html)

Intro

One day I was hopping around a few popular ML libraries in Python, including MLflow. While glancing at its source code, one class attracted my interest, LazyLoader in __init__.py (well, this actually mirrors from the wandb project, but the original code has changed from what MLflow is using now, as you can see).

You probably heard about the concept of lazyloading from many contexts, such as web frontend image loading, caching strategy, and so on. I think the essence of all those lazyloading concepts is, that "I am too lazy to load RIGHT NOW" - yes, the hidden words "right now". Namely, the application will load and use that resource only when it is needed. So here in this MLflow library, the modules are loaded only when the resources in it — variables, functions, and classes — are accessed.

But HOW? This was my main interest. So I read the source code, which looked very simple at first glance. However, surprisingly, it took a bit of time to understand how it works, and I learned a lot from reading the code. This article is about analyzing this source code of MLflow so that we understand how such lazyloading works using various techniques of Python language.

Playing around with LazyLoader

For the purpose of our analysis, I created a simple package called lazyloading on my local machine, and placed modules as follows:


lazyloading/
├─ __init__.py
├─ __main__.py
├─ lazy_load.py
├─ heavy_module.py

__init__.py: This file makes the entire directory into a package.
__main__.py: This file is the entry point when we want to run the entire package as follows: python -m lazyloading.
lazy_load.py: LazyLoader is in this file.
heavy_module.py: This represents a module with heavy packages to be loaded (such as PyTorch) for a simulation:


import time

for i in range(5):
    time.sleep(1)
    print(5 - i, " seconds left before loading")

print("I am heavier than Pytorch!")

HEAVY_ATTRIBUTE = "heavy”

Next, we import this heavy_module inside __main__.py:


if __name__ == "__main__":
    from lazyloading import heavy_module

Let’s run this package and see the result:


python -m lazyloading
5  seconds left before loading
4  seconds left before loading
3  seconds left before loading
2  seconds left before loading
1  seconds left before loading
I am heavier than pytorch!

Here we can clearly see that if we simply import heavy packages such as PyTorch, it could be an overhead for the entire application. That’s why we need lazyloading here. Let’s change __main__.py to look like this:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")
    print("nothing happens yet")
    print(heavy_module.HEAVY_ATTRIBUTE)

And the result should be:


python -m lazyloading
nothing happens yet
5  seconds left before loading
4  seconds left before loading
3  seconds left before loading
2  seconds left before loading
1  seconds left before loading
heavy

Yes, any module imported by LazyLoader doesn’t need to execute any script or import other packages. It happens only when any attribute of the module is accessed. This is the power of lazyloading!

How LazyLoader works in MLflow? - source code analysis

The code itself is short and simple. I added type annotations and a few comments (lines enclosed in ) for explanations. All the other comments are the ones in the original source code.


"""Utility to lazy load modules."""
import importlib
import sys
import types

from typing import Any, TypeVar

T = TypeVar("T") # <this is added by me>

class LazyLoader(types.ModuleType):
    """Class for module lazy loading.

    This class helps lazily load modules at package level, which avoids pulling in large
    dependencies like `tensorflow` or `torch`. This class is mirrored from wandb's LazyLoader:
    https://github.com/wandb/wandb/blob/79b2d4b73e3a9e4488e503c3131ff74d151df689/wandb/sdk/lib/lazyloader.py#L9
    """

    _local_name: str # <the name of the package that is used inside code>
    _parent_module_globals: dict[str, types.ModuleType] # <importing module namespace accessible by calling globals>
    _module: types.ModuleType | None # <actual module>

    def __init__(
        self, 
        local_name: str, 
        parent_module_globals: dict[str, types.ModuleType], 
        name: Any # <to be used in types.moduletype the full package name as pkg.subpkg.subsubpkg>
    ):
        self._local_name = local_name
        self._parent_module_globals = parent_module_globals
        self._module = None

        super().__init__(str(name)) 

    def _load(self) -> types.ModuleType:
        """Load the module and insert it into the parent's globals."""
        if self._module:
            # If already loaded, return the loaded module.
            return self._module

        # Import the target module and insert it into the parent's namespace

        # <see https:>
        # <absolute import importing the module itself from a package rather than top-level only __import__>
        # <here self.__name__ is the variable in __init__>
        # <this is why that in __init__ must be the full module path>
        module = importlib.import_module(self.__name__) # this automatically updates sys.modules

        # <add the name of module to importing namespace>
        # <so that you can use this module name as a variable inside the importing even if it is called function defined in>
        self._parent_module_globals[self._local_name] = module

        # <add the module to list of loaded modules for caching>
        # <see https:>
        # <this makes possible to import cached module with the variable _local_name sys.modules update this object dict so that if someone keeps a reference lookups are efficient is only called on fail self.__dict__.update return def __getattr__ item: t> T:
        module = self._load()
        return getattr(module, item)

    def __dir__(self):
        module = self._load()
        return dir(module)

    def __repr__(self):
        if not self._module:
            return f"<module loaded yet>"
        return repr(self._module)


</module></this></see></add></so></add></this></here></absolute></see></to></actual></importing></the></this>

Now, let’s investigate the code while lazyloading our heavy_module. Since we don’t need to simulate the heaviness of the module anymore, let’s get rid of the time.sleep(1) loop part.

1. Creating an instance of LazyLoader, proxying the original module

Let’s look at __init__() of LazyLoader.


class LazyLoader(types.ModuleType):
    # …
    # code omitted
    # …

    def __init__(
        self, 
        local_name: str, 
        parent_module_globals: dict[str, types.ModuleType], 
        name: Any # <to be used in types.moduletype the full package name as pkg.subpkg.subsubpkg>
    ):
        self._local_name = local_name
        self._parent_module_globals = parent_module_globals
        self._module = None

        super().__init__(str(name)) 


</to>

We provide local_name, parent_module_globals, and name to the constructor __init__(). At the moment, we are not sure what all those means, but at least the last line indicates that we are actually generating a module - super().__init__(str(name)), since LazyLoader inherits types.ModuleType. By providing the variable name, our module created by LazyLoader is recognized as a module with name name(which is the same as heavy_module.__name__).

Printing out the module itself proves this:


# __main__.py
# run python -m lazyloading

if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")

    print(heavy_module.__name__)

which gives on our terminal:


lazyloading.heavy_module

However, in the constructor we only assigned values to the instance variables and gave the name of the module to this proxy module. Now, what happens when we try to access an attribute of the module?

2. Accessing an attribute - getattribute, getattr, and getattr

This is one of the fun parts of this class. What happens when we access an attribute of a Python object in general? Say we access HEAVY_ATTRIBUTE of heavy_module by calling heavy_module.HEAVY_ATTRIBUTE. From the code here, or from your own experience in several Python projects, you might guess that __getattr__() is called, and that’s partially correct. Look at the official docs:

Called when the default attribute access fails with an AttributeError (either getattribute() raises an AttributeError because name is not an instance attribute or an attribute in the class tree for self; or get of a name property raises AttributeError).

(Please ignore __get__ because it is out of scope of this post, and our LazyLoader doesn’t implement __get__ either).

So __getattribute__() the key method here is __getattribute__. According to the docs, when we try to access an attribute, __getattribute__ will be called first, and if the attribute we’re looking for cannot be found by __getattribute__, AttributeError will be raised, which will in turn invoke our __getattr__ in the code. To verify this, let’s override __getattribute__ of the LazyLoader class, and change __getattr__() a little bit as follows:


def __getattribute__(self, name: str) -> Any:
    try:
        print(f"__getattribute__ is called when accessing attribute '{name}'")
        return super().__getattribute__(name)

    except Exception as error:
        print(f"an error has occurred when __getattribute__() is invoked as accessing '{name}': {error}")
        raise

def __getattr__(self, item: T) -> T:
    print(f"__getattr__ is called when accessing attribute '{item}'")
    module = self._load()
    return getattr(module, item)

When we access HEAVY_ATTRIBUTE that exists in heavy_module, the result is:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")

    print(heavy_module.HEAVY_ATTRIBUTE)


python -m lazyloading
__getattribute__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
an error has occurred when __getattribute__() is invoked as accessing 'HEAVY_ATTRIBUTE': module 'lazyloading.heavy_module' has no attribute 'HEAVY_ATTRIBUTE'
__getattr__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
__getattribute__ is called when accessing attribute '_load'
__getattribute__ is called when accessing attribute '_module'
__getattribute__ is called when accessing attribute '__name__'
I am heavier than Pytorch!
__getattribute__ is called when accessing attribute '_parent_module_globals'
__getattribute__ is called when accessing attribute '_local_name'
__getattribute__ is called when accessing attribute '__dict__'
heavy

So __getattr__ is actually not called directly, but __getattribute__ is called first, and it raises AttributeError because our LazyLoader instance doesn’t have attribute HEAVY_ATTRIBUTE. Now __getattr__() is called as a failover. Then we meet getattr(), but this code line getattr(module, item) is equivalent to code module.item in Python. So eventually, we access the HEAVY_ATTRIBUTE in the actual module heavy_module, if module variable in __getattr__() is correctly imported and returned by self._load().

But before we move on to investigating _load() method, let’s call HEAVY_ATTRIBUTE once again in __main__.py and run the package:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")

    print(heavy_module.HEAVY_ATTRIBUTE)
    print(heavy_module.HEAVY_ATTRIBUTE)

Now we see the additional logs on the terminal:


# … the same log as above
__getattribute__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
heavy

It seems that __getattribute__ can access HEAVY_ATTRIBUTE now inside the proxy module(our LazyLoader instance). This is because(!!!spoiler alert!!!) _load caches the accessed attribute in __dict__ attribute of the LazyLoader instance. We’ll get back to this in the next section.

3. Loading and caching the actual module

This section covers the core part the post - loading the actual module in the function _load().

3-1. Module caching at the level of LazyLoader class

First, it checks whether our LazyLoader instance has already imported the module before (which reminds us of the Singleton pattern).


if self._module:
    # If already loaded, return the loaded module.
    return self._module

3-2. Importing the actual module with importlib.import_module

Otherwise, the method tries to import the module named __name__, which we saw in the __init__ constructor:


# <see https:>
# <absolute import importing the module itself from a package rather than top-level only __import__>
# <here self.__name__ is the variable in __init__>
# <this is why that in __init__ must be the full module path>
module = importlib.import_module(self.__name__) # this automatically updates sys.modules


</this></here></absolute></see>

According to the docs of importlib.import_module, when we don’t provide the pkg argument and only the path string, the function tries to import the package in the absolute manner. Therefore, when we create a LazyLoader instance, the name argument should be the absolute term. You can run your own experiment to see it raises ModuleNotFoundError:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("heavy_module", globals(), "heavy_module")

    print(heavy_module.HEAVY_ATTRIBUTE)


# logs omitted
ModuleNotFoundError: No module named 'heavy_module'

Notably, invoking importlib.import_module(self.__name__) caches the module with name self.__name__ in the global scope. If you run the following lines in __main__.py


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("heavy_module", globals(), "lazyloading.heavy_module")

    # check whether the module is cached at the global scope
    import sys
    print("lazyloading.heavy_module" in sys.modules)

    # accessing any attribute to load the module
    heavy_module.HEAVY_ATTRIBUTE

    print("lazyloading.heavy_module" in sys.modules)

and run the package, then the logs should be:


python -m lazyloading
False
I am heavier than Pytorch!
True

This way of caching using sys.modules is related to the next two lines that also cache the module in different ways.

3-3. Caching the module with given local_name


# <add the name of module to importing namespace>
# <so that you can use this module name as a variable inside the importing even if it is called function defined in>
self._parent_module_globals[self._local_name] = module

# <add the module to list of loaded modules for caching>
# <see https:>
# <this makes possible to import cached module with the variable _local_name sys.modules>

<p>Both lines cache the module in the dictionaries self._parent_module_globals and sys.modules respectively, but with the key self._local_name(not self.__name__). This is the variable we provided as local_name when creating this proxy module instance with __init__(). But what does this caching accomplish?</p>

<p>First, we can use the module with the given _local_name in the "parent module"’s globals(from the parameter’s name and seeing how MLflow uses in its uppermost __init__.py, we can infer that here the word <em>globals</em> means (globals()). This means that importing the module inside a function doesn’t limit the module to be used outside the function’s scope:</p>

<pre class="brush:php;toolbar:false">

if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader

    def load_heavy_module() -> None:
        # import the module inside a function
        heavy_module = LazyLoader("heavy_module", globals(), "lazyloading.heavy_module")
        print(heavy_module.HEAVY_ATTRIBUTE)

    # loads the heavy_module inside the function's scope
    load_heavy_module()

    # the module is now in the scope of this module
    print(heavy_module)

Running the package gives:


python -m lazyloading
I am heavier than Pytorch!
heavy
<module from> # the path of the heavy_module(a Python file)


</module>

Of course, if you provide the second argument locals(), then you’ll get NameError(give it a try!).

Second, we can also import the module in any other place inside the whole package with the given local name. Let’s create another module heavy_module_loader.py inside the current package lazyloading :


lazyloading/
├─ __init__.py
├─ __main__.py
├─ lazy_load.py
├─ heavy_module.py
├─ heavy_module_loader.py

Note that I used a custom name heavy_module_local for the local variable name of the proxy module.


# heavy_module_loader.py

from lazyloading.lazy_load import LazyLoader

heavy_module = LazyLoader("heavy_module_local", globals(), "lazyloading.heavy_module")
heavy_module.HEAVY_ATTRIBUTE

Now let __main__.py be simpler:


from lazyloading import heavy_module_loader

if __name__ == "__main__":
    import heavy_module_local
    print(heavy_module_local)

Your IDE will probably alert this line as having a syntax error, but actually running it will give us the expected result:


python -m lazyloading
I am heavier than Pytorch!
<module from> # the path of the heavy_module(a Python file)


</module>

Although MLflow seems to use the same string value for both local_name and name when creating LazyLoader instances, we can use the local_name as an alias for the actual package name, thanks to this caching mechanism.

3-4. Caching the attributes of the actual module in dict


# Update this object's dict so that if someone keeps a reference to the `LazyLoader`,
# lookups are efficient (`__getattr__` is only called on lookups that fail).
self.__dict__.update(module.__dict__)

In Python, the attribute __dict__ gives the dictionary of attributes of the given object. Updating this proxy module’s attributes with the actual module’s ones makes the user easier to access the attributes of the real one. As we discussed in section 2(2. Accessing an attribute - __getattribute__, __getattr__, and getattr) and noted in the comments of the original source code, this allows __getattribute__ and __getattr__ to directly access the target attributes.

In my view, this part is somewhat unnecessary, as we already cache modules and use them whenever their attributes are accessed. However, this could be useful when we need to debug and inspect __dict__.

4. dir and repr

Similar to __dict__, these two dunder functions might not be strictly necessary when using LazyLoader modules. However, they could be useful for debugging. __repr__ is particularly helpful as it indicates whether the module has been loaded.


<p>if not self.<em>module</em>:<br>
    return f"<module>name_} (Not loaded yet)'>"<br>
return repr(self._module)</module></p>

Conclusion

Although the source code itself is quite short, we covered several advanced topics, including importing modules, module scopes, and accessing object attributes in Python. Also, the concept of lazyloading is very common in computer science, but we rarely get the chance to examine how it is implemented in detail. By investigating how LazyLoader works, we learned more than we expected. Our biggest takeaway is that short code doesn’t necessarily mean easy code to analyze!

以上是[Python] 如何延遲載入 Python 模組？ - 從 MLflow 分析 LazyLoader的詳細內容。更多資訊請關注PHP中文網其他相關文章！

陳述

本文內容由網友自願投稿，版權歸原作者所有。本站不承擔相應的法律責任。如發現涉嫌抄襲或侵權的內容，請聯絡admin@php.cn

您如何將元素附加到Python列表中？May 04, 2025 am 12:17 AM

toAppendElementStoApythonList，usetheappend（）方法forsingleements，Extend（）formultiplelements，andinsert（）forspecificpositions.1）useeAppend（）foraddingoneOnelementAttheend.2）useextendTheEnd.2）useextendexendExendEnd（

您如何創建Python列表？舉一個例子。May 04, 2025 am 12:16 AM

TocreateaPythonlist,usesquarebrackets[]andseparateitemswithcommas.1)Listsaredynamicandcanholdmixeddatatypes.2)Useappend(),remove(),andslicingformanipulation.3)Listcomprehensionsareefficientforcreatinglists.4)Becautiouswithlistreferences;usecopy()orsl

討論有效存儲和數值數據的處理至關重要的實際用例。May 04, 2025 am 12:11 AM

金融、科研、医疗和AI等领域中，高效存储和处理数值数据至关重要。1)在金融中，使用内存映射文件和NumPy库可显著提升数据处理速度。2)科研领域，HDF5文件优化数据存储和检索。3)医疗中，数据库优化技术如索引和分区提高数据查询性能。4)AI中，数据分片和分布式训练加速模型训练。通过选择适当的工具和技术，并权衡存储与处理速度之间的trade-off，可以显著提升系统性能和可扩展性。

您如何創建Python數組？舉一個例子。May 04, 2025 am 12:10 AM

pythonarraysarecreatedusiseThearrayModule，notbuilt-Inlikelists.1）importThearrayModule.2）指定tefifythetypecode，例如，'i'forineizewithvalues.arreaysofferbettermemoremorefferbettermemoryfforhomogeNogeNogeNogeNogeNogeNogeNATATABUTESFELLESSFRESSIFERSTEMIFICETISTHANANLISTS。

使用Shebang系列指定Python解釋器有哪些替代方法？May 04, 2025 am 12:07 AM

除了shebang線，還有多種方法可以指定Python解釋器：1.直接使用命令行中的python命令；2.使用批處理文件或shell腳本；3.使用構建工具如Make或CMake；4.使用任務運行器如Invoke。每個方法都有其優缺點，選擇適合項目需求的方法很重要。

列表和陣列之間的選擇如何影響涉及大型數據集的Python應用程序的整體性能？May 03, 2025 am 12:11 AM

ForhandlinglargedatasetsinPython,useNumPyarraysforbetterperformance.1)NumPyarraysarememory-efficientandfasterfornumericaloperations.2)Avoidunnecessarytypeconversions.3)Leveragevectorizationforreducedtimecomplexity.4)Managememoryusagewithefficientdata

說明如何將內存分配給Python中的列表與數組。May 03, 2025 am 12:10 AM

Inpython，ListSusedynamicMemoryAllocationWithOver-Asalose，而alenumpyArraySallaySallocateFixedMemory.1）listssallocatemoremoremoremorythanneededinentientary上，respizeTized.2）numpyarsallaysallaysallocateAllocateAllocateAlcocateExactMemoryForements，OfferingPrediCtableSageButlessemageButlesseflextlessibility。