DDD (original post)
2024-10-05 22:10:03

[Python] How do we lazyload a Python module? - analyzing LazyLoader from MLflow

(image source: https://www.irasutoya.com/2019/03/blog-post_72.html)

Intro

One day I was browsing a few popular Python ML libraries, including MLflow. While glancing at its source code, one class caught my interest: LazyLoader in __init__.py (it is actually mirrored from the wandb project, although wandb's original code has since diverged from the version MLflow uses now).

You have probably heard of lazyloading in many contexts, such as image loading in web frontends, caching strategies, and so on. I think the essence of all of them is "I am too lazy to load RIGHT NOW", with the emphasis on the hidden words "right now". In other words, the application loads a resource only when it is actually needed. In MLflow, this means a module is loaded only when something inside it (a variable, function, or class) is accessed.

But HOW? This was my main interest. So I read the source code, which looked very simple at first glance. Surprisingly, though, it took me a while to understand how it works, and I learned a lot along the way. This article analyzes that MLflow source code to understand how lazyloading works using several features of the Python language.

Playing around with LazyLoader

For the purpose of our analysis, I created a simple package called lazyloading on my local machine, and placed modules as follows:


lazyloading/
├─ __init__.py
├─ __main__.py
├─ lazy_load.py
├─ heavy_module.py


  • __init__.py: This file makes the entire directory into a package.
  • __main__.py: This file is the entry point when we want to run the entire package as follows: python -m lazyloading.
  • lazy_load.py: LazyLoader is in this file.
  • heavy_module.py: This represents a module with heavy packages to be loaded (such as PyTorch) for a simulation:

import time

for i in range(5):
    time.sleep(1)
    print(5 - i, " seconds left before loading")

print("I am heavier than Pytorch!")

HEAVY_ATTRIBUTE = "heavy"


Next, we import this heavy_module inside __main__.py:


if __name__ == "__main__":
    from lazyloading import heavy_module 


Let’s run this package and see the result:


python -m lazyloading
5  seconds left before loading
4  seconds left before loading
3  seconds left before loading
2  seconds left before loading
1  seconds left before loading
I am heavier than Pytorch!


Here we can clearly see that eagerly importing a heavy package such as PyTorch adds startup overhead to the entire application, even if the package is never used. That’s why we need lazyloading here. Let’s change __main__.py to look like this:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")
    print("nothing happens yet")
    print(heavy_module.HEAVY_ATTRIBUTE)


And the result should be:


python -m lazyloading
nothing happens yet
5  seconds left before loading
4  seconds left before loading
3  seconds left before loading
2  seconds left before loading
1  seconds left before loading
I am heavier than Pytorch!
heavy


Yes, a module wrapped by LazyLoader does not run its top-level code or import its own dependencies when the proxy is created. That happens only when an attribute of the module is first accessed. This is the power of lazyloading!
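The same behaviour can be reproduced in a single file. The sketch below is a condensed, hypothetical stand-in (MiniLazyLoader, demo_heavy_module, and the namespace dictionary are names invented for this illustration, not MLflow's code); it writes a throwaway module to a temporary directory and checks sys.modules before and after the first attribute access.

```python
import importlib
import sys
import tempfile
import types
from pathlib import Path


class MiniLazyLoader(types.ModuleType):
    """Condensed, hypothetical stand-in for LazyLoader (not MLflow's code)."""

    def __init__(self, local_name, parent_globals, name):
        self._local_name = local_name
        self._parent_globals = parent_globals
        super().__init__(str(name))

    def __getattr__(self, item):
        # Reached only when normal lookup fails, i.e. before the real import.
        module = importlib.import_module(self.__name__)
        self._parent_globals[self._local_name] = module
        self.__dict__.update(module.__dict__)
        return getattr(module, item)


# Write a throwaway module to a temporary directory and make it importable.
tmp_dir = tempfile.mkdtemp()
Path(tmp_dir, "demo_heavy_module.py").write_text('HEAVY_ATTRIBUTE = "heavy"\n')
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()

namespace = {}  # plays the role of the importing module's globals()
proxy = MiniLazyLoader("demo_heavy_module", namespace, "demo_heavy_module")

loaded_before = "demo_heavy_module" in sys.modules  # proxy exists, no import yet
value = proxy.HEAVY_ATTRIBUTE                       # first access triggers the import
loaded_after = "demo_heavy_module" in sys.modules

print(loaded_before, value, loaded_after)
```

Creating the proxy costs almost nothing; the real import is paid only at the first attribute access.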

How does LazyLoader work in MLflow? - source code analysis

The code itself is short and simple. I added type annotations and a few comments (lines enclosed in <, >) for explanations. All the other comments are the ones in the original source code.


"""Utility to lazy load modules."""
import importlib
import sys
import types

from typing import Any, TypeVar

T = TypeVar("T") # <this is added by me>

class LazyLoader(types.ModuleType):
    """Class for module lazy loading.

    This class helps lazily load modules at package level, which avoids pulling in large
    dependencies like `tensorflow` or `torch`. This class is mirrored from wandb's LazyLoader:
    https://github.com/wandb/wandb/blob/79b2d4b73e3a9e4488e503c3131ff74d151df689/wandb/sdk/lib/lazyloader.py#L9
    """

    _local_name: str # <the name of the package that is used inside code>
    _parent_module_globals: dict[str, types.ModuleType] # <importing module's namespace, accessible by calling globals()>
    _module: types.ModuleType | None # <actual module>

    def __init__(
        self, 
        local_name: str, 
        parent_module_globals: dict[str, types.ModuleType], 
        name: Any # <to be used in types.ModuleType(name=str(name)), the full package name (such as pkg.subpkg.subsubpkg)>
    ):
        self._local_name = local_name
        self._parent_module_globals = parent_module_globals
        self._module = None

        super().__init__(str(name)) 

    def _load(self) -> types.ModuleType:
        """Load the module and insert it into the parent's globals."""
        if self._module:
            # If already loaded, return the loaded module.
            return self._module

        # Import the target module and insert it into the parent's namespace

        # <see https://docs.python.org/3/library/importlib.html#importlib.import_module>
        # <absolute import, importing the module itself from a package rather than the top-level package only(like __import__)>
        # <here, self.__name__ is the variable `name` in __init__>
        # <this is why that `name` in __init__ must be the full module path>
        module = importlib.import_module(self.__name__) # this automatically updates sys.modules

        # <add the name of the module to the importing module(=parent module)'s namespace>
        # <so that you can use this module's name as a variable inside the importing module, even if it is called inside a function defined in the importing module>
        self._parent_module_globals[self._local_name] = module

        # <add the module to the list of loaded modules for caching>
        # <see https://docs.python.org/3/reference/import.html#the-module-cache>
        # <this makes it possible to import the cached module under the name _local_name>
        sys.modules[self._local_name] = module

        # Update this object's dict so that if someone keeps a reference to the `LazyLoader`,
        # lookups are efficient (`__getattr__` is only called on lookups that fail).
        self.__dict__.update(module.__dict__)

        return module

    def __getattr__(self, item: T) -> T:
        module = self._load()
        return getattr(module, item)

    def __dir__(self):
        module = self._load()
        return dir(module)

    def __repr__(self):
        if not self._module:
            return f"<module '{self.__name__} (Not loaded yet)'>"
        return repr(self._module)


Now, let’s investigate the code while lazyloading our heavy_module. Since we don’t need to simulate the heaviness of the module anymore, let’s get rid of the time.sleep(1) loop part.

1. Creating an instance of LazyLoader, proxying the original module

Let’s look at __init__() of LazyLoader.


class LazyLoader(types.ModuleType):
    # …
    # code omitted
    # …

    def __init__(
        self, 
        local_name: str, 
        parent_module_globals: dict[str, types.ModuleType], 
        name: Any # <to be used in types.ModuleType(name=str(name)); the full package name(such as pkg.subpkg.subsubpkg)>
    ):
        self._local_name = local_name
        self._parent_module_globals = parent_module_globals
        self._module = None

        super().__init__(str(name)) 


We pass local_name, parent_module_globals, and name to the constructor __init__(). At this point we are not sure what each of them means, but the last line, super().__init__(str(name)), shows that we are creating a module object, since LazyLoader inherits from types.ModuleType. Because we pass name to it, the module created by LazyLoader is recognized as a module whose __name__ is name (the same value as heavy_module.__name__ of the real module).

Printing the module’s __name__ confirms this:


# __main__.py
# run python -m lazyloading

if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")

    print(heavy_module.__name__)


which gives on our terminal:


lazyloading.heavy_module
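To isolate just this step, here is a minimal sketch with a hypothetical NamedProxy class and a made-up module path. It suggests that subclassing types.ModuleType and calling super().__init__() yields an object that looks like a module and carries the given name, while nothing is imported or registered in sys.modules.

```python
import sys
import types


class NamedProxy(types.ModuleType):
    """Hypothetical proxy that only sets the module name, nothing else."""

    def __init__(self, name):
        super().__init__(str(name))


proxy = NamedProxy("some_package.some_module")  # made-up path, never imported

proxy_name = proxy.__name__
is_module = isinstance(proxy, types.ModuleType)
is_registered = "some_package.some_module" in sys.modules

print(proxy_name, is_module, is_registered)
```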


However, in the constructor we only assigned values to the instance variables and gave the name of the module to this proxy module. Now, what happens when we try to access an attribute of the module?

2. Accessing an attribute - __getattribute__, __getattr__, and getattr

This is one of the fun parts of this class. What happens when we access an attribute of a Python object in general? Say we access HEAVY_ATTRIBUTE of heavy_module by calling heavy_module.HEAVY_ATTRIBUTE. From the code here, or from your own experience in several Python projects, you might guess that __getattr__() is called, and that’s partially correct. Look at the official docs:

Called when the default attribute access fails with an AttributeError (either __getattribute__() raises an AttributeError because name is not an instance attribute or an attribute in the class tree for self; or __get__() of a name property raises AttributeError).

(Please ignore __get__ because it is out of scope of this post, and our LazyLoader doesn’t implement __get__ either).

So the key method here is __getattribute__(). According to the docs, when we access an attribute, __getattribute__ is called first. If it cannot find the attribute, it raises AttributeError, which in turn invokes the __getattr__ defined in our class. To verify this, let’s override __getattribute__ in the LazyLoader class and change __getattr__() a little, as follows:


def __getattribute__(self, name: str) -> Any:
    try:
        print(f"__getattribute__ is called when accessing attribute '{name}'")
        return super().__getattribute__(name)

    except Exception as error:
        print(f"an error has occurred when __getattribute__() is invoked as accessing '{name}': {error}")
        raise

def __getattr__(self, item: T) -> T:
    print(f"__getattr__ is called when accessing attribute '{item}'")
    module = self._load()
    return getattr(module, item)


When we access HEAVY_ATTRIBUTE that exists in heavy_module, the result is:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")

    print(heavy_module.HEAVY_ATTRIBUTE)



python -m lazyloading
__getattribute__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
an error has occurred when __getattribute__() is invoked as accessing 'HEAVY_ATTRIBUTE': module 'lazyloading.heavy_module' has no attribute 'HEAVY_ATTRIBUTE'
__getattr__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
__getattribute__ is called when accessing attribute '_load'
__getattribute__ is called when accessing attribute '_module'
__getattribute__ is called when accessing attribute '__name__'
I am heavier than Pytorch!
__getattribute__ is called when accessing attribute '_parent_module_globals'
__getattribute__ is called when accessing attribute '_local_name'
__getattribute__ is called when accessing attribute '__dict__'
heavy


So __getattr__ is not called first: __getattribute__ runs first and raises AttributeError because our LazyLoader instance has no attribute HEAVY_ATTRIBUTE yet. Only then is __getattr__() called as a fallback. Inside it we meet getattr(); the call getattr(module, item) with item == "HEAVY_ATTRIBUTE" is equivalent to writing module.HEAVY_ATTRIBUTE. So we eventually read HEAVY_ATTRIBUTE from the actual heavy_module, provided that the module variable in __getattr__() was correctly imported and returned by self._load().

But before we move on to the _load() method, let’s access HEAVY_ATTRIBUTE once more in __main__.py and run the package:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")

    print(heavy_module.HEAVY_ATTRIBUTE)
    print(heavy_module.HEAVY_ATTRIBUTE)


Now we see the additional logs on the terminal:


# … the same log as above
__getattribute__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
heavy


It seems that __getattribute__ can now find HEAVY_ATTRIBUTE on the proxy module itself (our LazyLoader instance). This is because (spoiler alert!) _load copies the real module’s attributes into the __dict__ of the LazyLoader instance. We’ll get back to this in the next section.

3. Loading and caching the actual module

This section covers the core part of the post: loading the actual module in the method _load().

3-1. Module caching at the level of LazyLoader class

First, it checks whether our LazyLoader instance has already imported the module before (which reminds us of the Singleton pattern).


if self._module:
    # If already loaded, return the loaded module.
    return self._module


3-2. Importing the actual module with importlib.import_module

Otherwise, the method tries to import the module named __name__, which we saw in the __init__ constructor:


# <see https://docs.python.org/3/library/importlib.html#importlib.import_module>
# <absolute import, importing the module itself from a package rather than the top-level package only(like __import__)>
# <here, self.__name__ is the variable `name` in __init__>
# <this is why that `name` in __init__ must be the full module path>
module = importlib.import_module(self.__name__) # this automatically updates sys.modules


According to the docs of importlib.import_module, when we pass only the module name and omit the package argument, the name is resolved as an absolute import. Therefore, when we create a LazyLoader instance, the name argument must be the full (absolute) module path. You can run your own experiment and see that a bare module name raises ModuleNotFoundError:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("heavy_module", globals(), "heavy_module")

    print(heavy_module.HEAVY_ATTRIBUTE)



# logs omitted
ModuleNotFoundError: No module named 'heavy_module'
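The difference between absolute and relative names can be checked with the standard library alone. The following sketch uses json.decoder purely as a convenient example; the missing module name is made up. It also shows the contrast with __import__ mentioned in the code comment above, which returns the top-level package instead of the submodule.

```python
import importlib

# Absolute name: no package argument needed.
absolute = importlib.import_module("json.decoder")

# Relative name: the package argument is the anchor for the leading dot.
relative = importlib.import_module(".decoder", package="json")

# __import__ with a dotted name returns the top-level package instead.
top_level = __import__("json.decoder")

try:
    importlib.import_module(".decoder")  # relative name without an anchor
except TypeError as error:
    relative_error = type(error).__name__

try:
    importlib.import_module("module_that_does_not_exist_for_demo")
except ModuleNotFoundError as error:
    missing_error = str(error)

print(absolute.__name__, relative is absolute, top_level.__name__)
print(relative_error, missing_error)
```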


Notably, invoking importlib.import_module(self.__name__) caches the module under the key self.__name__ in sys.modules, the interpreter-wide module cache. If you put the following lines in __main__.py


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("heavy_module", globals(), "lazyloading.heavy_module")

    # check whether the module is cached at the global scope
    import sys
    print("lazyloading.heavy_module" in sys.modules)

    # accessing any attribute to load the module
    heavy_module.HEAVY_ATTRIBUTE

    print("lazyloading.heavy_module" in sys.modules)


and run the package, then the logs should be:


python -m lazyloading
False
I am heavier than Pytorch!
True


This way of caching using sys.modules is related to the next two lines that also cache the module in different ways.

3-3. Caching the module with given local_name


# <add the name of the module to the importing module(=parent module)'s namespace>
# <so that you can use this module's name as a variable inside the importing module, even if it is called inside a function defined in the importing module>
self._parent_module_globals[self._local_name] = module

# <add the module to the list of loaded modules for caching>
# <see https://docs.python.org/3/reference/import.html#the-module-cache>
# <this makes it possible to import the cached module under the name _local_name>
sys.modules[self._local_name] = module


Both lines store the module in a dictionary, self._parent_module_globals and sys.modules respectively, but under the key self._local_name (not self.__name__). This is the value we passed as local_name when creating the proxy module instance with __init__(). But what does this caching accomplish?

First, we can use the module under the given _local_name in the "parent module"’s global namespace (from the parameter’s name, and from how MLflow uses it in its top-level __init__.py, we can infer that "globals" here means the result of globals()). This means that creating the proxy inside a function doesn’t prevent the module from being used outside the function’s scope:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader

    def load_heavy_module() -> None:
        # import the module inside a function
        heavy_module = LazyLoader("heavy_module", globals(), "lazyloading.heavy_module")
        print(heavy_module.HEAVY_ATTRIBUTE)

    # loads the heavy_module inside the function's scope
    load_heavy_module()

    # the module is now in the scope of this module
    print(heavy_module)


Running the package gives:


python -m lazyloading
I am heavier than Pytorch!
heavy
<module 'lazyloading.heavy_module' from '…'> # the path of the heavy_module (a Python file)


Of course, if you pass locals() as the second argument instead, you’ll get a NameError (give it a try!).
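The difference between the two dictionaries can be seen without any module machinery. In this minimal sketch (the function and key names are invented for the illustration), a write to globals() from inside a function becomes a module-level name, while a write to locals() inside a function does not:

```python
def bind_in_globals():
    # globals() is the namespace of the module where this function is defined.
    globals()["bound_via_globals"] = 42


def bind_in_locals():
    # Writing to the dictionary returned by locals() inside a function
    # does not create a module-level name.
    locals()["bound_via_locals"] = 42


bind_in_globals()
bind_in_locals()

has_global = "bound_via_globals" in globals()
has_local = "bound_via_locals" in globals()

print(has_global, has_local)
```

Since globals() always refers to the module in which the calling code is defined, the dictionary has to be passed to LazyLoader explicitly by the importing module.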

Second, we can also import the module from any other place in the package under the given local name. Let’s create another module, heavy_module_loader.py, inside the current package lazyloading:


lazyloading/
├─ __init__.py
├─ __main__.py
├─ lazy_load.py
├─ heavy_module.py
├─ heavy_module_loader.py


Note that I passed a custom name, heavy_module_local, as the local_name of the proxy module.


# heavy_module_loader.py

from lazyloading.lazy_load import LazyLoader

heavy_module = LazyLoader("heavy_module_local", globals(), "lazyloading.heavy_module")
heavy_module.HEAVY_ATTRIBUTE


Now let __main__.py be simpler:


from lazyloading import heavy_module_loader

if __name__ == "__main__":
    import heavy_module_local
    print(heavy_module_local)


Your IDE will probably flag this import as unresolved, but actually running it gives the expected result:


python -m lazyloading
I am heavier than Pytorch!
<module 'lazyloading.heavy_module' from '…'> # the path of the heavy_module (a Python file)


Although MLflow seems to use the same string value for both local_name and name when creating LazyLoader instances, we can use the local_name as an alias for the actual package name, thanks to this caching mechanism.
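The aliasing effect relies only on how the import statement consults sys.modules, so it can be sketched with a standard library module. The alias json_alias_for_demo is a made-up name for this illustration:

```python
import importlib
import sys

real = importlib.import_module("json")
same_object = real is sys.modules["json"]  # import_module registers the module

# Register the same module object under a second, made-up key.
sys.modules["json_alias_for_demo"] = real

# The import statement checks sys.modules first, so the alias resolves
# even though no file named json_alias_for_demo.py exists.
import json_alias_for_demo

alias_is_real = json_alias_for_demo is real
alias_name = json_alias_for_demo.__name__  # still the real module's name

del sys.modules["json_alias_for_demo"]  # clean up the demo entry

print(same_object, alias_is_real, alias_name)
```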

3-4. Caching the attributes of the actual module in __dict__


# Update this object's dict so that if someone keeps a reference to the `LazyLoader`,
# lookups are efficient (`__getattr__` is only called on lookups that fail).
self.__dict__.update(module.__dict__)


In Python, the __dict__ attribute holds the dictionary of an object’s attributes. Updating the proxy module’s __dict__ with the actual module’s entries makes it easier to access the attributes of the real module. As we discussed in section 2 and as the comment in the original source code notes, this lets the default lookup in __getattribute__ find the target attributes directly on the proxy, so the __getattr__ fallback is no longer needed for them.

In my view, this part is somewhat unnecessary, as we already cache modules and use them whenever their attributes are accessed. However, this could be useful when we need to debug and inspect __dict__.
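The effect of the __dict__ update can be sketched on a small hypothetical proxy (CachingProxy and the SimpleNamespace target are stand-ins for the LazyLoader instance and the real module). After the first fallback, further lookups are served from the proxy's own __dict__, and the copied values are a snapshot of the target at that moment:

```python
import types


class CachingProxy:
    """Hypothetical proxy that copies the target's attributes on first miss."""

    def __init__(self, target):
        self._target = target
        self.fallback_calls = 0

    def __getattr__(self, name):
        self.fallback_calls += 1
        # Copy every attribute of the target, not just the requested one.
        self.__dict__.update(vars(self._target))
        return getattr(self._target, name)


real = types.SimpleNamespace(HEAVY_ATTRIBUTE="heavy", OTHER="x")
proxy = CachingProxy(real)

first = proxy.HEAVY_ATTRIBUTE   # goes through __getattr__
second = proxy.HEAVY_ATTRIBUTE  # served from proxy.__dict__
third = proxy.OTHER             # also already copied
calls = proxy.fallback_calls

real.HEAVY_ATTRIBUTE = "changed"
stale = proxy.HEAVY_ATTRIBUTE   # the proxy still holds the copied value

print(first, second, third, calls, stale)
```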

4. __dir__ and __repr__

Similar to the __dict__ update, these two dunder methods might not be strictly necessary when using LazyLoader modules. However, they could be useful for debugging. __repr__ is particularly helpful as it indicates whether the module has been loaded.


if not self._module:
    return f"<module '{self.__name__} (Not loaded yet)'>"
return repr(self._module)
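As a minimal sketch of the two states of this __repr__, the hypothetical ReprProxy below reuses the same three lines and has its _module attribute assigned by hand, purely to show both branches; json stands in for the real module.

```python
import types


class ReprProxy(types.ModuleType):
    """Hypothetical proxy that only demonstrates the __repr__ branches."""

    def __init__(self, name):
        super().__init__(str(name))
        self._module = None  # assigned by hand below, for illustration only

    def __repr__(self):
        if not self._module:
            return f"<module '{self.__name__} (Not loaded yet)'>"
        return repr(self._module)


proxy = ReprProxy("json")
before = repr(proxy)

import json

proxy._module = json  # simulate the "loaded" state
after = repr(proxy)

print(before)
print(after)
```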




Conclusion

Although the source code itself is quite short, we covered several advanced topics, including importing modules, module scopes, and accessing object attributes in Python. Also, the concept of lazyloading is very common in computer science, but we rarely get the chance to examine how it is implemented in detail. By investigating how LazyLoader works, we learned more than we expected. Our biggest takeaway is that short code doesn’t necessarily mean easy code to analyze!
