[Python] How do we lazyload a Python module? - analyzing LazyLoader from MLflow


DDD

Oct 05, 2024 10:10 PM


(Image source: https://www.irasutoya.com/2019/03/blog-post_72.html)

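Introduction

One day I was hopping around a few popular ML libraries in Python, including MLflow. While glancing at its source code, one class caught my interest: LazyLoader in __init__.py (well, this is actually mirrored from the wandb project, but as you can see, the original code has since diverged from what MLflow uses now).

You have probably heard about lazy loading in many contexts, such as web frontend image loading, caching strategies, and so on. I think the essence of all these lazy loading ideas is "I’m too lazy to load it RIGHT NOW" - yes, the hidden words are "right now". That is, the application loads and uses a resource only when it is needed. So in the MLflow library, modules are loaded only when their resources, such as variables, functions, and classes, are accessed.

But HOW? This was my main interest. So I read the source code, which looks very simple at first glance. To my surprise, however, it took some time to understand how it works, and I learned a lot from reading the code. This article analyzes MLflow’s source code to understand how this lazy loading works using several features of the Python language.

Playing with LazyLoader

For the analysis, I created a simple package named lazyloading on my local machine and placed the modules as follows: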


lazyloading/
├─ __init__.py
├─ __main__.py
├─ lazy_load.py
├─ heavy_module.py

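__init__.py: this file makes the whole directory a package.
__main__.py: this file is the entry point for running the whole package, like this: python -m lazyloading.
lazy_load.py: LazyLoader lives in this file.
heavy_module.py: this represents a module that loads a heavy package (such as PyTorch), for simulation.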


import time

for i in range(5):
    time.sleep(1)
    print(5 - i, " seconds left before loading")

print("I am heavier than Pytorch!")

HEAVY_ATTRIBUTE = "heavy"

Next, we import this heavy_module inside __main__.py:


if __name__ == "__main__":
    from lazyloading import heavy_module

Let’s run this package and see the result.


python -m lazyloading
5  seconds left before loading
4  seconds left before loading
3  seconds left before loading
2  seconds left before loading
1  seconds left before loading
I am heavier than Pytorch!

Here we can clearly see that simply importing a heavy package like PyTorch can add overhead to the whole application. This is why we need lazy loading here. Let’s change __main__.py as follows:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")
    print("nothing happens yet")
    print(heavy_module.HEAVY_ATTRIBUTE)

The result is:


python -m lazyloading
nothing happens yet
5  seconds left before loading
4  seconds left before loading
3  seconds left before loading
2  seconds left before loading
1  seconds left before loading
I am heavier than Pytorch!
heavy

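Yes, a module wrapped with LazyLoader doesn’t run any of its code or import other packages when the proxy is created. That happens only when an attribute of the module is accessed. This is the power of lazy loading!

How does LazyLoader work in MLflow? - analyzing the source code

The code itself is short and simple. I added type annotations and a few comments (lines enclosed in <>) for explanation. All other comments are from the original source code.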


"""Utility to lazy load modules."""
import importlib
import sys
import types

from typing import Any, TypeVar

T = TypeVar("T") # <this is added by me>

class LazyLoader(types.ModuleType):
    """Class for module lazy loading.

    This class helps lazily load modules at package level, which avoids pulling in large
    dependencies like `tensorflow` or `torch`. This class is mirrored from wandb's LazyLoader:
    https://github.com/wandb/wandb/blob/79b2d4b73e3a9e4488e503c3131ff74d151df689/wandb/sdk/lib/lazyloader.py#L9
    """

    _local_name: str # <the name of the package that is used inside the code>
    _parent_module_globals: dict[str, types.ModuleType] # <the importing module's namespace, accessible by calling globals()>
    _module: types.ModuleType | None # <the actual module>

    def __init__(
        self, 
        local_name: str, 
        parent_module_globals: dict[str, types.ModuleType], 
        name: Any # <to be used in types.ModuleType; the full package name, such as pkg.subpkg.subsubpkg>
    ):
        self._local_name = local_name
        self._parent_module_globals = parent_module_globals
        self._module = None

        super().__init__(str(name)) 

    def _load(self) -> types.ModuleType:
        """Load the module and insert it into the parent's globals."""
        if self._module:
            # If already loaded, return the loaded module.
            return self._module

        # Import the target module and insert it into the parent's namespace

        # <see https:>
        # <absolute import: importing the module itself from a package, rather than only the top-level package as __import__ does>
        # <here, self.__name__ is the variable `name` in __init__>
        # <this is why `name` in __init__ must be the full module path>
        module = importlib.import_module(self.__name__) # this automatically updates sys.modules

        # <add the name of the module to the importing module's namespace>
        # <so that you can use this module's name as a variable inside the importing module, even if it is called inside a function defined in the importing module>
        self._parent_module_globals[self._local_name] = module

        # <add the module to the list of loaded modules for caching>
        # <see https:>
        # <this makes it possible to import the cached module with the variable _local_name>
        sys.modules[self._local_name] = module

        # Update this object's dict so that if someone keeps a reference to the `LazyLoader`,
        # lookups are efficient (`__getattr__` is only called on lookups that fail).
        self.__dict__.update(module.__dict__)

        return module

    def __getattr__(self, item: T) -> T:
        module = self._load()
        return getattr(module, item)

    def __dir__(self):
        module = self._load()
        return dir(module)

    def __repr__(self):
        if not self._module:
            return f"<module '{self.__name__} (Not loaded yet)'>"
        return repr(self._module)



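Now let’s investigate the code while lazy loading heavy_module. Since we no longer need to simulate the heaviness of the module, I’ll remove the time.sleep(1) loop.

1. Creating an instance of LazyLoader, a proxy of the original module

Let’s look at __init__() of LazyLoader.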


class LazyLoader(types.ModuleType):
    # …
    # code omitted
    # …

    def __init__(
        self, 
        local_name: str, 
        parent_module_globals: dict[str, types.ModuleType], 
        name: Any # <to be used in types.ModuleType; the full package name, such as pkg.subpkg.subsubpkg>
    ):
        self._local_name = local_name
        self._parent_module_globals = parent_module_globals
        self._module = None

        super().__init__(str(name)) 



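We provide local_name, parent_module_globals, and name to the constructor __init__(). For now we’re not sure what all of these mean, but at least the last line, super().__init__(str(name)), indicates that we are actually creating a module, since LazyLoader inherits from types.ModuleType. By providing the variable name, the module created by LazyLoader is recognized as a module whose name is name (the same as heavy_module.__name__).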

Printing the module’s __name__ confirms this:


# __main__.py
# run python -m lazyloading

if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")

    print(heavy_module.__name__)

which gives on our terminal:


lazyloading.heavy_module
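The same behavior can be seen with types.ModuleType alone. The following is a minimal sketch, not MLflow code; the name demo_pkg.heavy is made up for illustration and no such package needs to exist. It suggests that constructing a module object only gives it a name: nothing is imported, executed, or registered.

```python
import sys
import types

# "demo_pkg.heavy" is a made-up name; nothing is imported or executed here.
proxy = types.ModuleType("demo_pkg.heavy")

print(proxy.__name__)                       # the name passed to the constructor
print(isinstance(proxy, types.ModuleType))  # the object really is a module
print("demo_pkg.heavy" in sys.modules)      # constructing it does not register it
```

This is why the proxy can claim the real module’s name long before the real module exists in the process.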

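However, in the constructor we only assigned values to instance variables and gave this proxy module its name. Now, what happens when we try to access an attribute of the module?

2. Accessing an attribute - __getattribute__, __getattr__, and getattr

This is one of the fun parts of this class. What happens when we access an attribute of a Python object in general? Say we access HEAVY_ATTRIBUTE of heavy_module by calling heavy_module.HEAVY_ATTRIBUTE. From the code here, or from your own experience in several Python projects, you might guess that __getattr__() is called, and that is partially correct. Look at the official documentation:

Called when the default attribute access fails with an AttributeError (either __getattribute__() raises an AttributeError because name is not an instance attribute or an attribute in the class tree for self; or __get__() of a name property raises AttributeError).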

(Please ignore __get__ because it is out of scope of this post, and our LazyLoader doesn’t implement __get__ either).

So the key method here is __getattribute__(). According to the docs, when we try to access an attribute, __getattribute__ will be called first, and if the attribute we’re looking for cannot be found by __getattribute__, AttributeError will be raised, which will in turn invoke our __getattr__ in the code. To verify this, let’s override __getattribute__ of the LazyLoader class, and change __getattr__() a little bit as follows:


def __getattribute__(self, name: str) -> Any:
    try:
        print(f"__getattribute__ is called when accessing attribute '{name}'")
        return super().__getattribute__(name)

    except Exception as error:
        print(f"an error has occurred when __getattribute__() is invoked as accessing '{name}': {error}")
        raise

def __getattr__(self, item: T) -> T:
    print(f"__getattr__ is called when accessing attribute '{item}'")
    module = self._load()
    return getattr(module, item)

When we access HEAVY_ATTRIBUTE that exists in heavy_module, the result is:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")

    print(heavy_module.HEAVY_ATTRIBUTE)


python -m lazyloading
__getattribute__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
an error has occurred when __getattribute__() is invoked as accessing 'HEAVY_ATTRIBUTE': module 'lazyloading.heavy_module' has no attribute 'HEAVY_ATTRIBUTE'
__getattr__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
__getattribute__ is called when accessing attribute '_load'
__getattribute__ is called when accessing attribute '_module'
__getattribute__ is called when accessing attribute '__name__'
I am heavier than Pytorch!
__getattribute__ is called when accessing attribute '_parent_module_globals'
__getattribute__ is called when accessing attribute '_local_name'
__getattribute__ is called when accessing attribute '__dict__'
heavy

So __getattr__ is not the first method to run: __getattribute__ is called first, and it raises AttributeError because our LazyLoader instance doesn’t have the attribute HEAVY_ATTRIBUTE. Only then is __getattr__() called as a fallback. Inside it we meet getattr(), and the line getattr(module, item) is equivalent to module.HEAVY_ATTRIBUTE when item holds the string "HEAVY_ATTRIBUTE". So eventually we access HEAVY_ATTRIBUTE in the actual module heavy_module, provided that the module variable in __getattr__() is correctly imported and returned by self._load().

But before we move on to investigating _load() method, let’s call HEAVY_ATTRIBUTE once again in __main__.py and run the package:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")

    print(heavy_module.HEAVY_ATTRIBUTE)
    print(heavy_module.HEAVY_ATTRIBUTE)

Now we see the additional logs on the terminal:


# … the same log as above
__getattribute__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
heavy

It seems that __getattribute__ can now find HEAVY_ATTRIBUTE inside the proxy module (our LazyLoader instance). This is because (spoiler alert!) _load copies the attributes of the actual module into the __dict__ attribute of the LazyLoader instance. We’ll get back to this in the next section.

3. Loading and caching the actual module

This section covers the core part of the post - loading the actual module in the function _load().

3-1. Module caching at the level of LazyLoader class

First, _load() checks whether our LazyLoader instance has already imported the module (which reminds us of the Singleton pattern).


if self._module:
    # If already loaded, return the loaded module.
    return self._module

3-2. Importing the actual module with importlib.import_module

Otherwise, the method tries to import the module named self.__name__, which we set in the __init__ constructor:


# <see https:>
# <absolute import: importing the module itself from a package, rather than only the top-level package as __import__ does>
# <here, self.__name__ is the variable `name` in __init__>
# <this is why `name` in __init__ must be the full module path>
module = importlib.import_module(self.__name__) # this automatically updates sys.modules

According to the docs of importlib.import_module, when we provide only the module path string without the package argument, the function imports the module in absolute terms. Therefore, when we create a LazyLoader instance, the name argument should be the absolute (fully qualified) module path. You can run your own experiment to see that a bare module name raises ModuleNotFoundError:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("heavy_module", globals(), "heavy_module")

    print(heavy_module.HEAVY_ATTRIBUTE)


# logs omitted
ModuleNotFoundError: No module named 'heavy_module'
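To see the absolute and relative forms side by side without touching the lazyloading package, here is a small sketch that uses the standard library’s json package as a stand-in. The module name no_such_module_for_this_demo is deliberately made up so that the lookup fails.

```python
import importlib

# Absolute import: the full dotted path, no package argument.
absolute = importlib.import_module("json.decoder")

# Relative import: a leading dot plus the package that anchors it.
relative = importlib.import_module(".decoder", package="json")

print(absolute is relative)  # both forms resolve to the same module object

# A relative name without the package argument cannot be resolved.
try:
    importlib.import_module(".decoder")
    relative_error = None
except TypeError as error:
    relative_error = error

# An absolute name that does not exist raises ModuleNotFoundError.
try:
    importlib.import_module("no_such_module_for_this_demo")
    missing_error = None
except ModuleNotFoundError as error:
    missing_error = error

print(type(relative_error).__name__, type(missing_error).__name__)
```

Since LazyLoader calls import_module with a single argument, only the first form is available to it, which is why name has to be the full path.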

Notably, invoking importlib.import_module(self.__name__) caches the module under the name self.__name__ in sys.modules, the interpreter-wide cache of imported modules. If you run the following lines in __main__.py


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("heavy_module", globals(), "lazyloading.heavy_module")

    # check whether the module is cached at the global scope
    import sys
    print("lazyloading.heavy_module" in sys.modules)

    # accessing any attribute to load the module
    heavy_module.HEAVY_ATTRIBUTE

    print("lazyloading.heavy_module" in sys.modules)

and run the package, then the logs should be:


python -m lazyloading
False
I am heavier than Pytorch!
True
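The same before/after check can be reproduced outside the package. The sketch below writes a throwaway module into a temporary directory (demo_heavy_mod is a hypothetical name chosen for this example), imports it with importlib.import_module, and also checks that a second import returns the cached object rather than running the file again.

```python
import importlib
import sys
import tempfile
from pathlib import Path

MODULE_NAME = "demo_heavy_mod"  # hypothetical throwaway module for this demo

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, f"{MODULE_NAME}.py").write_text("VALUE = 42\n")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()  # make sure the new file is visible to the finders
    try:
        before = MODULE_NAME in sys.modules
        first = importlib.import_module(MODULE_NAME)
        after = MODULE_NAME in sys.modules
        second = importlib.import_module(MODULE_NAME)  # served from sys.modules
    finally:
        sys.path.remove(tmp)

print(before, after, first is second)
```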

This way of caching using sys.modules is related to the next two lines that also cache the module in different ways.

3-3. Caching the module with given local_name


# <add the name of the module to the importing module's namespace>
# <so that you can use this module's name as a variable inside the importing module, even if it is called inside a function defined in the importing module>
self._parent_module_globals[self._local_name] = module

# <add the module to the list of loaded modules for caching>
# <see https:>
# <this makes it possible to import the cached module with the variable _local_name>
sys.modules[self._local_name] = module

Both lines cache the module in the dictionaries self._parent_module_globals and sys.modules respectively, but with the key self._local_name (not self.__name__). This is the value we provided as local_name when creating this proxy module instance with __init__(). But what does this caching accomplish?

First, we can use the module under the given _local_name in the "parent module’s" globals (from the parameter’s name, and from how MLflow uses it in its top-level __init__.py, we can infer that globals here means globals()). This means that importing the module inside a function doesn’t prevent it from being used outside the function’s scope:

if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader

    def load_heavy_module() -> None:
        # import the module inside a function
        heavy_module = LazyLoader("heavy_module", globals(), "lazyloading.heavy_module")
        print(heavy_module.HEAVY_ATTRIBUTE)

    # loads the heavy_module inside the function's scope
    load_heavy_module()

    # the module is now in the scope of this module
    print(heavy_module)

Running the package gives:


python -m lazyloading
I am heavier than Pytorch!
heavy
<module 'lazyloading.heavy_module' from '...'> # '...' is the path of heavy_module (a Python file)

Of course, if you pass locals() as the second argument instead, you’ll get a NameError (give it a try!).
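The difference between the two dictionaries can be sketched without any imports at all. The helper bind and the names bound_via_globals and bound_via_locals below are hypothetical, chosen only for this illustration: writing into globals() from inside a function creates a module-level name, while writing into locals() does not.

```python
def bind(namespace, name, value):
    """Store value under name in the given namespace dictionary."""
    namespace[name] = value


def use_globals():
    # globals() is the live dictionary of this module's top-level names.
    bind(globals(), "bound_via_globals", "visible at module level")


def use_locals():
    # locals() here only describes this function call's own scope.
    bind(locals(), "bound_via_locals", "not visible at module level")


use_globals()
use_locals()

print(bound_via_globals)                # the name now exists at module level
print("bound_via_locals" in globals())  # the locals() write never reached it
```

This is the same mechanism _load() relies on when it assigns into self._parent_module_globals.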

Second, we can also import the module in any other place inside the whole package with the given local name. Let’s create another module heavy_module_loader.py inside the current package lazyloading:


lazyloading/
├─ __init__.py
├─ __main__.py
├─ lazy_load.py
├─ heavy_module.py
├─ heavy_module_loader.py

Note that I used a custom name, heavy_module_local, as the local_name of the proxy module.


# heavy_module_loader.py

from lazyloading.lazy_load import LazyLoader

heavy_module = LazyLoader("heavy_module_local", globals(), "lazyloading.heavy_module")
heavy_module.HEAVY_ATTRIBUTE

Now let __main__.py be simpler:


from lazyloading import heavy_module_loader

if __name__ == "__main__":
    import heavy_module_local
    print(heavy_module_local)

Your IDE will probably flag this import as unresolved, but actually running it gives us the expected result:


python -m lazyloading
I am heavier than Pytorch!
<module 'lazyloading.heavy_module' from '...'> # '...' is the path of heavy_module (a Python file)

Although MLflow seems to use the same string value for both local_name and name when creating LazyLoader instances, we can use the local_name as an alias for the actual package name, thanks to this caching mechanism.
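The aliasing effect comes from the import statement consulting sys.modules before searching for files. As a minimal sketch, the example below registers the standard library’s json module under a made-up alias, json_alias_demo, and then imports it by that alias; no file with that name exists anywhere.

```python
import json
import sys

ALIAS = "json_alias_demo"  # hypothetical alias; no such module exists on disk

sys.modules[ALIAS] = json   # register the real module under the alias
try:
    import json_alias_demo  # resolved from sys.modules, no file lookup needed
    same_object = json_alias_demo is json
finally:
    del sys.modules[ALIAS]  # clean up the demo entry

print(same_object)
```

The flip side is that such an alias lives in a process-wide dictionary, so a local_name that collides with a real module name could shadow it.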

3-4. Caching the attributes of the actual module in __dict__


# Update this object's dict so that if someone keeps a reference to the `LazyLoader`,
# lookups are efficient (`__getattr__` is only called on lookups that fail).
self.__dict__.update(module.__dict__)

In Python, the attribute __dict__ gives the dictionary of attributes of the given object. Updating this proxy module’s attributes with the actual module’s ones makes it easier for the user to access the attributes of the real one. As we discussed in section 2 (Accessing an attribute - __getattribute__, __getattr__, and getattr) and as noted in the comments of the original source code, this lets __getattribute__ find the target attributes directly on the proxy, so __getattr__ is no longer needed for them.

In my view, this part is somewhat unnecessary, as we already cache modules and use them whenever their attributes are accessed. However, this could be useful when we need to debug and inspect __dict__.

4. __dir__ and __repr__

Similar to __dict__, these two dunder functions might not be strictly necessary when using LazyLoader modules. However, they could be useful for debugging. __repr__ is particularly helpful as it indicates whether the module has been loaded.


if not self._module:
    return f"<module '{self.__name__} (Not loaded yet)'>"
return repr(self._module)
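To watch both dunder methods in action, the following self-contained sketch runs a condensed variant of the class, named MiniLazyLoader here, against a throwaway module written to a temporary directory. The class name, the module name demo_lazy_target, and the plain dictionary standing in for globals() are assumptions of this sketch. So is the line that stores the loaded module on self._module, which I added so that __repr__ can report the loaded state.

```python
import importlib
import sys
import tempfile
import types
from pathlib import Path


class MiniLazyLoader(types.ModuleType):
    """Condensed variant of the LazyLoader discussed here, for illustration only."""

    def __init__(self, local_name, parent_module_globals, name):
        self._local_name = local_name
        self._parent_module_globals = parent_module_globals
        self._module = None
        super().__init__(str(name))

    def _load(self):
        if self._module:
            return self._module
        module = importlib.import_module(self.__name__)
        self._parent_module_globals[self._local_name] = module
        sys.modules[self._local_name] = module
        self.__dict__.update(module.__dict__)
        self._module = module  # assumption of this sketch: remember the loaded module
        return module

    def __getattr__(self, item):
        return getattr(self._load(), item)

    def __dir__(self):
        return dir(self._load())

    def __repr__(self):
        if not self._module:
            return f"<module '{self.__name__} (Not loaded yet)'>"
        return repr(self._module)


namespace = {}  # stands in for the importing module's globals()

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo_lazy_target.py").write_text("HEAVY_ATTRIBUTE = 'heavy'\n")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        proxy = MiniLazyLoader("demo_lazy_target", namespace, "demo_lazy_target")
        repr_before = repr(proxy)  # does not trigger the import
        loaded_before = "demo_lazy_target" in sys.modules
        names = dir(proxy)         # __dir__ triggers the real import
        loaded_after = "demo_lazy_target" in sys.modules
        repr_after = repr(proxy)   # now delegates to the real module's repr
    finally:
        sys.path.remove(tmp)

print(repr_before)
print(loaded_before, loaded_after, "HEAVY_ATTRIBUTE" in names)
```

Note that repr() leaves the module unloaded while dir() loads it, so inspecting a proxy in a debugger or REPL may be enough to trigger the heavy import.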

Conclusion

Although the source code itself is quite short, we covered several advanced topics, including importing modules, module scopes, and accessing object attributes in Python. Also, the concept of lazyloading is very common in computer science, but we rarely get the chance to examine how it is implemented in detail. By investigating how LazyLoader works, we learned more than we expected. Our biggest takeaway is that short code doesn’t necessarily mean easy code to analyze!
