
[Python] How do we lazyload a Python module? - analyzing LazyLoader from MLflow

DDD
2024-10-05 22:10:03


(image source: https://www.irasutoya.com/2019/03/blog-post_72.html)

Introduction

One day I was hopping around a few popular ML libraries in Python, including MLflow. While looking at its source code, one class caught my interest: LazyLoader in __init__.py (well, this is actually mirrored from the wandb project, but the original code has since diverged from what MLflow uses now, as you can see).

You may have heard of the concept of lazy loading in many contexts, such as image loading on web frontends, caching strategies, and so on. I think the essence of all those lazy loading ideas is, "I’m too lazy to load RIGHT NOW" - yes, the hidden words are "right now". That is, the application loads and uses a resource only when it is needed. So here in the MLflow library, a module is loaded only when the resources inside it (variables, functions, and classes) are accessed.

But HOW? That was my main interest. So I read the source code, which looks very simple at first glance. Surprisingly, though, it took a while to understand how it works, and I learned a lot from reading it. This article analyzes that MLflow source code so that we understand how such lazy loading works using several techniques of the Python language.

Playing around with LazyLoader

For our analysis, I created a simple package called lazyloading on my local machine and laid out the modules as follows:


lazyloading/
├─ __init__.py
├─ __main__.py
├─ lazy_load.py
├─ heavy_module.py


  • __init__.py: This file turns the whole directory into a package.
  • __main__.py: This file is the entry point when we run the whole package with python -m lazyloading.
  • lazy_load.py: LazyLoader lives in this file.
  • heavy_module.py: This simulates a module that pulls in a heavy package (such as PyTorch):

import time

for i in range(5):
    time.sleep(1)
    print(5 - i, " seconds left before loading")

print("I am heavier than Pytorch!")

HEAVY_ATTRIBUTE = "heavy"


Next, we import this heavy_module inside __main__.py:


if __name__ == "__main__":
    from lazyloading import heavy_module 


Let’s run this package and see the result:


python -m lazyloading
5  seconds left before loading
4  seconds left before loading
3  seconds left before loading
2  seconds left before loading
1  seconds left before loading
I am heavier than Pytorch!


Here we can clearly see that simply importing a heavy package like PyTorch can become an overhead for the whole application. That is why we need lazy loading here. Let’s change __main__.py to look like this:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")
    print("nothing happens yet")
    print(heavy_module.HEAVY_ATTRIBUTE)


And the result should be:


python -m lazyloading
nothing happens yet
5  seconds left before loading
4  seconds left before loading
3  seconds left before loading
2  seconds left before loading
1  seconds left before loading
I am heavier than Pytorch!
heavy


Yes, a module imported through LazyLoader does not execute any of its code or import other packages up front. That happens only when an attribute of the module is accessed. This is the power of lazy loading!
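The behaviour above can be reproduced in a single file. The following is a minimal sketch, not MLflow's class: DeferredModule, demo_heavy_mod, and the temporary directory are hypothetical names introduced only for illustration, and the proxy skips all of the caching that the real LazyLoader does.

```python
import importlib
import sys
import tempfile
import types
from pathlib import Path

# Write a throwaway module to a temporary directory so the example is self-contained.
# `demo_heavy_mod` is a hypothetical stand-in for heavy_module.py.
tmp_dir = tempfile.mkdtemp()
Path(tmp_dir, "demo_heavy_mod.py").write_text('HEAVY_ATTRIBUTE = "heavy"\n')
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()


class DeferredModule(types.ModuleType):
    """Minimal proxy: import the real module when an attribute lookup fails."""

    def __getattr__(self, item: str):
        # Only reached when the normal lookup on the proxy raised AttributeError.
        module = importlib.import_module(self.__name__)
        return getattr(module, item)


proxy = DeferredModule("demo_heavy_mod")
loaded_before = "demo_heavy_mod" in sys.modules  # nothing has been imported yet
value = proxy.HEAVY_ATTRIBUTE                    # this access triggers the real import
loaded_after = "demo_heavy_mod" in sys.modules

print(loaded_before, value, loaded_after)  # False heavy True
```

Creating the proxy costs almost nothing; the import only runs on the first attribute access.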

How does LazyLoader work in MLflow? - source code analysis

The code itself is short and simple. I added type annotations and a few comments (the lines enclosed in <, >) for explanation. All other comments are the ones found in the original source code.


"""Utility to lazy load modules."""
import importlib
import sys
import types

from typing import Any, TypeVar

T = TypeVar("T") # <this is added by me>

class LazyLoader(types.ModuleType):
    """Class for module lazy loading.

    This class helps lazily load modules at package level, which avoids pulling in large
    dependencies like `tensorflow` or `torch`. This class is mirrored from wandb's LazyLoader:
    https://github.com/wandb/wandb/blob/79b2d4b73e3a9e4488e503c3131ff74d151df689/wandb/sdk/lib/lazyloader.py#L9
    """

    _local_name: str # <the name of the package that is used inside code>
    _parent_module_globals: dict[str, types.ModuleType] # <importing module's namespace, accessible by calling globals()>
    _module: types.ModuleType | None # <actual module>

    def __init__(
        self, 
        local_name: str, 
        parent_module_globals: dict[str, types.ModuleType], 
        name: Any # <to be used in types.ModuleType(name=str(name)), the full package name (such as pkg.subpkg.subsubpkg)>
    ):
        self._local_name = local_name
        self._parent_module_globals = parent_module_globals
        self._module = None

        super().__init__(str(name)) 

    def _load(self) -> types.ModuleType:
        """Load the module and insert it into the parent's globals."""
        if self._module:
            # If already loaded, return the loaded module.
            return self._module

        # Import the target module and insert it into the parent's namespace

        # <see https://docs.python.org/3/library/importlib.html#importlib.import_module>
        # <absolute import, importing the module itself from a package rather than the top-level package only(like __import__)>
        # <here, self.__name__ is the variable `name` in __init__>
        # <this is why that `name` in __init__ must be the full module path>
        module = importlib.import_module(self.__name__) # this automatically updates sys.modules

        # <add the name of the module to the importing module(=parent module)'s namespace>
        # <so that you can use this module's name as a variable inside the importing module, even if it is called inside a function defined in the importing module>
        self._parent_module_globals[self._local_name] = module

        # <add the module to the list of loaded modules for caching>
        # <see https://docs.python.org/3/reference/import.html#the-module-cache>
        # <this makes it possible to import the cached module under the name _local_name>
        sys.modules[self._local_name] = module

        # Update this object's dict so that if someone keeps a reference to the `LazyLoader`,
        # lookups are efficient (`__getattr__` is only called on lookups that fail).
        self.__dict__.update(module.__dict__)

        return module

    def __getattr__(self, item: str) -> Any:
        module = self._load()
        return getattr(module, item)

    def __dir__(self):
        module = self._load()
        return dir(module)

    def __repr__(self):
        if not self._module:
            return f"<module '{self.__name__} (Not loaded yet)'>"
        return repr(self._module)


Now, let’s investigate the code while lazily loading our heavy_module. Since we no longer need to simulate the weight of the module, let’s remove the time.sleep(1) loop.

1. Creating a LazyLoader instance, proxying the original module

Let’s look at LazyLoader’s __init__().


class LazyLoader(types.ModuleType):
    # …
    # code omitted
    # …

    def __init__(
        self, 
        local_name: str, 
        parent_module_globals: dict[str, types.ModuleType], 
        name: Any # <to be used in types.ModuleType(name=str(name)); the full package name(such as pkg.subpkg.subsubpkg)>
    ):
        self._local_name = local_name
        self._parent_module_globals = parent_module_globals
        self._module = None

        super().__init__(str(name)) 


We pass local_name, parent_module_globals, and name to the __init__() constructor. At this point we are not sure what all of them mean, but at least the last line shows that we are actually creating a module, super().__init__(str(name)), because LazyLoader inherits from types.ModuleType. By passing the name variable, the module created by LazyLoader is recognized as a module named name (the same as heavy_module.__name__).

Printing the module’s name confirms this:


# __main__.py
# run python -m lazyloading

if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")

    print(heavy_module.__name__)


which prints on our terminal:


lazyloading.heavy_module


However, in the constructor we only assign values to instance variables and give this proxy module its name. Now, what happens when we try to access an attribute of the module?
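To see in isolation what the constructor produces, here is a small sketch. ProxyModule is a hypothetical, stripped-down stand-in for LazyLoader that keeps only the constructor (without the globals argument), so nothing is ever imported.

```python
import types


class ProxyModule(types.ModuleType):
    """Hypothetical stand-in for LazyLoader that only runs the constructor logic."""

    def __init__(self, local_name: str, name: str):
        self._local_name = local_name
        self._module = None
        # types.ModuleType.__init__ stores the name as the instance's __name__.
        super().__init__(str(name))


proxy = ProxyModule("hm", "lazyloading.heavy_module")

print(isinstance(proxy, types.ModuleType))  # True
print(proxy.__name__)                       # lazyloading.heavy_module
print("HEAVY_ATTRIBUTE" in vars(proxy))     # False
```

The object already looks like a module named lazyloading.heavy_module, yet it holds none of the real module's attributes.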

2. Accessing an attribute - __getattribute__, __getattr__, and getattr

This is one of the fun parts of this class. What happens when we access an attribute of a Python object in general? Suppose we access HEAVY_ATTRIBUTE of heavy_module by writing heavy_module.HEAVY_ATTRIBUTE. From the code here, or from your own experience in Python projects, you might guess that __getattr__() is called, and that is partly correct. Look at the official docs:

Called when the default attribute access fails with an AttributeError (either __getattribute__() raises an AttributeError because name is not an instance attribute or an attribute in the class tree for self; or __get__() of a name property raises AttributeError).

(Please ignore __get__ because it is out of scope of this post, and our LazyLoader doesn’t implement __get__ either).

So the key method here is __getattribute__(). According to the docs, when we try to access an attribute, __getattribute__ is called first, and if it cannot find the attribute we’re looking for, AttributeError is raised, which in turn invokes our __getattr__ in the code. To verify this, let’s override __getattribute__ of the LazyLoader class, and change __getattr__() a little bit as follows:


def __getattribute__(self, name: str) -> Any:
    try:
        print(f"__getattribute__ is called when accessing attribute '{name}'")
        return super().__getattribute__(name)

    except Exception as error:
        print(f"an error has occurred when __getattribute__() is invoked as accessing '{name}': {error}")
        raise

def __getattr__(self, item: str) -> Any:
    print(f"__getattr__ is called when accessing attribute '{item}'")
    module = self._load()
    return getattr(module, item)


When we access HEAVY_ATTRIBUTE that exists in heavy_module, the result is:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")

    print(heavy_module.HEAVY_ATTRIBUTE)



python -m lazyloading
__getattribute__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
an error has occurred when __getattribute__() is invoked as accessing 'HEAVY_ATTRIBUTE': module 'lazyloading.heavy_module' has no attribute 'HEAVY_ATTRIBUTE'
__getattr__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
__getattribute__ is called when accessing attribute '_load'
__getattribute__ is called when accessing attribute '_module'
__getattribute__ is called when accessing attribute '__name__'
I am heavier than Pytorch!
__getattribute__ is called when accessing attribute '_parent_module_globals'
__getattribute__ is called when accessing attribute '_local_name'
__getattribute__ is called when accessing attribute '__dict__'
heavy


So __getattr__ is not called directly. __getattribute__ is called first and raises AttributeError because our LazyLoader instance has no attribute HEAVY_ATTRIBUTE yet; only then is __getattr__() called as a fallback. Inside it we meet getattr(): with item equal to "HEAVY_ATTRIBUTE", getattr(module, item) is equivalent to writing module.HEAVY_ATTRIBUTE. So eventually we access HEAVY_ATTRIBUTE on the actual module heavy_module, provided that the module variable in __getattr__() is correctly imported and returned by self._load().

But before we move on to investigating _load() method, let’s call HEAVY_ATTRIBUTE once again in __main__.py and run the package:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")

    print(heavy_module.HEAVY_ATTRIBUTE)
    print(heavy_module.HEAVY_ATTRIBUTE)


Now we see the additional logs on the terminal:


# … the same log as above
__getattribute__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
heavy


It seems that __getattribute__ can now find HEAVY_ATTRIBUTE inside the proxy module (our LazyLoader instance). This is because (spoiler alert!) _load copies the attributes of the loaded module into the __dict__ of the LazyLoader instance. We’ll get back to this in the next section.

3. Loading and caching the actual module

This section covers the core part of the post - loading the actual module in the function _load().

3-1. Module caching at the level of LazyLoader class

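First, it checks whether our LazyLoader instance has already imported the module before (which reminds us of the Singleton pattern). Note that in the listing above nothing assigns self._module after the import, so as written this branch is never taken; repeated lookups are served by the __dict__ update described in 3-4 and by the sys.modules cache instead.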


if self._module:
    # If already loaded, return the loaded module.
    return self._module


3-2. Importing the actual module with importlib.import_module

Otherwise, the method tries to import the module named __name__, which we saw in the __init__ constructor:


# <see https://docs.python.org/3/library/importlib.html#importlib.import_module>
# <absolute import, importing the module itself from a package rather than the top-level package only(like __import__)>
# <here, self.__name__ is the variable `name` in __init__>
# <this is why that `name` in __init__ must be the full module path>
module = importlib.import_module(self.__name__) # this automatically updates sys.modules


According to the docs of importlib.import_module, when we pass only the name string and omit the package argument, the function imports the module in absolute terms. Therefore, when we create a LazyLoader instance, the name argument should be the absolute, fully qualified module name. You can run your own experiment to see that a bare name raises ModuleNotFoundError:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("heavy_module", globals(), "heavy_module")

    print(heavy_module.HEAVY_ATTRIBUTE)



# logs omitted
ModuleNotFoundError: No module named 'heavy_module'
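The difference between absolute and relative names can be checked without the lazyloading package. The sketch below builds a throwaway package in a temporary directory; demo_pkg and demo_inner are hypothetical names used only here.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package `demo_pkg` containing the submodule `demo_inner`.
root = tempfile.mkdtemp()
pkg_dir = Path(root, "demo_pkg")
pkg_dir.mkdir()
Path(pkg_dir, "__init__.py").write_text("")
Path(pkg_dir, "demo_inner.py").write_text("VALUE = 42\n")
sys.path.insert(0, root)
importlib.invalidate_caches()

# Absolute import: the full dotted path, no `package` argument needed.
absolute = importlib.import_module("demo_pkg.demo_inner")

# Relative import: a leading dot plus the `package` anchor it is resolved against.
relative = importlib.import_module(".demo_inner", package="demo_pkg")

# A bare submodule name is treated as a top-level module and is not found.
try:
    importlib.import_module("demo_inner")
    bare_name_found = True
except ModuleNotFoundError:
    bare_name_found = False

print(absolute is relative)                          # True
print(sys.modules["demo_pkg.demo_inner"] is absolute)  # True
print(bare_name_found)                               # False
```

LazyLoader only ever calls import_module(self.__name__) without a package, so the name it receives has to be the full dotted path.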


Notably, invoking importlib.import_module(self.__name__) caches the module under the name self.__name__ in sys.modules, the interpreter-wide module cache. If you put the following lines in __main__.py


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("heavy_module", globals(), "lazyloading.heavy_module")

    # check whether the module is cached at the global scope
    import sys
    print("lazyloading.heavy_module" in sys.modules)

    # accessing any attribute to load the module
    heavy_module.HEAVY_ATTRIBUTE

    print("lazyloading.heavy_module" in sys.modules)


and run the package, then the logs should be:


python -m lazyloading
False
I am heavier than Pytorch!
True


This way of caching using sys.modules is related to the next two lines that also cache the module in different ways.

3-3. Caching the module with given local_name


# <add the name of the module to the importing module(=parent module)'s namespace>
# <so that you can use this module's name as a variable inside the importing module, even if it is called inside a function defined in the importing module>
self._parent_module_globals[self._local_name] = module

# <add the module to the list of loaded modules for caching>
# <see https://docs.python.org/3/reference/import.html#the-module-cache>
# <this makes it possible to import the cached module under the name _local_name>
sys.modules[self._local_name] = module


Both lines cache the module, in the dictionaries self._parent_module_globals and sys.modules respectively, but under the key self._local_name (not self.__name__). This is the value we provided as local_name when creating this proxy module instance with __init__(). But what does this caching accomplish?

First, we can use the module under the given _local_name in the "parent module"’s globals (from the parameter’s name and from how MLflow uses it in its top-level __init__.py, we can infer that globals here means the dictionary returned by globals()). This means that creating the LazyLoader inside a function doesn’t confine the loaded module to that function’s scope:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader

    def load_heavy_module() -> None:
        # import the module inside a function
        heavy_module = LazyLoader("heavy_module", globals(), "lazyloading.heavy_module")
        print(heavy_module.HEAVY_ATTRIBUTE)

    # loads the heavy_module inside the function's scope
    load_heavy_module()

    # the module is now in the scope of this module
    print(heavy_module)


Running the package gives:


python -m lazyloading
I am heavier than Pytorch!
heavy
<module 'lazyloading.heavy_module' from '…'> # the path of the heavy_module (a Python file)


Of course, if you pass locals() as the second argument instead, you’ll get a NameError (give it a try!).
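The contrast between globals() and locals() can be seen without any module loading at all. In this minimal sketch the function and variable names are hypothetical; it only shows which dictionary writes actually reach the module namespace.

```python
def bind_in_globals() -> None:
    # globals() is the real namespace dict of the module that defines this function,
    # so a write here creates a module-level name.
    globals()["shared_name"] = "visible at module level"


def bind_in_locals() -> None:
    # Inside a function, writing to locals() never reaches the module namespace.
    locals()["private_name"] = "never reaches module level"


bind_in_globals()
bind_in_locals()

shared_visible = "shared_name" in globals()
private_visible = "private_name" in globals()

print(shared_visible)   # True
print(private_visible)  # False
```

This is why LazyLoader asks the caller for globals(): it is the one dictionary through which _load() can rebind the module-level name to the real module.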

Second, we can also import the module in any other place inside the whole package with the given local name. Let’s create another module heavy_module_loader.py inside the current package lazyloading :


lazyloading/
├─ __init__.py
├─ __main__.py
├─ lazy_load.py
├─ heavy_module.py
├─ heavy_module_loader.py


Note that I pass a custom name, heavy_module_local, as the local_name of the proxy module.


# heavy_module_loader.py

from lazyloading.lazy_load import LazyLoader

heavy_module = LazyLoader("heavy_module_local", globals(), "lazyloading.heavy_module")
heavy_module.HEAVY_ATTRIBUTE


Now let __main__.py be simpler:


from lazyloading import heavy_module_loader

if __name__ == "__main__":
    import heavy_module_local
    print(heavy_module_local)


Your IDE will probably flag this import as an error because it cannot resolve heavy_module_local, but actually running it gives us the expected result:


python -m lazyloading
I am heavier than Pytorch!
<module 'lazyloading.heavy_module' from '…'> # the path of the heavy_module (a Python file)


Although MLflow seems to use the same string value for both local_name and name when creating LazyLoader instances, we can use the local_name as an alias for the actual package name, thanks to this caching mechanism.
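The aliasing effect comes purely from sys.modules being an ordinary dictionary that the import statement consults first. The following sketch shows this with an in-memory module; demo_real_module and demo_alias are hypothetical names, and no file on disk is involved.

```python
import importlib
import sys
import types

# Create a module object in memory (hypothetical stand-in for the loaded heavy module).
real = types.ModuleType("demo_real_module")
real.HEAVY_ATTRIBUTE = "heavy"

# Register it under its own name and under an alias, like sys.modules[local_name] = module.
sys.modules["demo_real_module"] = real
sys.modules["demo_alias"] = real

import demo_alias  # resolved straight from sys.modules, no file lookup happens

same_object = demo_alias is real
via_importlib = importlib.import_module("demo_alias") is real

print(same_object, via_importlib)   # True True
print(demo_alias.__name__)          # demo_real_module
```

The alias points at the very same module object, so the module keeps its original __name__, just as the output above shows lazyloading.heavy_module rather than heavy_module_local.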

3-4. Caching the attributes of the actual module in __dict__


# Update this object's dict so that if someone keeps a reference to the `LazyLoader`,
# lookups are efficient (`__getattr__` is only called on lookups that fail).
self.__dict__.update(module.__dict__)


In Python, the __dict__ attribute holds the dictionary of an object’s attributes. Updating the proxy module’s __dict__ with the actual module’s entries makes it easier to reach the real module’s attributes. As we discussed in section 2 (2. Accessing an attribute - __getattribute__, __getattr__, and getattr) and as noted in the comments of the original source code, after this update __getattribute__ finds the target attributes directly on the proxy, so the __getattr__ fallback is no longer needed for them.

In my view, this part is somewhat unnecessary, as we already cache modules and use them whenever their attributes are accessed. However, this could be useful when we need to debug and inspect __dict__.

4. __dir__ and __repr__

Similar to __dict__, these two dunder functions might not be strictly necessary when using LazyLoader modules. However, they could be useful for debugging. __repr__ is particularly helpful as it indicates whether the module has been loaded.


if not self._module:
    return f"<module '{self.__name__} (Not loaded yet)'>"
return repr(self._module)




Conclusion

Although the source code itself is quite short, we covered several advanced topics, including importing modules, module scopes, and accessing object attributes in Python. Also, the concept of lazyloading is very common in computer science, but we rarely get the chance to examine how it is implemented in detail. By investigating how LazyLoader works, we learned more than we expected. Our biggest takeaway is that short code doesn’t necessarily mean easy code to analyze!
