[Python] How do we lazy-load a Python module? - analyzing LazyLoader from MLflow
(image source: https://www.irasutoya.com/2019/03/blog-post_72.html)
One day I was browsing a few popular ML libraries in Python, including MLflow. While glancing at its source code, one class caught my interest: LazyLoader in __init__.py (well, it actually mirrors the wandb project, but the original code has since changed from what MLflow uses now, as you can see).
You have probably heard of lazy loading in many contexts, such as image loading on the web front end, caching strategies, and so on. I think the essence of all these lazy-loading concepts is "I'm too lazy to load right NOW" - yes, with the hidden words "right now". In other words, the application loads and uses a resource only when it is needed. So here, in the MLflow library, modules are loaded only when the resources inside them (variables, functions, and classes) are accessed.
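As a minimal sketch of that idea in plain Python (the `LazyValue` class and the `expensive_setup` function are hypothetical names for illustration, not part of MLflow), the work is deferred until the first access and then reused:

```python
from typing import Any, Callable


class LazyValue:
    """Hypothetical helper: defers a computation until it is first needed."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._loaded = False
        self._value: Any = None

    def get(self) -> Any:
        # Run the factory only on the first call, then reuse the result.
        if not self._loaded:
            self._value = self._factory()
            self._loaded = True
        return self._value


calls = []


def expensive_setup() -> str:
    calls.append("ran")  # record every real execution
    return "resource"


lazy = LazyValue(expensive_setup)
calls_before_access = len(calls)  # creating the wrapper ran nothing
first = lazy.get()
second = lazy.get()
calls_after_access = len(calls)  # the factory ran exactly once
```

LazyLoader applies the same pattern to a whole module, with attribute access playing the role of `get()`.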
But HOW? That was my main interest. So I read the source code, which looked very simple at first glance. Surprisingly, though, it took a while to understand how it works, and I learned a lot from reading it. This post analyzes that MLflow source code so that we understand how such lazy loading works using various techniques of the Python language.
For the purposes of our analysis, I created a simple package called lazyloading on my local machine and laid out the modules as follows:
lazyloading/
├─ __init__.py
├─ __main__.py
├─ lazy_load.py
├─ heavy_module.py
# heavy_module.py
import time

# simulate a slow import
for i in range(5):
    time.sleep(1)
    print(5 - i, "seconds left before loading")

print("I am heavier than Pytorch!")

HEAVY_ATTRIBUTE = "heavy"
Then we import this heavy_module in __main__.py:
if __name__ == "__main__":
    from lazyloading import heavy_module
Let's run this package and see the result:
python -m lazyloading
5 seconds left before loading
4 seconds left before loading
3 seconds left before loading
2 seconds left before loading
1 seconds left before loading
I am heavier than Pytorch!
Here we can clearly see that simply importing heavy packages such as PyTorch can add overhead to the whole application. That is why we need lazy loading here. Let's change __main__.py to look like this:
if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader

    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")
    print("nothing happens yet")
    print(heavy_module.HEAVY_ATTRIBUTE)
And the result should be:
python -m lazyloading
nothing happens yet
5 seconds left before loading
4 seconds left before loading
3 seconds left before loading
2 seconds left before loading
1 seconds left before loading
I am heavier than Pytorch!
heavy
Yes, a module imported through LazyLoader does not run any of its code or import other packages up front. That happens only when an attribute of the module is accessed. This is the power of lazy loading!
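For comparison, the standard library ships its own mechanism, importlib.util.LazyLoader. The sketch below follows the lazy-import recipe from the importlib documentation. The throwaway module `lazy_demo_target` and the environment variable `LAZY_DEMO_EXECUTED` are made up for this demo so that we can observe when the module body actually runs. This is not how MLflow does it, only a point of reference.

```python
import importlib
import importlib.util
import os
import sys
import tempfile


def lazy_import(name: str):
    """Defer executing module `name` until its first attribute access."""
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # with LazyLoader this only arms the module
    return module


# Hypothetical throwaway module that leaves a marker when its body runs.
tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, "lazy_demo_target.py"), "w") as f:
    f.write("import os\nos.environ['LAZY_DEMO_EXECUTED'] = '1'\nVALUE = 42\n")
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()
sys.modules.pop("lazy_demo_target", None)
os.environ.pop("LAZY_DEMO_EXECUTED", None)

target = lazy_import("lazy_demo_target")
executed_before = "LAZY_DEMO_EXECUTED" in os.environ  # body has not run yet
value = target.VALUE  # first attribute access triggers the real execution
executed_after = "LAZY_DEMO_EXECUTED" in os.environ
```

The observable behavior matches what we saw above: nothing runs until an attribute is touched.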
The code itself is short and simple. I added type annotations and a few comments (lines wrapped in <, >) for explanation. All other comments come from the original source code.
"""Utility to lazy load modules."""

import importlib
import sys
import types
from typing import Any, TypeVar

T = TypeVar("T")  # <this is added by me>


class LazyLoader(types.ModuleType):
    """Class for module lazy loading.

    This class helps lazily load modules at package level, which avoids pulling in large
    dependencies like `tensorflow` or `torch`. This class is mirrored from wandb's LazyLoader:
    https://github.com/wandb/wandb/blob/79b2d4b73e3a9e4488e503c3131ff74d151df689/wandb/sdk/lib/lazyloader.py#L9
    """

    _local_name: str  # <the name of the package that is used inside code>
    _parent_module_globals: dict[str, types.ModuleType]  # <importing module's namespace, accessible by calling globals()>
    _module: types.ModuleType | None  # <actual module>

    def __init__(
        self,
        local_name: str,
        parent_module_globals: dict[str, types.ModuleType],
        name: Any  # <to be used in types.ModuleType(name=str(name)), the full package name (such as pkg.subpkg.subsubpkg)>
    ):
        self._local_name = local_name
        self._parent_module_globals = parent_module_globals
        self._module = None
        super().__init__(str(name))

    def _load(self) -> types.ModuleType:
        """Load the module and insert it into the parent's globals."""
        if self._module:
            # If already loaded, return the loaded module.
            return self._module

        # Import the target module and insert it into the parent's namespace
        # <see https://docs.python.org/3/library/importlib.html#importlib.import_module>
        # <absolute import, importing the module itself from a package rather than the top-level package only (like __import__)>
        # <here, self.__name__ is the variable `name` in __init__>
        # <this is why `name` in __init__ must be the full module path>
        module = importlib.import_module(self.__name__)  # this automatically updates sys.modules

        # <add the name of the module to the importing module (= parent module)'s namespace>
        # <so that you can use this module's name as a variable inside the importing module, even if it is called inside a function defined in the importing module>
        self._parent_module_globals[self._local_name] = module

        # <add the module to the list of loaded modules for caching>
        # <see https://docs.python.org/3/reference/import.html#the-module-cache>
        # <this makes it possible to import the cached module with the name _local_name>
        sys.modules[self._local_name] = module

        # Update this object's dict so that if someone keeps a reference to the `LazyLoader`,
        # lookups are efficient (`__getattr__` is only called on lookups that fail).
        self.__dict__.update(module.__dict__)

        return module

    def __getattr__(self, item: T) -> T:
        module = self._load()
        return getattr(module, item)

    def __dir__(self):
        module = self._load()
        return dir(module)

    def __repr__(self):
        if not self._module:
            return f"<module '{self.__name__} (Not loaded yet)'>"
        return repr(self._module)
Now let's study the code while lazily loading our heavy_module. Since we no longer need to simulate the module's heaviness, let's remove the time.sleep(1) loop.
Let's look at __init__() of LazyLoader.
class LazyLoader(types.ModuleType):
    # …
    # code omitted
    # …
    def __init__(
        self,
        local_name: str,
        parent_module_globals: dict[str, types.ModuleType],
        name: Any  # <to be used in types.ModuleType(name=str(name)); the full package name (such as pkg.subpkg.subsubpkg)>
    ):
        self._local_name = local_name
        self._parent_module_globals = parent_module_globals
        self._module = None
        super().__init__(str(name))
We pass local_name, parent_module_globals, and name to the constructor __init__(). For now we are not sure what all of these mean, but at least the last line, super().__init__(str(name)), tells us that we are actually creating a module, since LazyLoader inherits from types.ModuleType. By passing name, the module created by LazyLoader is recognized as a module with the name name (which is the same as heavy_module.__name__).
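To see what the super().__init__(str(name)) call gives us, here is a small sketch that uses types.ModuleType directly. The name `demo_pkg.demo_module` is made up. Constructing the object only creates an empty module carrying that name, without importing or running anything.

```python
import sys
import types

# Hypothetical module name; no such package needs to exist on disk.
placeholder = types.ModuleType("demo_pkg.demo_module")

module_name = placeholder.__name__
is_module = isinstance(placeholder, types.ModuleType)
registered = "demo_pkg.demo_module" in sys.modules  # creating it does not register it
has_payload = hasattr(placeholder, "HEAVY_ATTRIBUTE")  # and it has no contents yet
```

So the proxy looks like a module to the rest of the program while holding none of the real module's contents.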
Printing the module's name confirms this:
# __main__.py
# run python -m lazyloading
if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader

    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")
    print(heavy_module.__name__)
which prints the following on our terminal:
lazyloading.heavy_module
However, in the constructor we only assigned values to instance variables and gave the module's name to this proxy module. So what happens when we try to access an attribute of the module?
This is one of the fun parts of this post. What happens when we access an attribute of a Python object in general? Suppose we access HEAVY_ATTRIBUTE of heavy_module by writing heavy_module.HEAVY_ATTRIBUTE. From the code here, or from your own experience across Python projects, you might guess that __getattr__() is called, and that is partly correct. Look at the official docs:
Called when the default attribute access fails with an AttributeError (either __getattribute__() raises an AttributeError because name is not an instance attribute or an attribute in the class tree for self; or __get__() of a name property raises AttributeError).
(Please ignore __get__ because it is out of scope of this post, and our LazyLoader doesn’t implement __get__ either).
So the key method here is __getattribute__(). According to the docs, when we try to access an attribute, __getattribute__ is called first. If it cannot find the attribute we're looking for, it raises AttributeError, which in turn invokes the __getattr__ defined in our code. To verify this, let's override __getattribute__ in the LazyLoader class and change __getattr__() a little, as follows:
def __getattribute__(self, name: str) -> Any:
    try:
        print(f"__getattribute__ is called when accessing attribute '{name}'")
        return super().__getattribute__(name)
    except Exception as error:
        print(f"an error has occurred when __getattribute__() is invoked as accessing '{name}': {error}")
        raise

def __getattr__(self, item: T) -> T:
    print(f"__getattr__ is called when accessing attribute '{item}'")
    module = self._load()
    return getattr(module, item)
When we access HEAVY_ATTRIBUTE that exists in heavy_module, the result is:
if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader

    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")
    print(heavy_module.HEAVY_ATTRIBUTE)
python -m lazyloading
__getattribute__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
an error has occurred when __getattribute__() is invoked as accessing 'HEAVY_ATTRIBUTE': module 'lazyloading.heavy_module' has no attribute 'HEAVY_ATTRIBUTE'
__getattr__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
__getattribute__ is called when accessing attribute '_load'
__getattribute__ is called when accessing attribute '_module'
__getattribute__ is called when accessing attribute '__name__'
I am heavier than Pytorch!
__getattribute__ is called when accessing attribute '_parent_module_globals'
__getattribute__ is called when accessing attribute '_local_name'
__getattribute__ is called when accessing attribute '__dict__'
heavy
So __getattr__ is not called directly. __getattribute__ is called first and raises AttributeError because our LazyLoader instance doesn't have the attribute HEAVY_ATTRIBUTE yet. Only then is __getattr__() called as a fallback. Inside it we meet getattr(), but getattr(module, item) with item equal to "HEAVY_ATTRIBUTE" is equivalent to writing module.HEAVY_ATTRIBUTE. So eventually we access HEAVY_ATTRIBUTE in the actual module heavy_module, provided that the module variable in __getattr__() is correctly imported and returned by self._load().
But before we move on to investigating _load() method, let’s call HEAVY_ATTRIBUTE once again in __main__.py and run the package:
if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader

    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")
    print(heavy_module.HEAVY_ATTRIBUTE)
    print(heavy_module.HEAVY_ATTRIBUTE)
Now we see the additional logs on the terminal:
# … the same log as above
__getattribute__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
heavy
It seems that __getattribute__ can now find HEAVY_ATTRIBUTE inside the proxy module (our LazyLoader instance). This is because (spoiler alert!) _load() copies the real module's attributes into the __dict__ of the LazyLoader instance. We'll get back to this in the next section.
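The effect can be reproduced in isolation with a small sketch. The `Proxy` class below is a hypothetical stand-in for the LazyLoader instance, and a plain dict plays the role of the real module's namespace. It counts how often the __getattr__ fallback runs.

```python
class Proxy:
    """Hypothetical stand-in for the LazyLoader instance."""

    def __init__(self, target: dict) -> None:
        self._target = target
        self.fallback_calls = 0

    def __getattr__(self, item: str):
        # Only reached when the normal lookup (__getattribute__) fails.
        self.fallback_calls += 1
        # Copy everything into the instance dict, like _load() does.
        self.__dict__.update(self._target)
        try:
            return self._target[item]
        except KeyError:
            raise AttributeError(item) from None


proxy = Proxy({"HEAVY_ATTRIBUTE": "heavy", "OTHER": 1})
first = proxy.HEAVY_ATTRIBUTE   # not in __dict__ yet, so the fallback runs
second = proxy.HEAVY_ATTRIBUTE  # now found in proxy.__dict__, no fallback
third = proxy.OTHER             # copied along with everything else
fallback_calls = proxy.fallback_calls
```

After the first access, every attribute of the target resolves through the ordinary lookup path, which is why the second log above shows only __getattribute__.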
This section covers the core part of the post: loading the actual module in the method _load().
First, it checks whether our LazyLoader instance has already imported the module before (which reminds us of the Singleton pattern).
if self._module:
    # If already loaded, return the loaded module.
    return self._module
Otherwise, the method tries to import the module named __name__, which we saw in the __init__ constructor:
# <see https://docs.python.org/3/library/importlib.html#importlib.import_module>
# <absolute import, importing the module itself from a package rather than the top-level package only (like __import__)>
# <here, self.__name__ is the variable `name` in __init__>
# <this is why `name` in __init__ must be the full module path>
module = importlib.import_module(self.__name__)  # this automatically updates sys.modules
According to the docs of importlib.import_module, when we pass only the name string and omit the package argument, the function performs an absolute import. Therefore, when we create a LazyLoader instance, the name argument should be the full, absolute module path. You can run your own experiment and see that a bare name raises ModuleNotFoundError:
if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader

    heavy_module = LazyLoader("heavy_module", globals(), "heavy_module")
    print(heavy_module.HEAVY_ATTRIBUTE)
# logs omitted
ModuleNotFoundError: No module named 'heavy_module'
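The same behavior can be checked against the standard library alone. This sketch uses the json package as a stand-in for our lazyloading package, and the module name in the failing call is deliberately made up.

```python
import importlib

# Absolute import: the full dotted path is enough.
absolute = importlib.import_module("json.decoder")

# Relative import: a leading dot needs the `package` argument as an anchor.
relative = importlib.import_module(".decoder", package="json")
same_module = absolute is relative

# A bare name that is not importable from sys.path fails.
try:
    importlib.import_module("heavy_module_not_on_sys_path")  # hypothetical name
    bare_name_found = True
except ModuleNotFoundError:
    bare_name_found = False
```

Since LazyLoader calls import_module with a single argument, only the first form is available to it, hence the requirement for a full module path.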
Notably, invoking importlib.import_module(self.__name__) caches the module under the name self.__name__ in sys.modules, the interpreter-wide module cache. If you put the following lines in __main__.py
if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader

    heavy_module = LazyLoader("heavy_module", globals(), "lazyloading.heavy_module")

    # check whether the module is cached at the global scope
    import sys

    print("lazyloading.heavy_module" in sys.modules)

    # accessing any attribute to load the module
    heavy_module.HEAVY_ATTRIBUTE
    print("lazyloading.heavy_module" in sys.modules)
and run the package, then the logs should be:
python -m lazyloading
False
I am heavier than Pytorch!
True
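Here is a self-contained sketch of the same cache behavior that does not depend on our package. The module `cache_demo_module` is a hypothetical file written to a temporary directory just for this demo.

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway module on disk and make it importable.
tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, "cache_demo_module.py"), "w") as f:
    f.write("LOADED_OBJECT = object()\n")
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()
sys.modules.pop("cache_demo_module", None)  # start from a clean cache entry

cached_before = "cache_demo_module" in sys.modules
first = importlib.import_module("cache_demo_module")
cached_after = "cache_demo_module" in sys.modules

# A second import is served from sys.modules: same object, body not re-run.
second = importlib.import_module("cache_demo_module")
same_object = first is second and sys.modules["cache_demo_module"] is first
```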
This way of caching using sys.modules is related to the next two lines that also cache the module in different ways.
# <add the name of the module to the importing module (= parent module)'s namespace>
# <so that you can use this module's name as a variable inside the importing module, even if it is called inside a function defined in the importing module>
self._parent_module_globals[self._local_name] = module

# <add the module to the list of loaded modules for caching>
# <see https://docs.python.org/3/reference/import.html#the-module-cache>
# <this makes it possible to import the cached module with the name _local_name>
sys.modules[self._local_name] = module
Both lines cache the module, in the dictionaries self._parent_module_globals and sys.modules respectively, but under the key self._local_name (not self.__name__). This is the value we provided as local_name when creating the proxy module instance with __init__(). But what does this caching accomplish?
First, we can use the module under the given _local_name in the "parent module"'s globals (from the parameter's name, and from how MLflow uses it in its top-level __init__.py, we can infer that "globals" here means the result of globals()). This means that creating the proxy inside a function doesn't prevent the module from being used outside that function's scope:
if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader

    def load_heavy_module() -> None:
        # import the module inside a function
        heavy_module = LazyLoader("heavy_module", globals(), "lazyloading.heavy_module")
        print(heavy_module.HEAVY_ATTRIBUTE)

    # loads the heavy_module inside the function's scope
    load_heavy_module()

    # the module is now in the scope of this module
    print(heavy_module)
Running the package gives:
python -m lazyloading
I am heavier than Pytorch!
heavy
<module 'lazyloading.heavy_module' from '…'>  # the path of the heavy_module (a Python file)
Of course, if you pass locals() as the second argument instead, you'll get a NameError (give it a try!).
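The difference between the two namespaces can be sketched in isolation. The function and variable names below are made up for the demo. Writing into the dict returned by globals() creates a module-level name, while writing into the dict returned by locals() inside a function does not.

```python
def bind_with_globals() -> None:
    # globals() is the defining module's real namespace dict.
    globals()["made_by_globals"] = "visible"


def bind_with_locals() -> None:
    # Inside a function, writes to locals() do not create any visible name.
    locals()["made_by_locals"] = "invisible"


bind_with_globals()
bind_with_locals()

globals_worked = "made_by_globals" in globals()
locals_worked = "made_by_locals" in globals()
```

This is why _load() can publish the real module at module level only when it was handed the dict from globals().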
Second, we can also import the module in any other place inside the whole package with the given local name. Let’s create another module heavy_module_loader.py inside the current package lazyloading :
lazyloading/
├─ __init__.py
├─ __main__.py
├─ lazy_load.py
├─ heavy_module.py
├─ heavy_module_loader.py
Note that I used a custom name heavy_module_local for the local variable name of the proxy module.
# heavy_module_loader.py
from lazyloading.lazy_load import LazyLoader

heavy_module = LazyLoader("heavy_module_local", globals(), "lazyloading.heavy_module")
heavy_module.HEAVY_ATTRIBUTE
Now let __main__.py be simpler:
from lazyloading import heavy_module_loader

if __name__ == "__main__":
    import heavy_module_local

    print(heavy_module_local)
Your IDE will probably flag this import as unresolved, since no file named heavy_module_local exists, but actually running it gives us the expected result:
python -m lazyloading
I am heavier than Pytorch!
<module 'lazyloading.heavy_module' from '…'>  # the path of the heavy_module (a Python file)
Although MLflow seems to use the same string value for both local_name and name when creating LazyLoader instances, we can use the local_name as an alias for the actual package name, thanks to this caching mechanism.
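The aliasing trick works because the import statement consults sys.modules before searching for files. A minimal sketch, using the standard json module and a made-up alias name:

```python
import importlib
import json
import sys

# Register an existing module under a second, hypothetical name.
sys.modules["json_alias_for_demo"] = json

# No file called json_alias_for_demo.py exists; the cache entry is enough.
import json_alias_for_demo

alias_is_same = json_alias_for_demo is json
via_import_module = importlib.import_module("json_alias_for_demo") is json
real_name = json_alias_for_demo.__name__  # the module keeps its own __name__
```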
# Update this object's dict so that if someone keeps a reference to the `LazyLoader`,
# lookups are efficient (`__getattr__` is only called on lookups that fail).
self.__dict__.update(module.__dict__)
In Python, the __dict__ attribute holds the dictionary of an object's attributes. Updating the proxy module's attributes with the actual module's makes it easier for users to access the attributes of the real one. As we discussed in section 2 (2. Accessing an attribute - __getattribute__, __getattr__, and getattr) and as noted in the comments of the original source code, this lets __getattribute__ find the target attributes directly, without falling back to __getattr__.
In my view, this part is somewhat unnecessary, as we already cache modules and use them whenever their attributes are accessed. However, this could be useful when we need to debug and inspect __dict__.
Similar to __dict__, these two dunder methods (__dir__ and __repr__) might not be strictly necessary when using LazyLoader modules. However, they could be useful for debugging. __repr__ is particularly helpful as it is meant to indicate whether the module has been loaded.
def __repr__(self):
    if not self._module:
        return f"<module '{self.__name__} (Not loaded yet)'>"
    return repr(self._module)
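A detail worth double-checking in the listing: as shown, _load() returns the module without ever assigning it to self._module, so __repr__ would keep reporting "Not loaded yet" and the early return at the top of _load() would not be taken. The sketch below is a hypothetical variant (the class name `RecordingLazyLoader` is made up, and this is not MLflow's code) that stores the module before returning. The standard library's colorsys stands in for a heavy dependency.

```python
import importlib
import sys
import types


class RecordingLazyLoader(types.ModuleType):
    """Hypothetical variant of the LazyLoader above that remembers the loaded module."""

    def __init__(self, local_name: str, parent_module_globals: dict, name: str) -> None:
        self._local_name = local_name
        self._parent_module_globals = parent_module_globals
        self._module = None
        super().__init__(str(name))

    def _load(self) -> types.ModuleType:
        if self._module:
            return self._module
        module = importlib.import_module(self.__name__)
        self._parent_module_globals[self._local_name] = module
        sys.modules[self._local_name] = module
        self.__dict__.update(module.__dict__)
        self._module = module  # the one added line: record the loaded module
        return module

    def __getattr__(self, item: str):
        return getattr(self._load(), item)

    def __repr__(self) -> str:
        if not self._module:
            return f"<module '{self.__name__} (Not loaded yet)'>"
        return repr(self._module)


namespace: dict = {}  # plays the role of the parent module's globals()
proxy = RecordingLazyLoader("colorsys", namespace, "colorsys")
repr_before = repr(proxy)
loaded_function = proxy.rgb_to_hsv  # first attribute access triggers the import
repr_after = repr(proxy)
published = namespace["colorsys"] is sys.modules["colorsys"]
```

With the assignment in place, the two branches of __repr__ distinguish the states as the text describes.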
Although the source code itself is quite short, we covered several advanced topics, including importing modules, module scopes, and accessing object attributes in Python. Also, the concept of lazyloading is very common in computer science, but we rarely get the chance to examine how it is implemented in detail. By investigating how LazyLoader works, we learned more than we expected. Our biggest takeaway is that short code doesn’t necessarily mean easy code to analyze!