Eight Python libraries to improve data science efficiency!
Optuna is an open source hyperparameter optimization framework that can automatically find the best hyperparameters for machine learning models.
The most basic (and probably best-known) alternative is sklearn's GridSearchCV, which tries multiple hyperparameter combinations and chooses the best one based on cross-validation.
GridSearchCV will attempt combinations within a previously defined space. For example, for a random forest classifier, you might want to test several different values for the maximum tree depth. You give GridSearchCV all the possible values for each hyperparameter, and it evaluates every combination.
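A minimal sketch of that workflow (the hyperparameter values below are only illustrative, not from the original article):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# Try every combination of the listed values and keep the best one
# according to cross-validated accuracy.
param_grid = {"max_depth": [3, 5, 10, None], "n_estimators": [50, 100, 200]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)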
Optuna will use its own history of attempts in the defined search space to determine the values to try next. The method it uses is a Bayesian optimization algorithm called "Tree-structured Parzen Estimator".
This different approach means that instead of blindly trying every value, it looks for the most promising candidate before trying it, saving time that would otherwise be spent on hopeless candidates (and it may also yield better results).
Finally, it is framework agnostic, which means you can use it with TensorFlow, Keras, PyTorch, or any other ML framework.
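A minimal sketch of how an Optuna study looks (the search space below is only illustrative; Optuna's default sampler is the TPE algorithm mentioned above):

import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

def objective(trial):
    # Optuna suggests the next values to try based on the history of previous trials.
    max_depth = trial.suggest_int("max_depth", 2, 32)
    n_estimators = trial.suggest_int("n_estimators", 50, 300)
    clf = RandomForestClassifier(max_depth=max_depth, n_estimators=n_estimators, random_state=0)
    return cross_val_score(clf, X, y, cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)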
ITMO_FS is a feature selection library for ML models. The fewer observations you have, the more careful you need to be about having too many features, to avoid overfitting. By "careful" I mean you should regularize your model. Usually a simpler model (fewer features) is easier to understand and interpret.
ITMO_FS algorithms fall into six categories: supervised filters, unsupervised filters, wrappers, hybrids, embedded methods, and ensembles (although the library mainly focuses on supervised filters).
A simple example of a "supervised filter" algorithm is selecting features based on their correlation with the target variable. With "backward selection", you try removing features one by one and check how each removal affects the model's predictive ability.
Here's a trivial example of how to use ITMO_FS and its impact on model scores:
>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import SGDClassifier
>>> from ITMO_FS.embedded import MOS

>>> X, y = make_classification(n_samples=300, n_features=10, random_state=0, n_informative=2)
>>> sel = MOS()
>>> trX = sel.fit_transform(X, y, smote=False)

>>> cl1 = SGDClassifier()
>>> cl1.fit(X, y)
>>> cl1.score(X, y)
0.9033333333333333

>>> cl2 = SGDClassifier()
>>> cl2.fit(trX, y)
>>> cl2.score(trX, y)
0.9433333333333334
ITMO_FS is a relatively new library, so it is still a bit unstable, but I still recommend giving it a try.
So far we have seen libraries for feature selection and hyperparameter tuning, but why not use both at the same time? This is what shap-hypetune does.
Let’s start by understanding what “SHAP” is:
“SHAP (SHapley Additive exPlanations) is a game theory method for interpreting the output of any machine learning model.”
SHAP is one of the most widely used libraries for model interpretation, and it works by computing how much each feature contributes to the model's final prediction.
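As a quick illustration (the model and data below are made up for the example), the typical SHAP workflow for a tree-based model looks something like this:

import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=10, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

# SHAP values: how much each feature pushes each prediction up or down.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global summary of feature importance across the whole dataset.
shap.summary_plot(shap_values, X)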
shap-hypetune, in turn, builds on this approach to select not only the best features but also the best hyperparameters. Why would you want to merge them? Selecting features and tuning hyperparameters independently can lead to suboptimal choices because their interactions are not considered. Doing both simultaneously takes those interactions into account and also saves some coding time (although the runtime may increase because of the larger search space).
The search can be done in 3 ways: grid search, random search, or Bayesian search (plus, it can be parallelized).
However, shap-hypetune only works with gradient boosting models!
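A rough sketch of how this could look with the library's BoostRFE class on a LightGBM model; the exact class names, arguments, and attributes here are assumptions to verify against shap-hypetune's own documentation:

from lightgbm import LGBMClassifier
from shaphypetune import BoostRFE  # assumed import path
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Recursive feature elimination combined with a search over the
# (illustrative) hyperparameter grid below, in a single fit.
param_grid = {"n_estimators": [100, 200], "learning_rate": [0.05, 0.1]}
model = BoostRFE(LGBMClassifier(), param_grid=param_grid, min_features_to_select=5)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)])
print(model.best_params_, model.n_features_)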
PyCaret is an open source, low-code machine learning library that can automate machine learning workflows. It covers exploratory data analysis, preprocessing, modeling (including interpretability), and MLOps.
Let’s take a look at some practical examples on their website to see how it works:
# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

# compare models
best = compare_models()
With just a few lines of code, multiple models have been tried and compared across the main classification metrics. PyCaret also allows you to create a basic application for interacting with the model:
from pycaret.datasets import get_data
juice = get_data('juice')

from pycaret.classification import *
exp_name = setup(data = juice, target = 'Purchase')
lr = create_model('lr')
create_app(lr)
Finally, API and Docker files can be easily created for the model:
from pycaret.datasets import get_data
juice = get_data('juice')

from pycaret.classification import *
exp_name = setup(data = juice, target = 'Purchase')
lr = create_model('lr')
create_api(lr, 'lr_api')
create_docker('lr_api')
It doesn't get easier than this, right?
PyCaret is a very complete library and it is difficult to cover everything here, so I recommend downloading it now and starting to use it to get a feel for its capabilities in practice.
FloWeaver can generate Sankey diagrams from a dataset of flows. If you don't know what a Sankey diagram is, here's an example:
They are very useful when showing data for conversion funnels, marketing journeys, or budget allocations (as in the example above). The input data should be in the format "source x target x value", and such a plot can be created with very little code (it is very specific, but also very intuitive).
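A rough sketch of what that looks like (the node groupings and names below are invented for illustration; the exact API is documented in floWeaver's quickstart):

import pandas as pd
from floweaver import ProcessGroup, Bundle, SankeyDefinition, weave

# Flows in "source x target x value" form.
flows = pd.DataFrame({
    "source": ["farm1", "farm1", "farm2"],
    "target": ["Mary", "James", "James"],
    "value": [5, 3, 8],
})

# Group the raw processes into the nodes we want to display.
nodes = {
    "farms": ProcessGroup(["farm1", "farm2"]),
    "customers": ProcessGroup(["Mary", "James"]),
}
bundles = [Bundle("farms", "customers")]
ordering = [["farms"], ["customers"]]

sdd = SankeyDefinition(nodes, bundles, ordering)
weave(sdd, flows).to_widget()  # renders the Sankey diagram in a notebook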
If you have read Agile Data Science, you know how helpful it is to have a front-end interface that lets end users interact with the data from the very start of a project. In Python, the most common choice is Flask, but it is not very beginner friendly: it requires multiple files and some knowledge of HTML, CSS, and so on.
Gradio lets you create simple interfaces just by defining the input types (text, checkboxes, etc.), the function, and the outputs. Although it seems less customizable than Flask, it is much more intuitive.
And since Gradio has now joined Hugging Face, a Gradio model can be hosted permanently on the internet, for free!
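A minimal sketch of a Gradio interface (the greeting function is just a placeholder):

import gradio as gr

def greet(name):
    return f"Hello, {name}!"

# inputs/outputs can be "text", "checkbox", "image", and so on.
gr.Interface(fn=greet, inputs="text", outputs="text").launch()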
The best way to understand Terality is to think of it as "pandas, but faster". That does not mean completely replacing pandas and relearning how to work with DataFrames: Terality has exactly the same syntax as pandas. In fact, they even suggest "import Terality as pd" and carrying on coding the way you always have.
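For example, a hedged sketch assuming Terality is installed and an account is configured (the lowercase module name follows normal Python conventions, and the file and column names are made up):

import terality as pd  # same API as pandas, per their suggestion

df = pd.read_csv("my_large_file.csv")       # hypothetical file
print(df.groupby("some_column").mean())     # hypothetical column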
How much faster is it? Their website sometimes says 30x faster, and sometimes 10 to 100x faster.
Another important point is that Terality allows parallelization and does not run locally, which means your 8GB-RAM laptop will never hit a MemoryError again!
But how does it work behind the scenes? A good way to think about Terality is that it takes the pandas-compatible syntax you write locally and compiles it into Spark computation operations, using Spark for the back-end computation. So the computation does not run locally; instead, the job is submitted to their platform.
So what's the catch? You can only process up to 1TB of data per month for free. If you need more, you have to pay at least $49 per month. 1TB/month may be more than enough for testing the tool and for personal projects, but if you need it for real company use, you will certainly have to pay.
If you are a PyTorch user, give this library a try.
torchhandle is a helper framework for PyTorch. It abstracts away PyTorch's tedious and repetitive training code, so that data scientists can focus on data processing, model building, and parameter optimization instead of writing repetitive training loops. With torchhandle, your code becomes more concise and readable, and your development work more efficient.
torchhandle abstracts and organizes PyTorch's training and inference process, so a deep learning pipeline can be built with just a few lines of code. It can also generate a complete training report and integrate with TensorBoard for visualization.
from collections import OrderedDict

import torch
from torchhandle.workflow import BaseContext


class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Sequential(OrderedDict([
            ('l1', torch.nn.Linear(10, 20)),
            ('a1', torch.nn.ReLU()),
            ('l2', torch.nn.Linear(20, 10)),
            ('a2', torch.nn.ReLU()),
            ('l3', torch.nn.Linear(10, 1))
        ]))

    def forward(self, x):
        x = self.layer(x)
        return x


# Toy dataset: 10,000 samples with 10 features each.
num_samples, num_features = int(1e4), int(1e1)
X, Y = torch.rand(num_samples, num_features), torch.rand(num_samples)
dataset = torch.utils.data.TensorDataset(X, Y)
trn_loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=0, shuffle=True)
loaders = {"train": trn_loader, "valid": trn_loader}

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Model, loss, optimizer and scheduler are passed as configuration dictionaries.
model = {"fn": Net}
criterion = {"fn": torch.nn.MSELoss}
optimizer = {"fn": torch.optim.Adam,
             "args": {"lr": 0.1},
             "params": {"layer.l1.weight": {"lr": 0.01},
                        "layer.l1.bias": {"lr": 0.02}}
             }
scheduler = {"fn": torch.optim.lr_scheduler.StepLR,
             "args": {"step_size": 2, "gamma": 0.9}
             }

# The context ties everything together and builds the training session.
c = BaseContext(model=model,
                criterion=criterion,
                optimizer=optimizer,
                scheduler=scheduler,
                context_tag="ex01")
train = c.make_train_session(device, dataloader=loaders)
train.train(epochs=10)
Define a model, set up the dataset, configure the optimizer and loss function, and training runs automatically. Pretty much like TF, right?