search
HomeBackend DevelopmentPython TutorialEight Python libraries to improve data science productivity!

Eight Python libraries to improve data science productivity!

1. Optuna

Optuna is an open source hyperparameter optimization framework that can automatically find the best hyperparameters for machine learning models.

The most basic (and probably well-known) alternative is sklearn's GridSearchCV, which will try multiple hyperparameter combinations and choose the best one based on cross-validation.

GridSearchCV will attempt combinations within the previously defined space. For example, for a random forest classifier, you might want to test the maximum depth of several different trees. GridSearchCV provides all possible values ​​for each hyperparameter and looks at all combinations.

Optuna uses its own history of attempts within the defined search space to determine which values ​​to try next. The method it uses is a Bayesian optimization algorithm called "Tree-structured Parzen Estimator".

This different approach means that instead of pointlessly trying every value, it looks for the best candidate before trying it, saving time that would otherwise be spent trying There is no hope for an alternative (and one that might also produce better results).

Finally, it is framework agnostic, which means you can use it with TensorFlow, Keras, PyTorch, or any other ML framework.

2. ITMO_FS

ITMO_FS is a feature selection library that can perform feature selection for ML models. The fewer observations you have, the more careful you need to be with too many features to avoid overfitting. By "prudent" I mean you should standardize your model. Usually a simpler model (fewer features) is easier to understand and interpret.

ITMO_FS algorithms are divided into 6 different categories: supervised filters, unsupervised filters, wrappers, hybrids, embedded, ensembles (although it mainly focuses on supervised filters).

A simple example of a "supervised filter" algorithm is to select features based on their correlation with a target variable. With "backward selection", you can try to remove features one by one and confirm how these features affect the model's predictive ability.

Here's a trivial example of how to use ITMO_FS and its impact on model scores:

>>> from sklearn.linear_model import SGDClassifier
>>> from ITMO_FS.embedded import MOS
>>> X, y = make_classification(n_samples=300, n_features=10, random_state=0, n_informative=2)
>>> sel = MOS()
>>> trX = sel.fit_transform(X, y, smote=False)
>>> cl1 = SGDClassifier()
>>> cl1.fit(X, y)
>>> cl1.score(X, y)
0.9033333333333333
>>> cl2 = SGDClassifier()
>>> cl2.fit(trX, y)
>>> cl2.score(trX, y)
0.9433333333333334

ITMO_FS is a relatively new library, so it's still a bit unstable, But I still recommend giving it a try.

3. shap-hypetune

So far we have seen libraries for feature selection and hyperparameter tuning, but why not use both at the same time? ? This is what shap-hypetune does.

Let’s start by understanding what “SHAP” is:

“SHAP (SHapley Additive exPlanations) is a game theory method for explaining any machine learning model The output of."

SHAP is one of the most widely used libraries for interpreting models, and it works by generating the importance of each feature to the model's final prediction.

On the other hand, shap-hypertune benefits from this approach to select the best features and also the best hyperparameters. Why do you want to merge them together? Selecting features and tuning hyperparameters independently may lead to suboptimal choices because their interactions are not considered. Doing both simultaneously not only takes this into account, but also saves some coding time (although the runtime may increase due to the increased search space).

Search can be done in 3 ways: grid search, random search, or Bayesian search (plus, it can be parallelized).

However, shap-hypertune only works on gradient boosting models!

4. PyCaret

PyCaret is an open source, low-code machine learning library that automates machine learning workflows. It covers exploratory data analysis, preprocessing, modeling (including interpretability), and MLOps.

Let’s take a look at some practical examples on their website to see how it works:

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')
# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')
# compare models
best = compare_models()

Eight Python libraries to improve data science productivity!

Just a few lines of code , it was possible to try multiple models and compare them across the main classification metrics.

It also allows the creation of a basic application to interact with the model:

from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,target = 'Purchase')
lr = create_model('lr')
create_app(lr)

Finally, API and Docker files can be easily created for the model:

from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,target = 'Purchase')
lr = create_model('lr')
create_api(lr, 'lr_api')
create_docker('lr_api')

It doesn't get any easier than this, right?

PyCaret is a very complete library, it is difficult to cover everything here, it is recommended that you download it now and start using it to understand some of its capabilities in practice.

5. floWeaver

FloWeaver can generate Sankey diagrams from streaming data sets. If you don’t know what a Sankey diagram is, here’s an example:

Eight Python libraries to improve data science productivity!

They are very useful when showing data for conversion funnels, marketing journeys, or budget allocations (Example above). The portal data should be in the following format: "source x target x value" Such a plot can be created with just one line of code (very specific, but also very intuitive).

6、Gradio

如果你阅读过敏捷数据科学,就会知道拥有一个让最终用户从项目开始就与数据进行交互的前端界面是多么有帮助。一般情况下在Python中最常用是 Flask,但它对初学者不太友好,它需要多个文件和一些 html、css 等知识。

Gradio 允许您通过设置输入类型(文本、复选框等)、功能和输出来创建简单的界面。尽管它似乎不如 Flask 可定制,但它更直观。

由于 Gradio 现在已经加入 Huggingface,可以在互联网上永久托管 Gradio 模型,而且是免费的!

7、Terality

理解 Terality 的最佳方式是将其视为“Pandas ,但速度更快”。这并不意味着完全替换 pandas 并且必须重新学习如何使用df:Terality 与 Pandas 具有完全相同的语法。实际上,他们甚至建议“import Terality as pd”,并继续按照以前的习惯的方式进行编码。

它快多少?他们的网站有时会说它快 30 倍,有时快 10 到 100 倍。

另一个重要是 Terality 允许并行化并且它不在本地运行,这意味着您的 8GB RAM 笔记本电脑将不会再出现 MemoryErrors!

但它在背后是如何运作的呢?理解 Terality 的一个很好的比喻是可以认为他们在本地使用的 Pandas 兼容的语法并编译成 Spark 的计算操作,使用Spark进行后端的计算。所以计算不是在本地运行,而是将计算任务提交到了他们的平台上。

那有什么问题呢?每月最多只能免费处理 1TB 的数据。如果需要更多则必须每月至少支付 49 美元。1TB/月对于测试工具和个人项目可能绰绰有余,但如果你需要它来实际公司使用,肯定是要付费的。

8、torch-handle

如果你是Pytorch的使用者,可以试试这个库。

torchhandle是一个PyTorch的辅助框架。它将PyTorch繁琐和重复的训练代码抽象出来,使得数据科学家们能够将精力放在数据处理、创建模型和参数优化,而不是编写重复的训练循环代码。使用torchhandle,可以让你的代码更加简洁易读,让你的开发任务更加高效。

torchhandle将Pytorch的训练和推理过程进行了抽象整理和提取,只要使用几行代码就可以实现PyTorch的深度学习管道。并可以生成完整训练报告,还可以集成tensorboard进行可视化。

from collections import OrderedDict
import torch
from torchhandle.workflow import BaseConpython
class Net(torch.nn.Module):
def __init__(self, ):
super().__init__()
self.layer = torch.nn.Sequential(OrderedDict([
('l1', torch.nn.Linear(10, 20)),
('a1', torch.nn.ReLU()),
('l2', torch.nn.Linear(20, 10)),
('a2', torch.nn.ReLU()),
('l3', torch.nn.Linear(10, 1))
]))

def forward(self, x):
x = self.layer(x)
return x

num_samples, num_features = int(1e4), int(1e1)
X, Y = torch.rand(num_samples, num_features), torch.rand(num_samples)
dataset = torch.utils.data.TensorDataset(X, Y)
trn_loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=0, shuffle=True)
loaders = {"train": trn_loader, "valid": trn_loader}
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = {"fn": Net}
criterion = {"fn": torch.nn.MSELoss}
optimizer = {"fn": torch.optim.Adam,
 "args": {"lr": 0.1},
 "params": {"layer.l1.weight": {"lr": 0.01},
"layer.l1.bias": {"lr": 0.02}}
 }
scheduler = {"fn": torch.optim.lr_scheduler.StepLR,
 "args": {"step_size": 2, "gamma": 0.9}
 }

c = BaseConpython(model=model,
criterion=criterion,
optimizer=optimizer,
scheduler=scheduler,
conpython_tag="ex01")
train = c.make_train_session(device, dataloader=loaders)
train.train(epochs=10)

定义一个模型,设置数据集,配置优化器、损失函数就可以自动训练了,是不是和TF差不多了。

The above is the detailed content of Eight Python libraries to improve data science productivity!. For more information, please follow other related articles on the PHP Chinese website!

Statement
This article is reproduced at:51CTO.COM. If there is any infringement, please contact admin@php.cn delete
详细讲解Python之Seaborn(数据可视化)详细讲解Python之Seaborn(数据可视化)Apr 21, 2022 pm 06:08 PM

本篇文章给大家带来了关于Python的相关知识,其中主要介绍了关于Seaborn的相关问题,包括了数据可视化处理的散点图、折线图、条形图等等内容,下面一起来看一下,希望对大家有帮助。

详细了解Python进程池与进程锁详细了解Python进程池与进程锁May 10, 2022 pm 06:11 PM

本篇文章给大家带来了关于Python的相关知识,其中主要介绍了关于进程池与进程锁的相关问题,包括进程池的创建模块,进程池函数等等内容,下面一起来看一下,希望对大家有帮助。

Python自动化实践之筛选简历Python自动化实践之筛选简历Jun 07, 2022 pm 06:59 PM

本篇文章给大家带来了关于Python的相关知识,其中主要介绍了关于简历筛选的相关问题,包括了定义 ReadDoc 类用以读取 word 文件以及定义 search_word 函数用以筛选的相关内容,下面一起来看一下,希望对大家有帮助。

归纳总结Python标准库归纳总结Python标准库May 03, 2022 am 09:00 AM

本篇文章给大家带来了关于Python的相关知识,其中主要介绍了关于标准库总结的相关问题,下面一起来看一下,希望对大家有帮助。

Python数据类型详解之字符串、数字Python数据类型详解之字符串、数字Apr 27, 2022 pm 07:27 PM

本篇文章给大家带来了关于Python的相关知识,其中主要介绍了关于数据类型之字符串、数字的相关问题,下面一起来看一下,希望对大家有帮助。

分享10款高效的VSCode插件,总有一款能够惊艳到你!!分享10款高效的VSCode插件,总有一款能够惊艳到你!!Mar 09, 2021 am 10:15 AM

VS Code的确是一款非常热门、有强大用户基础的一款开发工具。本文给大家介绍一下10款高效、好用的插件,能够让原本单薄的VS Code如虎添翼,开发效率顿时提升到一个新的阶段。

详细介绍python的numpy模块详细介绍python的numpy模块May 19, 2022 am 11:43 AM

本篇文章给大家带来了关于Python的相关知识,其中主要介绍了关于numpy模块的相关问题,Numpy是Numerical Python extensions的缩写,字面意思是Python数值计算扩展,下面一起来看一下,希望对大家有帮助。

python中文是什么意思python中文是什么意思Jun 24, 2019 pm 02:22 PM

pythn的中文意思是巨蟒、蟒蛇。1989年圣诞节期间,Guido van Rossum在家闲的没事干,为了跟朋友庆祝圣诞节,决定发明一种全新的脚本语言。他很喜欢一个肥皂剧叫Monty Python,所以便把这门语言叫做python。

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

Hot Tools

SublimeText3 Linux new version

SublimeText3 Linux new version

SublimeText3 Linux latest version

WebStorm Mac version

WebStorm Mac version

Useful JavaScript development tools

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SAP NetWeaver Server Adapter for Eclipse

SAP NetWeaver Server Adapter for Eclipse

Integrate Eclipse with SAP NetWeaver application server.

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use