Extract large amounts of time series features with a small amount of code
Traditional machine learning algorithms cannot capture the temporal order of time series data. Data scientists therefore have to engineer features that summarize the important characteristics of a series as a set of metrics. Generating a large number of time series features and then picking out the relevant ones is time-consuming and tedious.
Python's tsfresh package can generate hundreds of standard features for time series data. This article discusses how to use the tsfresh package in depth.
tsfresh is an open source package that can generate hundreds of relevant time series features. Features generated from tsfresh can be used to solve classification, prediction, and outlier detection use cases.
The tsfresh package provides functions for feature engineering on time series data. This article covers three of them in turn: automatic feature generation, feature selection, and support for large amounts of data.
Installing tsfresh is simple. It can be installed with either pip or conda:
pip install -U tsfresh
# or
conda install -c conda-forge tsfresh
The tsfresh package provides an automatic feature generation API that can generate more than 750 features, covering a wide range of feature types, from a single time series variable.
Using the tsfresh.extract_features() function, 789 features can be generated from multiple domains for 1 time series variable.
import pandas as pd
import tsfresh

# Read the time series data; "date" stays a regular column so that column_sort can use it
df = pd.read_excel("train.xlsx", parse_dates=["date"])

# Automated feature generation
# NOTE: "id" is an assumed name for the column that identifies each series
features = tsfresh.extract_features(df, column_id="id", column_sort="date")
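To make the idea of automated feature generation more concrete, here is a minimal hand-rolled sketch using only the Python standard library. It is not how tsfresh is implemented. It only illustrates the general pattern: group long-format rows by a series id, order them by time, and reduce each series to named metrics. The sample rows, the helper `extract_basic_features`, and the choice of five features are assumptions made for illustration. The feature names imitate the `<column>__<feature>` naming style of tsfresh's output.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Long-format input: one row per observation (series id, timestamp, value).
# The data below is made up for illustration.
rows = [
    {"id": "a", "date": "2023-01-02", "value": 3.0},
    {"id": "a", "date": "2023-01-01", "value": 1.0},
    {"id": "a", "date": "2023-01-03", "value": 2.0},
    {"id": "b", "date": "2023-01-01", "value": 10.0},
    {"id": "b", "date": "2023-01-02", "value": 10.0},
]


def extract_basic_features(rows, column_id, column_sort, column_value):
    """Return {series id: {feature name: value}} for a few simple features."""
    # Sort by time first so that order-dependent features are meaningful.
    series = defaultdict(list)
    for row in sorted(rows, key=lambda r: r[column_sort]):
        series[row[column_id]].append(row[column_value])

    features = {}
    for series_id, values in series.items():
        # Absolute differences between consecutive observations.
        changes = [abs(b - a) for a, b in zip(values, values[1:])]
        features[series_id] = {
            f"{column_value}__mean": mean(values),
            # Population standard deviation.
            f"{column_value}__standard_deviation": pstdev(values),
            f"{column_value}__maximum": max(values),
            # Sum of squared values.
            f"{column_value}__abs_energy": sum(v * v for v in values),
            f"{column_value}__mean_abs_change": mean(changes) if changes else 0.0,
        }
    return features


features = extract_basic_features(rows, "id", "date", "value")
for series_id, feats in features.items():
    print(series_id, feats)
```

Each series ends up as one row of metrics, which is the shape a traditional machine learning model can consume. tsfresh automates the same reduction for hundreds of feature calculators.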
There are too many features to describe here; see the official documentation for a detailed description of each one.
The tsfresh package also provides a feature selection implementation based on hypothesis testing, which can be used to identify the features that are relevant to the target variable. To limit the number of irrelevant features, tsfresh includes the FRESH algorithm (FRESH stands for feature extraction based on scalable hypothesis tests).
Users can perform feature selection with the tsfresh.select_features() function.
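The sketch below shows how hypothesis-test-based selection can work in principle. It assumes that a p-value has already been computed for each feature against the target, then applies the Benjamini-Hochberg procedure to decide which features to keep while controlling the false discovery rate. Benjamini-Hochberg is used here as an illustrative stand-in. The tests and the correction procedure that tsfresh applies internally may differ. The p-values, the helper `select_by_fdr`, and the `fdr_level` default are assumptions made for the example.

```python
def select_by_fdr(p_values, fdr_level=0.05):
    """Keep the features that pass the Benjamini-Hochberg threshold.

    p_values maps feature name -> p-value of a test of that feature
    against the target (smaller means stronger evidence of relevance).
    """
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)
    cutoff = 0
    # Find the largest rank k whose p-value satisfies p <= (k / m) * fdr_level.
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= rank / m * fdr_level:
            cutoff = rank
    # All features ranked at or above the cutoff are kept.
    return [name for name, _ in ranked[:cutoff]]


# Hypothetical p-values for four extracted features.
p_values = {
    "value__mean": 0.001,
    "value__maximum": 0.012,
    "value__median": 0.04,
    "value__length": 0.6,
}

selected = select_by_fdr(p_values)
print(selected)
```

With these made-up numbers, `value__median` has a p-value below 0.05 but is still dropped. That is the point of correcting for multiple tests: when hundreds of features are tested at once, a plain 0.05 threshold would let many irrelevant features through by chance.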
For large amounts of time series data, tsfresh also provides APIs that scale feature generation/extraction and feature selection.
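Feature extraction scales well because the features of one series do not depend on any other series, so the work can be split up and mapped over independent workers. The following standard-library sketch illustrates that pattern only. It is not tsfresh's API, and the helper names (`group_by_id`, `features_for_series`, `extract_distributed`) and the sample data are hypothetical.

```python
from collections import defaultdict


def group_by_id(rows):
    """Group (series_id, value) pairs into one list of values per series."""
    series = defaultdict(list)
    for series_id, value in rows:
        series[series_id].append(value)
    return list(series.items())


def features_for_series(item):
    """Compute features for one series, independently of all other series."""
    series_id, values = item
    return series_id, {"mean": sum(values) / len(values), "maximum": max(values)}


def extract_distributed(rows, map_fn=map):
    """Apply features_for_series to every series with any map-like function."""
    return dict(map_fn(features_for_series, group_by_id(rows)))


# Made-up data: two series, "a" and "b".
rows = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0), ("b", 30.0)]

# The built-in map runs sequentially; a parallel map can be swapped in.
result = extract_distributed(rows)
print(result)
```

Because the per-series function is self-contained, the built-in `map` could be replaced by the `map` method of a `concurrent.futures.ProcessPoolExecutor`, or by a cluster framework's equivalent, without changing the feature code. That is the general idea behind distributing this kind of workload.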
In summary, tsfresh can generate and select relevant time series features in a few lines of Python code. It automatically extracts and selects from 750 field-tested features spanning multiple domains of time-based data, saving data scientists much of the time otherwise spent on feature engineering.
Time series data can also be very large. tsfresh parallelizes its work across multiple processes and supports dask and spark for data samples that are too large to be processed on a single machine.