search
HomeSystem TutorialLINUXWill machine learning put database management system operators out of work?
Will machine learning put database management system operators out of work?Jan 13, 2024 pm 06:24 PM
linuxlinux tutorialRed Hatlinux systemlinux commandlinux certificationred hat linuxlinux video

Introduction The database management system (DBMS) is the most important part of any data-intensive application system. They can handle large amounts of data and complex workloads. But they are difficult to manage because they have hundreds or thousands of configuration "knobs" that control factors such as the amount of memory used for caching and how often data is written to the storage device. Organizations often hire experts to help fine-tune their campaigns, but experts are prohibitively expensive for many businesses.

This article was written by three guests from Carnegie Mellon University: Dana Van Aken, Andy Pavlo and Geoff Gordon Co-authored article. This project demonstrates how academic researchers can use the AWS Cloud Credits for Research Program (https://aws.amazon.com/research-credits/) to support their scientific breakthroughs.

OtterTune is a new tool developed by students and researchers in the Carnegie Mellon University Database Group (http://db.cs.cmu.edu/projects/autotune/) that automatically Configure button to find the appropriate setting. The goal is to make it easier for anyone to deploy a DBMS, even those with no expertise in database administration.

OtterTune is different from other DBMS configuration tools because it fully leverages the knowledge gained from tuning previously deployed DBMSs to tune newly deployed DBMSs. This significantly reduces the time and resources required to tune a newly deployed DBMS. To this end, OtterTune maintains a database containing tuning data collected from previous tuning sessions. It uses this data to build machine learning models that capture information about how the DBMS reacts to different configurations. OtterTune uses these models to guide users when trying out new applications, suggesting settings that improve specific goals, such as reducing latency or increasing throughput.

In this article we explore each component of OtterTune’s machine learning pipeline and demonstrate how they relate to each other to tune the configuration of your DBMS. We then evaluated OtterTune's performance on MySQL and Postgres by comparing the performance of its optimal configurations with configurations chosen by database administrators (DBAs) and other automated tuning tools.

OtterTune is an open source tool developed by students and researchers in the Database Group at Carnegie Mellon University. All code is placed on GitHub (https://github.com/cmu-db/ottertune) and is released under the Apache License 2.0.

working principle

The picture below shows the components and workflow of OtterTune.

Will machine learning put database management system operators out of work?

At the beginning of a new tuning session, the user tells OtterTune which specific goal to optimize for (such as latency or throughput). The client controller connects to the target DBMS and collects the Amazon EC2 instance type and current target.

The controller then begins the first observation period, during which it observes the DBMS and logs specific targets. After the observation period is over, the controller collects internal metrics from the DBMS, such as MySQL counts of pages read from disk and pages written to disk. The controller returns both specific goals and internal metrics to the tuning manager.

After OtterTune's tuning manager receives the metrics, it stores them in the database. OtterTune uses the results to calculate the next configuration the controller should be installed on the target DBMS. The tuning manager returns this configuration to the controller and estimates the expected improvements through actual runs. The user can decide to continue the tuning session or terminate it.

illustrate

OtterTune maintains a button blacklist for each DBMS version it supports. The blacklist includes buttons that do not need to be tuned (such as the path name of the DBMS storage file), or buttons that may have serious or hidden consequences (such as may cause the DBMS to lose data). At the beginning of each tuning session, OtterTune provides users with a blacklist so they can add any other buttons they want OtterTune to avoid tuning.

OtterTune makes certain assumptions that may limit its usefulness to some users. For example, it assumes that the user has administrator rights, allowing the controller to modify the DBMS configuration. If the user does not have administrator rights, they can deploy a second copy of the database to other hardware for OtterTune tuning experiments. This requires users to replay workload traces or forward queries from a production-grade DBMS. For a complete discussion of assumptions and limitations, see our paper (http://db.cs.cmu.edu/papers/2017/tuning-sigmod2017.pdf).

Machine Learning Pipeline

The diagram below shows how data is processed as it passes through OtterTune’s machine learning pipeline. All observations are placed in OtterTune's database.

OtterTune first transmits the observation results to the Workload Characterization component. This component identifies a small set of DBMS metrics that most accurately capture performance changes and unique characteristics of different workloads.

Next, the Knob Identification component generates a button sorting list, listing the buttons that have the greatest impact on the performance of the DBMS. OtterTune then feeds all this information to Automatic Tuner. This component maps the workload of the target DBMS to the most similar workload in the data repository and reuses the workload data to generate a more appropriate configuration.

Will machine learning put database management system operators out of work?Will machine learning put database management system operators out of work?

Now let’s dive into each component of the machine learning pipeline.

Workload Characterization: OtterTune uses the DBMS's internal runtime metrics to describe the behavioral characteristics of the workload. These metrics accurately represent the workload because they capture many aspects of runtime behavior. However, many metrics are redundant: some are the same metric value recorded in different units, and others represent independent parts of the DBMS where values ​​are highly correlated. Streamlining redundant metrics is important because it reduces the complexity of the machine learning models that use them. To this end, we divide DBMS metrics into clusters based on correlation patterns. We then select a representative metric from each cluster, specifically the one closest to the cluster center. Subsequent components in the machine learning pipeline use these metrics.

Knob Identification: A DBMS may have hundreds of buttons, but only a small number of buttons affect the performance of the DBMS. OtterTune uses a popular feature selection technique called Lasso to decide which buttons significantly affect the overall performance of your system. By applying this technique to data in a database, OtterTune can identify the DBMS' button order of importance.

OtterTune then has to decide how many buttons to use in the proposed configuration. Using too many buttons greatly increases OtterTune's optimization time. Using too few buttons prevents OtterTune from finding the optimal configuration. To automate this process, OtterTune uses an incremental approach. It gradually increases the number of buttons used in a tuning session. This approach allows OtterTune to explore and optimize configurations for a small set of the most important buttons, and then expand the scope to consider additional buttons.

Automatic Tuner: The Automated Tuning component determines which configuration OtterTune should recommend by performing a two-step analysis after each observation period.

First, the system identifies the workload from a previous tuning session that best represents the target DBMS workload using performance data for the metrics identified in the Workload Characterization component. It compares the session's metrics to metrics from previous workloads to see which ones respond similarly to different button settings.

Then, OtterTune selects another button configuration to give it a try. It fits statistical models to the data that has been collected, as well as data from the most similar workloads in the repository. This model allows OtterTune to predict how the DBMS will perform using every possible configuration. OtterTune optimizes the next configuration to strike a balance between exploration (gathering information to improve the model) and exploitation (performing as well as possible on a specific metric).

accomplish

OtterTune is written in Python.

As far as the two components of Workload Characterization and Knob Identification are concerned, runtime performance is not the main issue to worry about, so we used scikit-learn to implement the corresponding machine learning algorithm. These algorithms run in a background process and will integrate new data as soon as it becomes available in OtterTune's database.

As for the Automatic Tuner, the machine learning algorithm is on the critical path. They run after each observation period, integrating new data so that OtterTune can select a button configuration to try next. Since performance is a consideration, we implemented these algorithms using TensorFlow.

To collect data on DBMS hardware, button configurations and runtime performance metrics, we integrated OtterTune's controller with the OLTP-Bench benchmarking framework.

Evaluate

For evaluation, we compared the best configuration selected by OtterTune with the following configurations for MySQL and Postgres performance:

  • Default: Configuration provided by DBMS
  • Tuning script: Configuration generated by open source tuning advisory tool
  • DBA: Database Administrator Generated Configuration
  • RDS: Configuration customized for DBMS, managed by Amazon RD, deployed on EC2 instances of the same type.

We conducted all experiments on Amazon EC2 Spot Instances. We conducted each trial on two instances: one for OtterTune's controller and another for the deployed target DBMS system. We used m4.large and m3.xlarge instance types respectively. We deployed OtterTune's tuning manager and data database on a local server equipped with 20 cores and 128GB of memory.

We used the TPC-C workload, which is the industry standard for evaluating the performance of online transaction processing (OLTP) systems.

We measured latency and throughput against each database we used in our experiments: MySQL and Postgres. The following figures show the results. The first graph shows the amount of 99th percentile latency, which represents the "worst case" time it takes for a transaction to complete. The second graph shows the results for throughput, measured as the average number of transactions completed per second.

MySQL results:

Will machine learning put database management system operators out of work?

Comparing the optimal configuration generated by OtterTune with the configuration generated by the tuning script and RDS, you will find that if you use OtterTune configuration, MySQL's latency is reduced by about 60% and the throughput is increased by 35%. OtterTune also produces configurations with results that are as good as those chosen by the database administrator.

A few MySQL buttons have a significant impact on the performance of TPC-C workloads. The configurations generated by OtterTune and the database administrator provide good settings for each of these buttons. RDS performed a little less well due to providing sub-optimal settings for one button. The configuration of the tuning script performed the worst since only one button was modified.

Results for Postgres:

Will machine learning put database management system operators out of work?

In terms of latency, OtterTune, tuning tools, database management, and RDS-generated configurations all offer similar improvements over Postgres' default settings. We can probably attribute this to the overhead required for the round trip between the OLTP-Bench client and the DBMS over the network. As for throughput, if you use the configuration recommended by OtterTune, Postgres performance is about 12% higher than the configuration selected by the database administrator and tuning script, and about 32% higher than RDS.

Similar to MySQL, only a few buttons have a significant impact on Postgres performance. OtterTune, database administrators, tuning scripts, and RDS-generated configurations all modify these buttons, and most provide pretty good settings.

Conclusion

OtterTune automates the process of finding the appropriate settings for a DBMS configuration button. To tune a newly deployed DBMS, it reuses training data collected from previous tuning sessions. Because OtterTune does not require the generation of initial data sets for training machine learning models, tuning time is significantly reduced.

What’s next? To accommodate the growing popularity of DBaaS deployments that do not have remote access to the DBMS's host system, OtterTune will soon be able to automatically detect the hardware capabilities of the target DMBS without requiring remote access.

For more details on OtterTune, see our paper or the code on GitHub. Please stay tuned to this website (http://ottertune.cs.cmu.edu/), we will soon launch OtterTune, an online tuning service.
About the Author:

Dana Van Aken is a doctoral student in computer science at Carnegie Mellon University, mentored by Dr. Andrew Pavlo.

Andy Pavlo is an assistant professor of database science in the Department of Computer Science at Carnegie Mellon University.

Geoff Gordon is an associate professor and associate director of education in the Machine Learning Department at Carnegie Mellon University.

The above is the detailed content of Will machine learning put database management system operators out of work?. For more information, please follow other related articles on the PHP Chinese website!

Statement
This article is reproduced at:Linux就该这么学. If there is any infringement, please contact admin@php.cn delete
什么是linux设备节点什么是linux设备节点Apr 18, 2022 pm 08:10 PM

linux设备节点是应用程序和设备驱动程序沟通的一个桥梁;设备节点被创建在“/dev”,是连接内核与用户层的枢纽,相当于硬盘的inode一样的东西,记录了硬件设备的位置和信息。设备节点使用户可以与内核进行硬件的沟通,读写设备以及其他的操作。

Linux中open和fopen的区别有哪些Linux中open和fopen的区别有哪些Apr 29, 2022 pm 06:57 PM

区别:1、open是UNIX系统调用函数,而fopen是ANSIC标准中的C语言库函数;2、open的移植性没fopen好;3、fopen只能操纵普通正规文件,而open可以操作普通文件、网络套接字等;4、open无缓冲,fopen有缓冲。

linux中什么叫端口映射linux中什么叫端口映射May 09, 2022 pm 01:49 PM

端口映射又称端口转发,是指将外部主机的IP地址的端口映射到Intranet中的一台计算机,当用户访问外网IP的这个端口时,服务器自动将请求映射到对应局域网内部的机器上;可以通过使用动态或固定的公共网络IP路由ADSL宽带路由器来实现。

linux中eof是什么linux中eof是什么May 07, 2022 pm 04:26 PM

在linux中,eof是自定义终止符,是“END Of File”的缩写;因为是自定义的终止符,所以eof就不是固定的,可以随意的设置别名,linux中按“ctrl+d”就代表eof,eof一般会配合cat命令用于多行文本输出,指文件末尾。

linux怎么判断pcre是否安装linux怎么判断pcre是否安装May 09, 2022 pm 04:14 PM

在linux中,可以利用“rpm -qa pcre”命令判断pcre是否安装;rpm命令专门用于管理各项套件,使用该命令后,若结果中出现pcre的版本信息,则表示pcre已经安装,若没有出现版本信息,则表示没有安装pcre。

什么是linux交叉编译什么是linux交叉编译Apr 29, 2022 pm 06:47 PM

在linux中,交叉编译是指在一个平台上生成另一个平台上的可执行代码,即编译源代码的平台和执行源代码编译后程序的平台是两个不同的平台。使用交叉编译的原因:1、目标系统没有能力在其上进行本地编译;2、有能力进行源代码编译的平台与目标平台不同。

linux中rpc是什么意思linux中rpc是什么意思May 07, 2022 pm 04:48 PM

在linux中,rpc是远程过程调用的意思,是Reomote Procedure Call的缩写,特指一种隐藏了过程调用时实际通信细节的IPC方法;linux中通过RPC可以充分利用非共享内存的多处理器环境,提高系统资源的利用率。

linux怎么查询mac地址linux怎么查询mac地址Apr 24, 2022 pm 08:01 PM

linux查询mac地址的方法:1、打开系统,在桌面中点击鼠标右键,选择“打开终端”;2、在终端中,执行“ifconfig”命令,查看输出结果,在输出信息第四行中紧跟“ether”单词后的字符串就是mac地址。

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

Hot Tools

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

mPDF

mPDF

mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),

ZendStudio 13.5.1 Mac

ZendStudio 13.5.1 Mac

Powerful PHP integrated development environment

Atom editor mac version download

Atom editor mac version download

The most popular open source editor

EditPlus Chinese cracked version

EditPlus Chinese cracked version

Small size, syntax highlighting, does not support code prompt function