Home  >  Article  >  Database  >  MYSQL advanced for big data learning

MYSQL advanced for big data learning

coldplay.xixi
coldplay.xixiforward
2021-01-27 09:48:442310browse

MYSQL advanced for big data learning

Free learning recommendations: mysql video tutorial

Article Directory

  • 1 Several aspects that affect performance
    • 1.1 Hardware aspects
    • 1.2 Server system
    • 1.3 Database storage Engine selection
    • 1.4 Database parameter configuration
    • 1.5 Database structure design and SQL statements (key points)
  • 2 Hardware aspects
    • 2.1 CPU resources and available memory size
      • 2.1.1 How to choose CPU
      • 2.1.2 Memory
        • 2.1.2.1 Commonly used MySQL storage engines
        • 2.1.2.2 Tips
        • 2.1.2.3 How to choose memory
    • 2.2 Disk configuration and selection
      • 2.2.1 Using traditional machine hard disks
      • 2.2.2 Using RAID to enhance the performance of traditional machine hard disks
        • 2.2.2.1 What is RAID
        • 2.2.2.2 RAID Level
          • 2.2.2.2.1 RAID 0
          • 2.2.2.2. 2 RAID 1
          • 2.2.2.2.3 RAID 5 - Common RAID groups
          • 2.2.2.2.4 RAID 10 - Common RAID groups
        • 2.2.2.3 Selection of RAID level
      • 2.2.3 Using solid-state storage SSD and PCIe card
      • 2.2.4 Using network storage NAS and SAN
        • 2.2.4.1 Network storage usage scenarios
        • 2.2.4.2 Network performance limitations
        • 2.2.4.3 Network impact on performance
    • 2.3 Summary
  • 3 Impact of operating system on performance
    • 3.1 CentOS system Parameter optimization
  • 4 The impact of file system on performance
  • 5 MySQL architecture

1 Several aspects that affect performance

1.1 Hardware aspect

Usually personal computers are slow, and we all say it is because of computer hardware problems, usually factors such as CPU, memory, disk IO, etc. , so this problem will also occur on the server.

1.2 Server system

Generally, the operating system of personal computers is windows. The performance of different versions of windows systems is different, or certain parameters are configured to cause performance s difference. This is the same for server systems, and parameter settings will also affect server performance.

1.3 Selection of database storage engine

MySQL has a plug-in storage engine, and different storage engines can be selected according to different business needs.
Different storage engines also have different characteristics:

  • MyISAM: does not support transactions and table-level locks.
  • InnoDB: transaction-level storage engine, perfectly supports row-level locks and transaction ACID features.

1.4 Database parameter configuration

For different storage engines, their parameter configurations are different. Some parameters have minimal impact on the storage engine, but Some parameters play a decisive role in performance. Therefore, it is also important to optimize parameters based on the selected storage engine and different business needs.

1.5 Database structure design and SQL statements (key points)

When we design the database structure, we should consider what kind of sql statements we will execute on the database in the future. , to query and update the table structure. Only in this way can a table structure that meets the requirements be designed.
For slow queries, it is the main culprit of low performance, and it is caused by our unreasonable design of the database table structure. This type of SQL is also the most difficult to optimize, because once the project is online, it is difficult to modify the database table structure.

Therefore, our focus on optimizing database performance is:

  • Database table structure design

  • Preparation and optimization of SQL statements

The following is a detailed description of each aspect.

2 Hardware aspects

2.1 CPU resources and available memory size

2.1.1 How to choose CPU

Usually when choosing a CPU, we all hope that the frequency and number of cores of the CPU are both as high as possible, but due to cost or various factors, we are often forced to choose only one of them. So how should we choose the best solution? Therefore, we need to pay attention to several issues when purchasing a CPU:

  • Is our application CPU intensive?
    • If our application is CPU-intensive, to speed up sql processing, obviously we need better CPUs, not more CPUs.
    • For the current MySQL, duoCPU does not support concurrent processing of the same SQL.
  • What is the concurrency of our system?
    • If our system needs more throughput, then the more CPUs we have, the better. Assuming we have 40 CPUs, can we process 40 SQLs at the same time?
    • Measurement of database processing capabilities: QPS, which refers to the number of SQLs processed simultaneously. But this indicator is the number of SQLs processed in 1s, but the simultaneous processing explained in the previous point is in the nanosecond dimension.
    • MySQL is usually used in web applications, and the amount of concurrency is often relatively large. At this time, the number of CPUs is more important than the CPU frequency.
  • The version of MySQL we use
    • Before version 5.0, MySQL did not have good support for multi-core CPUs, and the restrictions on the system were very serious. In the current 5.6 and 5.7 versions, the support for multi-core CPUs has been greatly improved. Therefore, it is recommended to use the latest version of MySQL to achieve better performance.
  • Choose 32-bit or 64-bit CPU?
    • Currently, server CPUs are all 64-bit architecture by default, but be careful to check whether the system has a 32-bit server version installed on top of the 64-bit system. This will seriously affect server performance.

2.1.2 Memory

The size of the memory directly affects the performance of the database. Memory is currently much more efficient than disk. Therefore, caching data into memory can greatly improve server performance.

2.1.2.1 Commonly used MySQL storage engines

There are two commonly used storage engines: MyISAM and InnoDB.
MyISAM:
The index is stored in memory and the data is saved on the hard disk.
MYSQL advanced for big data learning
InnoDB:
Indexes and data are stored in memory, thereby improving the operating efficiency of the database.
MYSQL advanced for big data learning

2.1.2.2 Tips

  1. Although the more memory, the better, the impact on system performance is limited.
    If the data in our database is 100G, then the maximum performance can be achieved by selecting the memory around 128G. At this time, if all the data is hot data, it will be cached in the memory. There is no need to use 256G memory. However, choosing larger memory will also improve the performance of other services such as the operating system, and there is no need to consider upgrading the memory in the short term.
  2. For memory cache write operations, you can delay writing to reduce the pressure on the database.
    Memory already has good support for read operations, and write operations can also be completed in memory. In the end, we need to write data to the disk. Although we cannot avoid the operation of writing to the disk, we can Delay write operations and merge multiple writes into one write to reduce the pressure on the database. The database provides a similar function, which can merge multiple write operations into one in the cache pool and finally write them to the disk.

2.1.2.3 How to choose memory

  1. Try to use the memory that the motherboard can support the maximum frequency

  2. To form a purchase upgrade, the memory of each channel should be of the same brand, particle size, frequency, voltage, verification technology and model.
  3. Select memory based on database size.

2.2 Disk configuration and selection

Although memory plays a big role in database performance, we cannot ignore the impact of the IO subsystem on performance. . At present, we commonly use the following 4 types of disk options:

2.2.1 Using traditional machine hard drives

Features: Large storage space, low price, most used, most common , reading and writing are slow

  • How to choose a traditional machine hard disk?
  1. Storage capacity
  2. Transmission speed
  3. Access time
  4. Spindle speed
  5. Physical size

2.2.2 Use RAID to enhance the performance of traditional machine hard drives

2.2.2.1 What is RAID

RAID is disk redundancy The abbreviation of Redundant Arrays of Independent Disks. Simply put, the function of RAID is to combine multiple disks with smaller capacity into a group of disks with larger capacity and provide data redundancy to ensure data integrity.

2.2.2.2 RAID Level

##2.2.2.2.1 RAID 0

RAID 0 is the earliest RAID mode, also called data striping. It is the simplest form among component disk arrays. It only requires more than 2 hard disks. It is low cost and can improve the performance and throughput of the entire disk. RAID 0 does not provide redundancy or error recovery capabilities, but is the lowest cost to implement. However, when considering data recovery and reliability factors, RAID 0 has become the most expensive configuration, because there is no redundancy in RAID 0, and the probability of data damage is higher than in a single disk. Because data damage in any disk will cause data loss. For example, a RAID 0 consisting of three disks is three times more likely to be damaged than a single hard disk.
Therefore, RAID 0 is suitable for situations where no single data will be lost, such as: a standby database that can be cloned from other databases at any time or some databases that only need to be used once.
MYSQL advanced for big data learning
Simply put, RAID 0 is to connect hard disks in series to form a larger disk, such as:
MYSQL advanced for big data learning
And in the concurrent process, it can reach the equivalent of 3x the performance of a single hard drive.

2.2.2.2.2 RAID 1

RAID 1 is also called disk mirroring. The principle is to mirror the data of one disk to another disk, that is to say, the data While writing to one disk, an image file will be generated on another restricted disk to ensure the reliability and repairability of the system to the greatest extent without affecting performance.
The difference between it and RAID 0 is that an equal sign is drawn in the middle. The data on both disks are the same and have good redundancy capabilities, but the cost will increase accordingly. When a disk failure occurs, it can run normally, but the failed disk needs to be replaced, otherwise the system will crash. MYSQL advanced for big data learning After replacing a new disk, data synchronization will take a lot of time. Although it will not affect data access, the system performance will be reduced.
RAID 1 can provide good
read
performance in many cases, and redundant data between different disks, so the data redundancy is very good. RAID 1 is better at reading than RAID 0, so it is more suitable for storing logs or similar tasks. 2.2.2.2.3 RAID 5 - Common RAID group

RAID 5 is also called a distributed parity disk array. The
data is spread across multiple disks

through distributed parity blocks, so that if any disk data fails, it can be reconstructed from the parity blocks. But if two disks fail, the entire volume's data cannot be recovered.
It can be seen that each disk has Dp, Cp, Bp, and Ap respectively. If there is a problem with one of the disks, the disk can be recalculated based on the data and parity values ​​of the other three disks. The data. MYSQL advanced for big data learning For RAID 0 and RAID 1, this is the most economical redundant configuration, because the entire array configuration only requires the capacity of one disk.
Writes are slower on RAID 5 because each write requires 2 reads and 2 writes between disks to calculate the value of the stored parity digit, however, both random and sequential reads are fast , because there is no need to calculate parity bits when reading, so RAID 5 is more suitable for read-oriented database services.
The biggest problem that occurs with RAID 5 is when the disk fails, because the data needs to be reallocated to other disks, which will seriously affect the performance of the disk, so it is best to use RAID 5 in the case of re-reading.

2.2.2.2.4 RAID 10 - Commonly used RAID groups

RAID 10 is also called sharded mirroring. It first performs RAID 1 on the disks and then performs RAID 0 on the two sets of RAID 1 disks, so it has good performance in reading and writing. Compared with RAID 5, it is easier to rebuild and faster.


On RAID 10, if one hard disk is damaged, it will have a serious impact on performance, because during the read and write process, two adjacent disks can be read at the same time. If one is damaged, then only reads can be made from a single disk, so in the worst case, our performance will be reduced by 50%. MYSQL advanced for big data learning

2.2.2.3 Selection of RAID level

##RAID 0Cheap, fast, dangerousNoNfastfastRAID 1High-speed reading, simple and safeYes2FastSlowRAID 5Security, cost Trade-offhasN 1fastdepends on the slowest diskRAID 10 Expensive, high-speed, safeHave2Nfastfast

2.2.3 Using solid-state storage SSD and PCIe cards

Solid-state storage is also called flash memory.
Features:

  • Compared with mechanical disks, solid-state disks have better random read and write performance
  • Compared with mechanical disks, solid-state disks have better support for concurrency
  • Compared with mechanical disks, solid-state disks are more susceptible to damage

SSD features:

  • Using the SATA interface, traditional disks can be replaced without any changes
  • SATA interface SSD also supports RAID technology

Features of solid-state storage PCIe card:

  • The SATA interface cannot be used and requires unique drivers and configuration
  • The price is more expensive than SSD, but the performance is better than SSD

Use scenarios of solid-state storage

  • Suitable for situations where there is a large amount of random I/O Scenario
  • Use to solve the I/O bottleneck of single-threaded load

2.2.4 Use network storage NAS and SAN

SAN( Strorage Area Network) and NAS (Network-Attached Storage) are two methods for mounting external file storage devices to the server.

SAN:
The SAN device is connected to the server through optical fiber. The device is accessed through the block interface, and the server can use it as a hard disk.

MYSQL advanced for big data learning
Characteristics of SAN:
MYSQL advanced for big data learning
NAS:
NAS devices use network connections through file-based protocols such as NFS or SMB to access.

2.2.4.1 Scenarios for using network storage

Suitable for database backup.

2.2.4.2 Limitations of network performance

The limitations of network performance are mainly latency and bandwidth.

2.2.4.3 The impact of network on performance

  • The impact of network bandwidth on performance
  • The impact of network quality on performance
    Recommendation:
    • Use high-performance and high-bandwidth network interface equipment and switches
    • Bind multiple network cards to enhance availability and bandwidth
    • Isolate the network as much as possible

2.3 Summary

CPU:

  • 64-bit CPU must work in 64-bit Under the system
  • For scenarios with relatively high concurrency, the number of CPUs is more important than frequency
  • For CPU-intensive scenarios and complex SQL, the higher the frequency, the better

Memory:

  • Choose the memory with the highest frequency that the motherboard can use
  • The size of the memory is very important to performance, so make it as large as possible

I /O subsystem:

  • PCIe -> SSD -> RAID10 -> Disk -> SAN

3 Impact of operating system on performance

Suitable operating systems for MySQL: Windows, FreeBSD, Solaris, Linux

3.1 CentOS system parameter optimization

Kernel related parameters (/ etc/sysctl.conf)

  1. net.core.somaxconn = 65535
    For a port in a listening state, it has its own listening queue. This parameter determines the maximum listening queue of each port. length. The default value of this parameter may be relatively small, which is not enough for large servers. It is generally modified to a value of 2048 or larger.
  2. net.core.netdev_max_backlog=65535
    ##net.ipv4.tcp_max_syn_backlog=65535 The backlog parameter determines the amount of data received on each network interface. When the packet rate is faster than the kernel processor, the maximum number of packets allowed to be sent to the queue, and another parameter determines whether these requests that have not yet obtained the other party's connection can be saved in the queue. the maximum number. Connections exceeding this value may be discarded, so increase the size at the same time.
  3. net.ipv4.tcp_fin_timeout = 10 This parameter is used to control the timeout of the waiting state for tcp connection processing. For systems with relatively frequent connections, a large number of connections are usually in a waiting state. The setting of this parameter is to reduce the connection timeout time and speed up TCP recycling. There are also the following two parameters that have an impact on tcp connections:

    net.ipv4.tcp_tw_reuse = 1, net.ipv4.tcp_tw_recycle = 1 These three The parameters are mainly to speed up TCP recycling. In a high-load system, if the TCP connection is full, an error of connecting to the database 500 will occur, so these three parameters are very useful.
  4. net.core.wmem_default = 87380, net.core.wmem_max = 16777216, net.core.r0mem_default = 87380, net.core.rmem_max = 16777216 The above 4 parameters determine the default and maximum values ​​of the tcp connection receive and send buffer sizes. For the database, the values ​​of these parameters should be adjusted slightly larger.
  5. net.ipv4.tcp_keepalive_time = 120, net.ipv4.tcp_keepalive_intvl = 30, net.ipv4.tcp_keepalive_probes = 3 The above three parameters are used to reduce the amount of tcp system resources occupied by failed connections and speed up the efficiency of resource recycling.
    net.ipv4.tcp_keepalive_time represents the time interval for tcp to send tcp_keepalive detection messages, in seconds. , used to confirm whether the tcp connection is valid. net.ipv4.tcp_keepalive_intvl is used to resend the detection message after detecting that the tcp connection does not respond, in seconds, net.ipv4.tcp_keepalive_probes indicates that the tcp connection is being recognized How many tcp_keepalive probe messages need to be sent before failure. The default values ​​of these three parameters are a little too large for an ordinary system, so they are changed to smaller values ​​here.
  6. kernel.shmmax = 4294967295 This parameter is one of the most important parameters in the Linux kernel parameters and is used to define the maximum value of a single shared memory segment.

    Note:
      This parameter should be set large enough to accommodate the entire Innodb buffer pool size in a shared memory segment.
    • The size of this value is for a 64-bit Linux system. The maximum possible value is the physical memory value - 1 byte. The recommended value is greater than half of the physical memory segment. Generally, it is greater than the size of the Innodb buffer pool. Can take physical memory - 1 byte.
  7. vm.swappiness = 0 This parameter will have a significant impact on performance when memory is insufficient. This parameter tells the Linux system kernel not to use the swap area unless the virtual memory is completely full.

    Linux system memory swap partition: When the Linux system is installed, there will be a
    special disk partition, called system swap partition. If we use free -m to view the system, we can see something similar to the following, where swap is the swap partition. When the operating system does not have enough memory, it will write some virtual memoryto the disk's swap areaand memory swapping will occur. Completely disabling the swap partition on the Linux system where the MySQL service is located will bring the following two risks:
      Reducing the performance of the operating system
    • It is easy to cause
    • memory overflow, crashed , or were Killed by the operating system
Increase resource limits (/etc/security/limit.conf )


limit.confThis file is actually the configuration file of Linx PAM, which is the plug-in authentication module. One of the more important parameter configurations is the limit on the number of open files.

MYSQL advanced for big data learning Conclusion: Increase the number of open files to 65535 to ensure that enough file handles can be opened.
Note: Modifications to this file need to be restarted to take effect.

Disk scheduling policy (/sys/block/devname/queue/scheduler)
You can use the command cat /sys/block/sda/queue/scheduler to view the scheduling policy used by the current disk. The following noop anticipatory deadline [cfq] is the system’s default cfq scheduling policy.
Under the MySQL database service, cfq is not suitable because during the working process of MySQL, cfq will insert some unnecessary requests in the queue, resulting in poor response time.
MYSQL advanced for big data learning
In addition to the cfq scheduling strategy, there are also the following strategies:
noop (elevator scheduling strategy):
MYSQL advanced for big data learning
deadline (deadline scheduling strategy):
MYSQL advanced for big data learning
anticipatory (anticipatory I/O scheduling policy):
MYSQL advanced for big data learning
We can enter the following command to change the disk scheduling policy:
echo schedulerName > / sys/block/sda/queue/scheduler
For example: echo deadline > /sys/block/sda/queue/scheduler

4 file system pair Impact on performance

It is recommended to use the XFS file system. The following parameters need to be configured under EXT3 and EXT4:
MYSQL advanced for big data learning
Mount parameters of the EXT3/4 system (/etc/fstab ):

  • data=writeback | ordered | journal
    This parameter has three optional values, writeback means only metadata is written When entering the log, metadata writing and data writing are not synchronized. This is the fastest configuration and is usually the best choice for InnoDB since InnoDB originally has its own transaction log. ordered will only record metadata, but provides some consistency guarantees. Before writing metadata, the data will be written first to make them consistent. This option is slightly slower than writeback A little, but it's safer to crash. journal Provides the behavior of atomic logs, which will be recorded in the log before the data is written to the final log. This option is obviously unnecessary for InnoDB and is the slowest of the three.
  • noatime, nodiratime
    These two options are used to record the access time of the file and the time of reading the directory. Setting these two parameters can reduce some write operations. The system does not need to write operations to record the above two times when reading files and directories.
    The following are some configurations in the file /dev/sda1/ext4:
    noatime,nodiratime,data=writeback 1 1

5 MySQL Architecture

The top layer of the architecture is called the client. This layer represents the client that can connect to mysql through the mysql connection protocol, such as PHP, JAVA, C API, .Net as well as ODBC, JDBC, etc. It can be seen from here that this layer is not unique to the mysql architecture. Most CS architecture services adopt this architecture. This layer mainly completes some functions such as connection processing, authorization authentication and security. Each client connected to mysql has a thread in the server process. The query of this connection will only be executed in this thread. As we mentioned earlier, each connection query only uses one CPU. core.
Then in the second layer of this system, most of the mysql core services are in this layer, as shown in the figure below.
MYSQL advanced for big data learning
Our commonly used DDL or DML statements are defined on this layer. But we just need to remember one thing. All cross-storage engine functions are implemented in this layer, because this layer is also called the service layer.
The third layer of our structural system is the storage engine layer. MySQL is a very excellent open source database, which defines a series of storage engine interfaces. As long as it meets the requirements of the storage engine, we can develop MySQL Come up with a storage engine that fully meets your needs, such as our commonly used InnoDB. Currently, there are many storage engines supported by mysql, as shown in the following figure:
MYSQL advanced for big data learning
Note: Storage engine It is for tables rather than for libraries (different tables in a library can use different storage engines)
Below we select some commonly used storage engines for a brief explanation. The storage engine used by mysql will The performance of the database has a direct impact. I also hope that you can carefully understand some of the characteristics of the storage engine before using the storage engine.

More related free learning recommendations: mysql tutorial(Video)

Level Features Whether it is redundant Number of disks Read Write

The above is the detailed content of MYSQL advanced for big data learning. For more information, please follow other related articles on the PHP Chinese website!

Statement:
This article is reproduced at:csdn.net. If there is any infringement, please contact admin@php.cn delete