
Learn to process tens of millions of records in SQL Server

coldplay.xixi · 2020-11-27 16:42:29

The SQL tutorial column introduces how to process tens of millions of records in SQL Server.



Project background

This was a project I did for a certain data center, and its difficulty was outrageous. It truly made me feel that the marketplace is a battlefield and I was just one soldier in it: too many tactics, too many contests between senior managers, too many inside stories. I'll write a separate blog post about the specifics of this project when I have time.

This project required environmental monitoring. For now, let's call the monitored equipment "collection devices" and the attributes of those devices "monitoring indicators". The project requirements: the system must support no fewer than 100,000 monitoring indicators; each indicator's data must be updated at least once every 20 seconds; and storage latency must not exceed 120 seconds. Simple arithmetic then gives the ideal load: 300,000 records per minute, 18 million per hour, 432 million per day. In reality the data volume runs about 5% larger than this. (Most of it is actually junk that could be removed by data compression, but if others want to mess with you, what can you do?)
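Spelled out, the arithmetic behind those figures is:

    100,000 indicators ÷ 20 s  = 5,000 records/s
    5,000 records/s × 60       = 300,000 records/minute
    300,000 × 60               = 18,000,000 records/hour
    18,000,000 × 24            = 432,000,000 records/day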

Those are the figures the project demanded. I imagine many students with big-data experience will scoff: is that all? Well, I had also read a great deal about big-data processing, but had never actually dealt with it; reading other people's lucid explanations, distributed this and read/write separation that, everything seemed easy to solve. But the problem is not that simple. As I said above, this was a very bad project, a typical case of vicious competition in the industry.

  1. There were no extra servers. Besides the database and the centralized collector (that is, the data parsing, alarm, and storage programs), this one server also had to support a 300,000-point northbound SNMP interface. Before the program was optimized, CPU usage stayed above 80% year-round. Because the project required dual-machine hot standby, to save time and avoid unnecessary trouble we put the related services together so we could make full use of the HA features (an externally purchased HA system).
  2. The data accuracy requirements were extreme: from the bottom-level collection system up to the top-level monitoring system, not a single piece of data was allowed to differ.
    Our system architecture is as follows; you can see that the pressure on the database is very high, especially on the LevelA node:
  3. The hardware configuration was as follows:
    CPU: Intel® Xeon® Processor E5-2609 (4 cores, 2.40 GHz, 10 MB, 6.4 GT/s)
    Memory: 4 GB (2×2 GB) DDR3 RDIMM, 1333 MHz, ECC
    Hard disk: 500 GB 7200 RPM 3.5'' SATA3, RAID 5
  4. Database version: SQL Server 2012 Standard Edition, genuine software supplied by HP, which lacks many of the powerful features of Enterprise Edition.


Writing Bottleneck

The first obstacle we encountered: with the existing program, SQL Server simply could not handle this volume of data. What exactly was going on?

Our storage structure

Generally, to store large amounts of historical data we split tables physically by day; otherwise, with millions of records arriving per day, a single table would hold hundreds of millions of rows within a year. Our original table structure was therefore:

CREATE TABLE [dbo].[His20140822](
    [No] [bigint] IDENTITY(1,1) NOT NULL,
    [Dtime] [datetime] NOT NULL,
    [MgrObjId] [varchar](36) NOT NULL,
    [Id] [varchar](50) NOT NULL,
    [Value] [varchar](50) NOT NULL,
 CONSTRAINT [PK_His20140822] PRIMARY KEY CLUSTERED
 (
    [No] ASC
 ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

No serves as the unique identifier; the other columns are the collection device Id (MgrObjId, a GUID), the monitoring indicator Id (varchar(50)), the record time, and the recorded value. The collection device Id and monitoring indicator Id are indexed to make lookups fast.
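For reference, that lookup index would have looked something like the following sketch (the article doesn't give the exact DDL, so the index name is assumed):

    -- Hypothetical reconstruction of the original lookup index
    CREATE NONCLUSTERED INDEX Idx_His20140822_MgrObjId_Id
        ON [dbo].[His20140822] ([MgrObjId], [Id]);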

Batch writing

Writing was done with SqlBulkCopy. Yes, that's it — the class that claims to write millions of records in seconds:

    // Requires: using System.Data; using System.Data.SqlClient;
    public static int BatchInsert(string connectionString, string desTable, DataTable dt, int batchSize = 500)
    {
        using (var sbc = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.UseInternalTransaction)
        {
            BulkCopyTimeout = 300,        // seconds before the bulk copy times out
            NotifyAfter = dt.Rows.Count,  // progress notification threshold (no handler attached here)
            BatchSize = batchSize,        // rows sent to the server per batch
            DestinationTableName = desTable
        })
        {
            // Map each source column to the destination column of the same name
            foreach (DataColumn column in dt.Columns)
                sbc.ColumnMappings.Add(column.ColumnName, column.ColumnName);
            sbc.WriteToServer(dt);
        }

        return dt.Rows.Count;
    }

What's the problem?

The above architecture was fine at 40 million records per day. But when the configuration was raised to the scale described in the background above, the centralized monitoring program ran out of memory. Analysis showed that too much incoming data was being held in memory without being written to the database in time; production outpaced consumption until memory overflowed and the program stopped working.

Where is the bottleneck?

Was it the RAID array? The data structure? The hardware? The SQL Server version? The lack of a partitioned table? Or a problem with the program itself?

We only had one week at the time; if we failed, the project supervisor would send us packing. So began the feat of working 48 hours straight, plus calling people everywhere for help...

But what the moment demanded was to calm down, and calm down again... The SQL Server version? The hardware? Neither was likely to be replaced at that point. The RAID array? Probably not the cause. So what was it? We genuinely could not calm down.

You may not appreciate how tense the atmosphere was; even after all this time, I find it hard to put myself back in that situation. But it can be said: perhaps we have all sorts of methods now, or see more clearly as outsiders, but when a project pressures you to the point of giving up, your thinking is constrained by the environment around you, and major deviations can occur. It may make you think faster, or it may make your thinking stall. In that high-pressure environment, some colleagues made even more elementary mistakes, their thinking fell into disarray, and their efficiency dropped further... No sleep for 36 hours, or staying on the construction site (on rainy days it was mud everywhere, which dried into yet more mud), dozing for two or three hours, then carrying on — for a week! Or longer!

Many people offered many ideas, but they all seemed useful yet not useful. Wait — why did they "seem to work, but not quite"? I vaguely grasped a hint of direction. What was it? Right: verification. We were now running in a live environment; no problems before did not mean no problems under the current pressure. Analyzing one small function inside a large system has too wide a blast radius, so we should break it down — yes, a "unit test", a test of a single method. We needed to verify each function: where does each independent step spend its time?

Testing and verifying the system bottleneck, step by step

Modify the BulkCopy parameters
The first thing I thought of was tuning the BulkCopy parameters, BulkCopyTimeout and BatchSize. With constant testing and adjustment, the results always fluctuated within a certain range without any real impact. It may have affected some CPU counters, but it fell far short of my expectations: the write speed still hovered between 2,000 and 10,000 rows per 5 seconds, far from the roughly 100,000 records every 20 seconds that the requirements implied.

Storage by collection device
Yes — the structure above stores one record per value of each indicator. Isn't that terribly wasteful? So would one record per device per collection time be feasible? The question was how to handle different collection devices having different attributes. Here a colleague showed his talent: store the monitoring indicators and their values in XML format. Wow, can that be done? For queries, FOR XML can be used.

So we arrived at this structure: No, MgrObjId, Dtime, XMLData.
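For illustration, a minimal sketch of what such a table might look like (the column types are my assumptions; the article only names the columns):

    -- Sketch of the per-device structure; types are assumed, not from the article
    CREATE TABLE [dbo].[His20140822Xml](
        [No]       [bigint] IDENTITY(1,1) NOT NULL,
        [MgrObjId] [varchar](36) NOT NULL, -- collection device Id
        [Dtime]    [datetime] NOT NULL,    -- collection time
        [XMLData]  [xml] NOT NULL          -- all indicator Ids and values for this device at this time
    );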

Verification showed the result was slightly better than before, but not markedly so.

Table partitioning???
I hadn't learned that skill yet. From the articles online it looked quite complicated, and with so little time, I didn't dare try it.

Stop other programs
I knew this could not be the real solution, since the software and hardware architecture could not be changed in the short term. But I wanted to verify whether those factors had an effect. The improvement was indeed noticeable, but it still did not meet the requirements.

Is this SQL Server's bottleneck?
Out of ideas — was this simply SQL Server's bottleneck? Checking materials online suggested it might be an IO bottleneck. Damn it, what then — upgrade the server and replace the database? But would the project owner pay for that?

Wait, there seemed to be one more thing: the index. The index! An index's existence affects inserts and updates.

Remove the index

Yes, queries will definitely get slower once the index is removed, but I first had to verify whether removing it would speed up writing. So I decisively dropped the indexes on the MgrObjId and Id fields.
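A minimal sketch of that step, reusing the hypothetical index name from the earlier sketch:

    -- Drop the lookup index before bulk loading (name is hypothetical)
    DROP INDEX [Idx_His20140822_MgrObjId_Id] ON [dbo].[His20140822];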

We ran it, and a miracle happened: each batch of 100,000 records was written within 7 to 9 seconds, meeting the system requirements.

How to solve the query?

But a table accumulates more than 400 million records a day, and that is impossible to query without an index. What to do!? I thought of our old trick again: physical table splitting. We had originally split tables by day; now we would split them by hour. That gives 24 tables, each holding only about 18 million records.

Then query the history of one attribute over an hour or several hours. The result: slow! Slow!! Slow!!! Querying across more than 10 million records without an index is simply unthinkable. What else could be done?

Keep splitting, I thought. We could also split tables by the underlying collector, since the collection devices differ between collectors, and when we query a historical curve we only ever look at a single indicator's curve — so it can safely be scattered across different tables.

No sooner said than done. By splitting tables across the 10 embedded collectors and the 24 hours of the day, we generated 240 tables per day (history table names look like His_001_2014112615), and finally solved the problem of writing 400+ million records per day while supporting simple queries!!!
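As a sketch of how a writer might pick its target table under that naming scheme (the collector number '001' is just an example):

    -- Compute the per-collector, per-hour table name: His_<collector>_<yyyymmddhh>
    DECLARE @collector char(3) = '001';
    DECLARE @now datetime = GETDATE();
    DECLARE @table sysname = 'His_' + @collector + '_'
        + CONVERT(char(8), @now, 112)                                -- yyyymmdd
        + RIGHT('0' + CAST(DATEPART(HOUR, @now) AS varchar(2)), 2);  -- hh
    SELECT @table AS TargetTable;  -- e.g. His_001_2014112615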

Query optimization

With the above problem solved, half of this project's difficulty was out of the way, and the project supervisor was too embarrassed to come looking for trouble — who knows what tactical arrangement lay behind that.

A long time passed, and now, near year-end, the problem came again: the goal was to drag you down so that your other projects could not pass acceptance by year-end.

This time the demand went like this: the spec above assumed 100,000 monitoring indicators, but now the system was actually live with only about 50,000 devices, which obviously fails the tender requirements and cannot pass acceptance. So what did these clever people propose? Since the monitoring indicators had been halved, halve the time as well: with the current 50,000 devices, the data has to be stored within 10 seconds. Good grief — by that logic, if we had only 500 monitoring indicators, wouldn't we have to store them within 0.1 seconds? Did no one consider the monitored devices' feelings?

But when someone wants to play you, what can you do? Take the hit. After the interval was cut to 10 seconds, a new problem appeared. Analyzing the logic above carefully: tables are split by collector, and now there were fewer collectors but more records per table. Writing could still keep up, but each table approached 4 million records — nearly 6 million for collection devices with many monitoring indicators. How to break through that?

So the technical staff held a meeting to discuss countermeasures.

How to optimize queries without adding an index?

A colleague suggested that the order of the WHERE clause conditions affects the query, since processing continues on the already-filtered result: filter out part of the data with one condition first, then apply the next condition. It sounded plausible — but doesn't the SQL Server query optimizer handle this automatically? Forgive my naivety; it was just a gut feeling that, like the VS compiler, it would probably optimize this on its own.

Either way, the facts had to speak:

After the colleague modified the client, testing reported a substantial improvement. I looked at the code:

Could it really have such a big effect? Wait — had we forgotten to clear the cache, creating an illusion?
So I asked the colleague to run the following statements to get more information:

-- Before optimization
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SET STATISTICS IO ON
SELECT Dtime, Value FROM dbo.his20140825
WHERE Dtime >= '' AND Dtime <= '' AND MgrObjId = '' AND Id = ''
SET STATISTICS IO OFF

-- After optimization
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SET STATISTICS IO ON
SELECT Dtime, Value FROM dbo.his20140825
WHERE MgrObjId = '' AND Id = '' AND Dtime >= '' AND Dtime <= ''
SET STATISTICS IO OFF

The results were as follows:

The pre-optimization version actually did better?

Looking carefully at the IO statistics: the read-ahead counts were identical, meaning the data records being queried were the same, and the physical reads and table scans were also identical. Logical reads differed slightly, which should come down to cache hit counts. In other words, without an index, the order of conditions in the WHERE clause has no obvious effect on query optimization.

So indexing was the only way left.

Attempts at building an index

Building an index is not a trivial matter; it requires some basic knowledge. In the process I took quite a few detours before finally getting the index built properly.

The experiments below were validated against the following total record count:

Indexing on a single field
This idea mainly came from the way I build my in-memory data structures: a nested Dictionary keyed first by MgrObjId and then by Id. I assumed that creating an index on MgrObjId first, and then an index on Id, would make SQL Server queries faster.

First I built an index on MgrObjId: index size 550 MB, taking 5 minutes 25 seconds. The result, just as in the estimated plan shown above, was that it had no effect at all — it was actually slower.
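The attempt was roughly this (the index name is assumed for illustration):

    -- Single-field attempt; index name is illustrative
    CREATE NONCLUSTERED INDEX Idx_His20141008_MgrObjId ON dbo.his20141008 (MgrObjId);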

Indexing on multiple fields
OK, since the above didn't work, what about an index over multiple fields?

CREATE NONCLUSTERED INDEX Idx_His20141008 ON dbo.his20141008(MgrObjId, Id, Dtime)

The result: query speed indeed doubled:

Wait — is this all an index buys? Spending 7 minutes 25 seconds and 1.1 GB of space in exchange for just this? Something had to be wrong, so I started digging through materials and consulting related books, and eventually made real progress.

Building the index correctly

First, we need to understand a few key points about indexes:

  • Lead with the index fields that repeat least; filtering on the most selective fields gives the best effect. For our table, if a clustered index on No exists, it is best to put No first in the WHERE clause, then Id, then MgrObjId, and time last; since each table covers just one hour, it is usually best not to filter on time at all.
  • The order of the WHERE conditions determines whether the query analyzer uses the index. For example, with an index on MgrObjId and Id, WHERE MgrObjId='' AND Id='' AND Dtime='' will use an index seek, whereas WHERE Dtime='' AND MgrObjId='' AND Id='' will not necessarily do so.
  • Put the non-key result columns into included columns. Since our conditions are MgrObjId, Id, and Dtime, the result only needs to return Dtime and Value; with Dtime and Value in the included columns, the index itself carries those values and the physical table never needs to be touched, giving optimal speed.

Following the above principles, we created this index:

CREATE NONCLUSTERED INDEX Idx_His20141008 ON dbo.his20141008(MgrObjId, Id) INCLUDE(Value, Dtime)

Building it took just over 6 minutes, and the index size is 903 MB.

Let’s take a look at the estimated plan:

As you can see, the index is fully used here, with no extra lookups. The actual execution took less than 1 second: the results were filtered out of 11 million records in under a second!! Awesome!!

How to apply the indexes?

Now that writing worked and querying worked, how to combine the two? Data written an hour or more ago can be indexed, but the current hour's data should not be. In other words, do not create the index when the table is created!!
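A minimal sketch of that deferral, reusing the hypothetical His_&lt;collector&gt;_&lt;yyyymmddhh&gt; naming from earlier; something like this could run from an hourly SQL Agent job:

    -- Index the table for the hour that just finished (collector '001' for illustration)
    DECLARE @prev datetime = DATEADD(HOUR, -1, GETDATE());
    DECLARE @table sysname = 'His_001_'
        + CONVERT(char(8), @prev, 112)
        + RIGHT('0' + CAST(DATEPART(HOUR, @prev) AS varchar(2)), 2);
    DECLARE @sql nvarchar(max) =
          N'CREATE NONCLUSTERED INDEX Idx_' + @table
        + N' ON dbo.' + QUOTENAME(@table) + N'(MgrObjId, Id) INCLUDE(Value, Dtime);';
    EXEC sp_executesql @sql;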

What else can we optimize?

You can try read/write separation, writing to two databases: one real-time, one read-only. Data from within the last hour is queried from the real-time database; data older than an hour from the read-only database. The read-only database is loaded on a schedule and then indexed; data older than a week is analyzed, processed, and then stored. Then, whatever time span a query covers, it can be handled correctly: the real-time database for the last hour, the read-only database for an hour to a week ago, and the report database for anything older than a week.
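A sketch of the hourly hand-off under that scheme (the database names RealtimeDB and ReadOnlyDB, and the concrete table name, are hypothetical):

    -- Copy the finished hour into the read-only database, then index it there
    INSERT INTO ReadOnlyDB.dbo.His_001_2014112614 (Dtime, MgrObjId, Id, Value)
    SELECT Dtime, MgrObjId, Id, Value
    FROM RealtimeDB.dbo.His_001_2014112614;

    CREATE NONCLUSTERED INDEX Idx_His_001_2014112614
        ON ReadOnlyDB.dbo.His_001_2014112614 (MgrObjId, Id) INCLUDE (Value, Dtime);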

If you do not need physical table splitting, you can instead rebuild the indexes in the read-only database on a regular schedule.

Summary

To process hundreds of millions of records (historical data) in SQL Server, you can proceed as follows:

  • Remove all indexes from the table
  • Insert with SqlBulkCopy
  • Split tables or partition to reduce the total data volume in each table
  • Create the index only after a table is completely written
  • Specify the index fields correctly
  • Put the fields you need to return into included columns (so the index covers the whole query)
  • Return only the required fields when querying


Reproduced from learnku.com.