Home >headlines >Sharing experience: Summary of three years of DBA at Ganji.com

Sharing experience: Summary of three years of DBA at Ganji.com

PHPz
PHPzOriginal
2017-02-15 15:25:372730browse

I joined the job at the beginning of 2013. At that time, I was in the stage of rapid growth in traffic. I gained a lot in my three-year DBA career, but in fact there were more pitfalls (tears... Later, when I was doing development, I gradually realized the "operation and maintenance" and There is indeed a communication problem in "development": knowledge asymmetry. How to solve it? Let's summarize the past three years.

There are a lot of JDs recruited on the market. , find a few as you change, and you will be able to find the commonality immediately

Planning, design, management, and migration of the database system

Daily maintenance, backup, optimization and recovery of the database

Master-Slave architecture construction and maintenance

Business system online support, database design review, providing architecture solutions

The database is not limited to MySQL, Oracle, if not divided into details, there will also be Redis, MongoDB and a series of NoSQL. The work content is the same. The first is high availability and stability. It cannot be shaken today and down tomorrow. The second is data security, such as backup and recovery, 14-year market audit, and active users on the mobile terminal. The data is restored from the backup. It can be seen that the effectiveness of the backup is the top priority. The last one is of course to serve the business and meet the business needs. You cannot go on strike just because your personal life is interrupted. I was called just after watching a movie. When I go back to handle the DB alarm, I feel like scolding my mother.

Tragic cases

Let’s give some tragic cases to make the judges happy~ Since the company has long been gone. No worries here.

1. Delete the entire table

The second-hand car classmate’s fault was that the SQL was misspelled without the where condition, and there was an error when writing the offline script. ... Finally, the DBA restored based on the row binlog. At least 2 times: (

Reflection: rd There are newbies one after another, and it’s useless no matter how many specifications you talk about. There is only one most complete solution, connect to proxy. , restrict all illegal sql. In addition, the rd online verification is not in place, and there must be something missing in the code review

2. Big seller problem

The real estate opened a free port in 2014. In just a few months, the real estate business table exploded to 100G, and some intermediary accounts posted more than 10W, causing abnormal database jitter. Weige temporarily cleaned up the excessive posting records and finally spent three months to slim down the table. , split the text field

Reflection: DBA’s insufficient monitoring of large tables

3. A large number of subqueries cross the main database

The main site master The library was once blocked by a subquery. After investigation, it was found that this problem was caused by a large number of RD subqueries. There were many RDs that wrote requests to the master that should have read the slave, but it did not cause an accident.

Reflection: Ganji DB typical 1 master N slave, without proxy protection, this kind of problem often occurs, and it cannot be solved by standard system alone

4. Report olap library problem

RP Kuwo and Wenwu took the blame, and their performance at the end of the year was at the bottom. The RD before Wenwu took over developed the intermediary merchant reporting system by himself. All calculations were based on DB. When the free port was opened, the amount of data exploded. The MyISAM read-write lock caused a large number of requests to be blocked. I heard that the company paid the merchant 300W because of the reporting continuity problem.

Reflection: This accident needs to be seen from a high place. The free port was opened too suddenly, and the project technical leader did not complete the test. The reporting system has not been designed and is completely handled by a new RD, who is only a college graduate. Looking back, Hadoop Spark can completely handle it. In the end, the DBA failed to track the large table in time and did not discover it in advance.

5. 50G Redis

Real estate business Redis has exploded from 20G to 50G, and it has not been optimized since I left. Later, there was a failure and the data was cleared?

My work

I spend most of my time communicating with developers and talking about life. After automan came online, few people came to me~ There is growth at every stage. There are people and things that I am grateful for. Going to the market gives me a platform to do the things I am interested in. I am very happy.

When I first arrived at the market, SQL was still online on jira. The semi-automation was done by operation and maintenance development students. It was often mentioned in the technical group. Very simple table creation or DML had to be done manually by the DBA. Intervention is very annoying. In addition, many table creation statements were not standardized and were sent back to RD for modification. They were very dissatisfied with this and thought that modifications were insignificant. This resulted in communication costs between RD and DBA. Shizhan would also regularly do database development when he was here. Training and then nothing more.

Started the development of the Automan platform in mid-2014, starting from the front page to the back-end message queue to SQL parser analysis. It was finally launched online with the help of classmate Liu Jun Xianhe. The platform reviews the developed SQL, passes through the sim simulation environment, and then automatically performs backup online, which is much more efficient than manual work.

The principle of this system is similar to other tools on the market, but there are great improvements. Later, I shared it once at a database conference, and I was afraid that I would not be criticized.

DBA experience

Strengthen the foundation: The foundation of DB is naturally stable, stable, and then stable. When there are too many instances, various failures will be encountered basically every day. If the master fails, use MHA switching (the latest one has gtid). If the slave fails, lvs will automatically remove the read traffic. Another one is backup, full incremental, regular backup validity testing, each piece requires human investment.

Hardware priority: There are two types of DB expansion: scale up and scale out. Generally, priority is given to heap hardware. If the buffer is not enough, add more memory. If 128G is not enough, use 192G. If the performance of the disk array card is not good, use SSD. If not, use flash board. In short, give priority to testing the filter hardware and gain time for architecture optimization.

Prepare for a rainy day: Optimize slow SQL and regularly issue reports for RD to tune. Generally, the problem is that the index has not been added. 99% of large SQLs are like this. A small number of them are due to unreasonable table design (no auto-increment). primary key, or frequently modified). Large table monitoring, slimming down where necessary, splitting down fields when necessary, both horizontally and vertically, archiving expired data regularly, basically these are the things.

Combined with business: Some optimizations make DBA tired to death, so it is better for RD to modify a line of code. DBAs should also have more contact with the business and understand business implementation. They don't need to contribute much to the business and don't take the blame... Just kidding. Understanding business allows you to think from a higher perspective, which is very meaningful.

Learn to say no: This rejection is not about going on strike and refusing to work, but about distinguishing between the reasonableness and urgency of the needs. Those that are unreasonable and not urgent can be dealt with directly. Those that are urgent but unreasonable can be temporarily passed and solved quickly. The problem can be corrected later. For example, olap runs in an online library, count(*) counting SQL can run counters asynchronously, and Redis is a good thing.

Learn to communicate: I have been working for a few years, I am still learning this, and I have made many mistakes. Communicate rights and responsibilities and set a timetable.

Study down-to-earth: Looking back, DBAs did not do well enough back then. Some of the reasons were due to lack of development capabilities, and many ideas stopped there. Operation and maintenance personnel must have development skills and be more sophisticated than business RDs in order to do a good job in operation and maintenance.

The contradictions and shields of operation and maintenance RD

KPIs are different, and the focus is naturally different. The first-line students also lack experience, especially those who have just started working 1~2 years, resulting in an asymmetry of information and knowledge. It is not difficult to solve this problem:

Newcomers need to be guided by mentors, and it would be most irresponsible to leave them alone. I feel nice has done a good job in this aspect. You still have to praise when you deserve it.

The support team must have sufficient wiki business documentation.

Automation is constrained by technology, not manual labor. Business interfaces are more constrained than before, and now all services are using thrift.

The memory of going to the market has become increasingly blurry, only...

After writing part of the summary, my former colleagues said that some were missing, so I will append them all to the end. , the copyright does not belong to me:)

20170214 The following content is from a former colleague: Li Rui

Recall the DBA life of going to the market

To summarize, there are many faults, which need to be dealt with at any time. These will be a little better until automan comes out and forces rd to go online through the platform.

Due to the problem of raid0, I encountered problems with the master hard disk at least 4-5 times and needed urgent treatment.

tg Encountered it once, ms cut times, it seems to be a disk problem.

Other slave backup machines have more hard disk failures, with up to 4 disk problems needing to be dealt with a week. The mysqldb in Ganji is generally at least 100G in size, and the disk of the data report database is 2.3T. It is impossible to restore the slave database through backup. It took me 2-3 weeks to get it done

im Swap problem

Im the swap problem must be a SQL problem. The main SQL query is to obtain data through order by and count. This problem has been unsolvable from the time I went to the market to the time I came out. The only way to solve the swap problem is to manually switch traffic to LVS, restart the slave, and then switch traffic back to LVS. Almost 1-2 times a week is needed. I told IM colleagues several times about IM SQL problems, hoping to make a counter for the count query, but in the end nothing happened. Turn off swap because you are afraid that the server will frequently oom. Finally, after classmate Zhao Shenju came, after turning on the parameters of preheating innodb_buffer_pool, the slave can be restarted directly without fear of a sudden increase in load due to preheating problems. Other classmates Zhao Shenju changed the numa limit memory, but the im swap was not solved in the end.

Backup

Backup problems, 1 is a disk space problem, and 1 is a raid0 problem. .

After you left, I supported myself independently for a month until Bi Changqi came. There were 6 or 7 backup machines, and the hard drives broke one after another. Log library, emp, and Wang Xufeng group, I forgot the name of the business line, the data increased to 800G, the backup machine was broken, and there was insufficient space. I simply stopped the backup, and finally only ensured the backup of some large databases such as ms, hp, tg, and tc. This 58-year-old colleague will probably be despised by them after taking over.

In fact, after Huawei's 32T backup machine came, the backup mechanism should be changed. Blame me

There are also cross-machine backups. Every 2 months, the four 2T disks will be full. , replace, move the disk, manually mount the RAID disk, and manually record in Excel. In the end, these disks were actually used to search for data.

Disk problem

Hp There are two large tables and the data needs to be cleaned regularly. Ms disk grows at a rate of 10G per day, and Ms requires a pcie card. Finally, it can be expanded from 800 to 1200. It can last for several months. Ms has several machines, and in the end they are only about 10G short of being full. All kinds of deleting logs, all kinds of moving data, and making up for the east wall. (I know that using a 400G SSD with xfs can save 20G of space compared with ext4, which is just enough for ms). Moreover, as the disk grows, the The backup machine has insufficient disk space, and the sim machine (which provides read-only services for development, I forgot what it is called) has insufficient space. There is also a report library. I want to apply for disks and servers, but there is no space in the cabinet. I just ran with a single library for a long time.

There is also the sql generated by Wang Xufeng group and tc through the framework

Generate redundant sub-tables, varchar type field conditions do not add single quotes, plus online table creation does not add indexes , sql needs to be checked regularly and optimized.

Painful hp, main library split.

It took more than a year and was not completely separated. Finally, I heard from Bi Changqi that Guazi used cars were split from these warehouses. Top-down, forced splitting. Split in 1-2 days.

php short connection, the number of connections is full

This last one, after you left, I occasionally analyzed the hp full log and found that hp every 1-2 times connection, accompanied by an empty connection. Connect does nothing quit. I don’t know what caused this problem. After correcting it, the problem with the number of HP connections has been resolved.

In summary, in Ganji, because of the skyrocketing data, we just responded blindly and did not cut the knot quickly and conduct spin-offs. Also, it is really necessary to have a DBA platform to manage, monitor, and submit SQL for review. It was not until later that I was able to reluctantly write it by imitating Automan.

The article is contributed by Zerun, a netizen of php Chinese website. Please indicate when reprinting. The address of this article: http://www.php.cn/toutiao-352102.html

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn