Summary of a Douban Architecture Change Sharing Session
The key points are as follows:
Currently there are about 23 PC servers.
Roughly 20 million page views (PV) per day, and about 3 million registered users.
Most of the large tables hold tens of millions of rows.
A 5-person algorithm team, plus 11 developers in total, full-time and part-time combined. (Previously, the tech sharing I saw from Baixing.com mentioned a team of only 10.)
In 2006 there were about 1.2 million dynamic requests per day, and the main bottleneck was disk I/O. After taking venture capital they had money for hardware, and bought two 1U servers (dual-core, 4 GB RAM):
one as an application server and one as a database server. They migrated to a dual-line (Telecom + Netcom) data center and used DNS to resolve users to IPs on different network segments (figuring out by hand which segments belonged to China Telecom and which to Netcom). After hearing the data-center adjustments mentioned at the end of the talk, this looks like a detour: choosing a good data center sidesteps the DNS juggling entirely (the later conclusion was that distributing traffic by IP segment is unreliable).
Concretely: host in a data center that supports multi-line connectivity (Telecom, Netcom, the education network, Tietong, and so on). The Alibaba Cloud our company uses now is multi-line.
That way there is no need to classify users by IP segment yourself (that is, to determine whether a visitor is on Telecom, Netcom, etc.).
Two principles for using a memory cache (Douban uses memcached):
1. Cache data that is expensive to produce.
2. Cache data that will be reused. If a result is needed only once, caching it is pointless no matter how expensive it was to compute.
My understanding: the cache itself consumes memory, so there is no need to waste it. Data that will never be read again just occupies RAM (and RAM is neither cheap nor free of server resources).
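The two principles above boil down to a get-or-compute pattern. A minimal sketch in Python, using a plain dict as a stand-in for a memcached client (real clients expose the same `get`/`set` shape; all names here are illustrative, not Douban's code):

```python
class FakeCache:
    """Stand-in for a memcached client; real clients expose get/set too."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value, expire=0):
        # A real memcached client would honor the expiry; ignored here.
        self._store[key] = value


cache = FakeCache()
query_count = {"n": 0}


def expensive_query(user_id):
    # Pretend this is a slow, resource-hungry database query
    # (principle 1: expensive to produce).
    query_count["n"] += 1
    return {"user_id": user_id, "name": "user-%d" % user_id}


def get_user(user_id):
    # Worth caching only because the same profile is read many
    # times (principle 2: the data is reused).
    key = "user:%d" % user_id
    value = cache.get(key)
    if value is None:
        value = expensive_query(user_id)
        cache.set(key, value, expire=300)
    return value
```

Calling `get_user(1)` twice hits the expensive query only once; data read a single time would never earn its place in the cache.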
Douban's memcached hit rate is quite high, which relieves a lot of pressure.
InnoDB handles concurrent access well because it uses row-level locking. The choice between MyISAM and InnoDB follows the business characteristics of each module: read-heavy, write-light modules use MyISAM; write-heavy modules use InnoDB. Tables are split into libraries by functional module (all tables related to one functional module go into one library), combined with the classic MySQL master-slave architecture he mentioned, so each library actually exists in three copies (the master and slaves he referred to); that should mean three MySQL slave servers.
The data layer operates across multiple libraries and resolves the specific library and table from the parameters passed in (I didn't catch the details).
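The talk did not detail that data layer. As a rough illustration of routing by functional module (the table and library names here are made up, not Douban's):

```python
# Hypothetical mapping: every table of a functional module lives in
# that module's library, which is what keeps module-local queries
# on a single MySQL master-slave group.
TABLE_TO_LIBRARY = {
    "books": "lib_books",
    "book_reviews": "lib_books",
    "movies": "lib_movies",
    "movie_reviews": "lib_movies",
    "users": "lib_users",
}


def library_for(table):
    # Resolve which library (and hence which server group) owns a table.
    return TABLE_TO_LIBRARY[table]
```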
Master-slave replication lag has always been a common problem.
A lesson about buying hard drives: spend more money up front on better, high-speed disks. Upgrading disks in place is impractical, and if the business grows fast, slower disks will have to be replaced anyway once the site can no longer cope; even though they cost more, the faster disks are not wasted.
At 2 million dynamic requests per day, Douban mentioned that serving static small files (user avatars, cover images) made disk I/O the bottleneck. Early on I was naive enough to put all images into a single directory, ending up with hundreds of thousands of small files in it (which made the `ls` command unusable; running it would hang the server). The fix was to split the files across directories, with 10,000 files per directory.
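Bucketing file IDs into subdirectories is the standard fix. A sketch (the 10,000-per-directory limit follows the text; the path layout and names are invented for illustration):

```python
def shard_path(file_id, files_per_dir=10000):
    # Bucket sequential file IDs so that no directory ever holds more
    # than files_per_dir entries, keeping `ls` and lookups fast.
    bucket = file_id // files_per_dir
    return "img/%05d/%d.jpg" % (bucket, file_id)
```

File 123456 lands in `img/00012/`, alongside at most 9,999 siblings.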
There is a dedicated data-mining team. The algorithm team runs matrix computations and writes the results into MySQL for the front end to query and display.
DoubanFS was developed specifically for image storage. Its mechanism is modeled on Amazon's: each write stores three copies of the data.
Disk random seek matters more than throughput; the performance bottleneck at the time was seek speed. (This is similar to what Taobao's image file-system analysis found: massive small-image access causes frequent head positioning, and that latency dominates.)
Later, all MyISAM tables were converted to InnoDB tables.
InnoDB's cache (the buffer pool) is managed by the MySQL process itself in memory, while MyISAM's caching is file-based (controlled by the operating system). Previously, using both MyISAM and InnoDB tables meant the two caches competed for the same memory, which was inefficient, so everything was moved onto the InnoDB storage engine. (I don't fully understand this part; my understanding is that the goal was to make better use of memory.)
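Standardizing on InnoDB lets most of the server's memory go to a single cache. A hedged `my.cnf` sketch of what that looks like (the sizes are illustrative, not Douban's settings):

```ini
[mysqld]
# Give InnoDB's self-managed buffer pool the bulk of RAM
# (commonly ~70-80% on a dedicated database server).
innodb_buffer_pool_size = 3G

# With no MyISAM tables left, the MyISAM key cache can stay small.
key_buffer_size = 16M
```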
Application server failover: handled by nginx's built-in capabilities.
Image traffic became a major cost, so they moved to a Tianjin data center where rack space is cheaper, relocating all the data-mining data and image data there.
Two data centers, Beijing and Tianjin, each running its own MySQL master-slave setup.
Search: they used MySQL's full-text index at first, then migrated to Sphinx (used in combination with MySQL, as a MySQL storage engine), and later moved to Xapian.
Why they dropped Sphinx was not explained.
Images were stored in MogileFS at first; DoubanFS was developed later to replace it. Reason for the migration: MogileFS hit a performance bottleneck. Because MogileFS keeps its metadata (namespace and file locations) in MySQL, it gets slower as the row count grows, and reading large numbers of small files also requires hitting the database, which hurts speed. The row count was growing very fast at the time, and the MySQL database became the bottleneck.
Large fields hurt database performance. The row counts in those tables were actually not that high; the impact came from the large fields. The large text fields were pulled out and stored in the in-house DoubanDB (a key-value database, a simplified design modeled on Amazon's Dynamo), with TokyoCabinet as the underlying storage. Later, DoubanFS was rewritten on top of DoubanDB to store the images.
They use a dual-master scheme. This solves the replication-delay problem: writes and the subsequent reads go to the same master, so the data read is always current. Previously, writing to the master and then reading from a slave could return stale data.
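One common way to make dual masters safe is to pin each library (or shard) to one master, so its reads always follow its own writes. A sketch of deterministic pinning (the class and names are illustrative, not Douban's code):

```python
import zlib


class DualMasterRouter:
    """Pin each library to one of two mutually replicating masters."""

    def __init__(self, masters):
        self.masters = masters

    def pick(self, library):
        # Stable hash: the same library always maps to the same master,
        # so a read right after a write sees fresh data instead of
        # waiting on replication from the other master.
        idx = zlib.crc32(library.encode("utf-8")) % len(self.masters)
        return self.masters[idx]
```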
LVS was deployed (for load balancing).
Spread was used as the message queue at first; it was later replaced by RabbitMQ.
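The point of a message queue here is moving slow work (mail, feed fan-out) off the request path. A self-contained sketch with Python's stdlib `queue` standing in for RabbitMQ (a real deployment would use a broker client; everything here is illustrative):

```python
import queue
import threading

tasks = queue.Queue()
sent = []


def worker():
    # Consumer: drains tasks that request handlers enqueued, so the
    # web response never waits on slow work like sending mail.
    while True:
        task = tasks.get()
        if task is None:  # shutdown sentinel
            break
        sent.append("sent:" + task)
        tasks.task_done()


t = threading.Thread(target=worker)
t.start()

# Producer side: a request handler just enqueues and returns.
tasks.put("welcome-mail-for-alice")
tasks.put("welcome-mail-for-bob")

tasks.put(None)  # tell the worker to stop
t.join()
```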
========================================================
Summary: copying their architecture and technical solutions wholesale is not feasible. The real value is in learning from their mistakes and the design thinking behind the choices (mainly, understanding why something was done that way and what considerations drove it).
Lessons: disk selection and data-center selection. Choose high-RPM disks; the higher up-front cost is worth it.
For splitting databases, partition by function first. Horizontal partitioning is not needed yet. Putting each functional module's related tables into their own library, or on a separate server, is a necessary stage.
Memory is worth spending money on; a machine can never have too much. Databases are memory-hungry, and memory often becomes the bottleneck (large numbers of connections and in-flight computation can exhaust it). Memcached is not free either (network I/O, CPU cost), so be careful what you put into it.
Avoid database JOIN operations. (This matches a point Shi Zhan shared earlier: reduce JOINs, preferring to split one query into several round trips. Facebook's architecture write-ups also mention avoiding JOINs.)
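Splitting a JOIN into two simple queries stitched together in application code looks like this (SQLite in memory for a runnable demo; the schema and data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE reviews (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT);
    INSERT INTO users   VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO reviews VALUES (10, 1, 'great'), (11, 2, 'ok'), (12, 1, 'meh');
""")


def reviews_with_names(conn):
    # Query 1: fetch the reviews alone -- no JOIN.
    reviews = conn.execute(
        "SELECT id, user_id, body FROM reviews").fetchall()
    # Query 2: fetch only the users those reviews mention.
    user_ids = sorted({uid for _, uid, _ in reviews})
    marks = ",".join("?" * len(user_ids))
    names = dict(conn.execute(
        "SELECT id, name FROM users WHERE id IN (%s)" % marks,
        user_ids).fetchall())
    # Stitch the two result sets together in application code.
    return [(rid, names[uid], body) for rid, uid, body in reviews]
```

The two queries can even hit different libraries or caches, which a JOIN cannot.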
Overall, the main database lesson I took from Douban concerns splitting databases. At their traffic level, horizontal partitioning is unnecessary; they split into libraries by business function, putting a module's related tables into the same library, and then run master-slave replication on each database server to keep a hot standby of the data.
The industry's typical sharding scenarios are read-heavy workloads with modest transactional-safety requirements; sharding fits those cases very well.
A price check on SATA disks: 450 GB drives cost over 1,000 yuan each.
SATA drives had a relatively high failure rate, so they switched to SCSI drives.
For image storage (or small-file storage in general), the volumes involved (traffic cost, storage cost) led them to develop their own file system.
If image storage relies on a database, the database does become a bottleneck once the data volume is large. (No wonder Taobao's image file system hides part of the metadata inside the image's stored file name.)
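The trick of hiding metadata in the file name can be sketched as follows: pack the location info into the name at write time and decode it at read time, so lookups never touch a metadata database. (The fields and encoding here are invented for illustration, not Taobao's actual format.)

```python
import base64
import struct


def encode_name(block_id, offset):
    # Pack location metadata into the file name itself.
    raw = struct.pack(">II", block_id, offset)
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def decode_name(name):
    # Recover the location from the name alone -- no DB lookup needed.
    padded = name + "=" * (-len(name) % 4)
    raw = base64.urlsafe_b64decode(padded)
    return struct.unpack(">II", raw)
```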
Question: with data centers in both Beijing and Tianjin, MySQL synchronizes data across them, and the data-mining programs in Tianjin write data back to Beijing. How fast is that link?
From what I found, this generally requires a dedicated fiber-optic channel between the sites.