
Architecture evolution and design exploration of Ele.me

Introduction: In its early days, a website just needs to take a proven industry model and build it quickly. "Fast" is the first priority, and there is no need to spend much energy on architectural design. Only when the website enters its growth phase does it pay to invest heavily in architecture, so that it can carry the traffic when it explodes. Ele.me has been running for 8 years, its daily order volume now exceeds 9 million, and we have a fairly complete website architecture.
1. Website infrastructure

In the early days we adopted an SOA framework to make the site easier to scale. The SOA framework solves two problems:

1. Division of labor and collaboration

In the earliest days of the website there were perhaps only 1 to 5 programmers. Everyone worked on the same things, understood each other's work, and problems were often solved by shouting across the room.

But as headcount grows, that approach breaks down. One person cannot update the code and then redeploy everyone else's code along with it. So division of labor and collaboration have to be addressed.

2. Rapid expansion

Early on, daily orders grew from 1,000 to 10,000. That is a 10x increase, but the absolute volume is still small and the pressure on the site is modest. When orders go from 100,000 to 1,000,000, and from 1,000,000 to 2,000,000, the growth factor may be no larger, but it is a huge challenge to the entire website architecture.

For context: we passed 1 million daily orders in 2014 and are now at 9 million. The technical team has grown from just over 30 people at the start to more than 900 today. At that scale, division of labor and collaboration become a huge challenge. Splitting and merging services, and splitting and merging teams, need a framework system to support them. This is the other role of the SOA framework.

Looking at our current architecture diagram: the middle is the overall architecture system, and the right side shows the foundations that support servitization, including basic components and services.

Let's talk about languages first. The original website was built on PHP, and we gradually migrated away from it.

The founders were college students starting a business, so Python was a natural first choice. Python is still a fine choice today, but why did we expand to Java and Go?

Many people can write Python, but few write it really well. As the business grows, more developers are needed. Considering Java's mature ecosystem and Go's emerging one, we ended up with a multi-language ecosystem in which Python, Java, and Go coexist.

The WebAPI layer mainly handles common operations unrelated to business logic, such as HTTPS offloading, rate limiting, and security checks.
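To make the rate limiting concrete, here is a minimal token-bucket sketch in Python. The class and its parameters are illustrative assumptions, not Ele.me's actual gateway code:

```python
import time
import threading

class TokenBucket:
    """Minimal token-bucket rate limiter, the kind of check a gateway
    layer can run before a request ever reaches business logic."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill tokens for the elapsed interval, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # caller would typically return HTTP 429

# Illustrative numbers: roughly 100 req/s sustained, bursts up to 200.
limiter = TokenBucket(rate=100, capacity=200)
```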

Service Orchestrator is the service orchestration layer. Driven by configuration, it converts protocols between the internal and external networks and aggregates and trims services.
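As a hedged illustration of configuration-driven orchestration, the sketch below maps one hypothetical external route to several internal RPC calls and trims the merged response to whitelisted fields. All route names, service methods, and the `rpc` client are invented for the example:

```python
# Hypothetical orchestration config: each external route maps to one or
# more internal RPC calls, with the response trimmed to whitelisted fields.
ROUTES = {
    "/v1/order_detail": {
        "calls": ["order.get_order", "user.get_profile", "shop.get_shop"],
        "trim": {  # only these fields are returned to the external caller
            "order.get_order": ["order_id", "status", "total"],
            "user.get_profile": ["nickname"],
            "shop.get_shop": ["name", "rating"],
        },
    },
}

def orchestrate(route: str, params: dict, rpc) -> dict:
    """Fan out to the configured internal services and merge trimmed
    results. `rpc(method, params)` is an assumed internal RPC client."""
    spec = ROUTES[route]
    result = {}
    for call in spec["calls"]:
        raw = rpc(call, params)
        result.update({k: raw[k] for k in spec["trim"][call] if k in raw})
    return result

# Toy usage with a fake RPC client that returns every field:
fake = lambda call, params: {"order_id": 1, "status": "paid", "total": 42,
                             "nickname": "u1", "name": "shop", "rating": 4.8}
print(orchestrate("/v1/order_detail", {"order_id": 1}, fake))
```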

On the right side of the architecture diagram are the auxiliary systems around the service framework, such as the Job system for periodically executing tasks. We have nearly 1,000 services; how do we monitor them all? There has to be a monitoring system. When the team was only 30-odd people, it was fine to log in to the machines and grep the logs, but with more than 900 people you cannot have everyone searching logs on machines, so we need a centralized logging system. The other systems will not be described one by one here.

Rome was not built in a day; infrastructure is an evolutionary process. Our energy is limited, so what should we do first?

2. Service split

When the website grows large, the original structure can no longer keep up with the pace of development. The first things we had to do were:

Split the big Repo into small Repos, split big services into small services, and split our centralized basic services onto different physical machines.

It took more than a year to complete the service split alone. This is a relatively long process.

In this process, the APIs must be well defined first. Once an API is online, the cost of changing it is very high: many people will depend on your API, and often you don't even know who they are. That is a big problem.

Then abstract out the basic services. Many of them were originally coupled inside business code. Take payment: when the business is simple, tightly coupled code doesn't matter much. But when more and more new businesses need payments, should each of them build its own payment function? Obviously not. So we extract these basic services: payment service, SMS service, push service, and so on.
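A minimal sketch of what such an extraction looks like from the caller's side: every business line talks to one shared payment service through a thin client instead of embedding its own payment logic. The interface and method names are illustrative, not Ele.me's real API:

```python
# Before extraction: every business line embedded its own payment logic.
# After extraction: each business calls one shared payment service through
# a thin client. All names here are illustrative.

class PaymentClient:
    def __init__(self, rpc):
        # `rpc` is an assumed SOA-framework stub for the payment service.
        self.rpc = rpc

    def charge(self, order_id: str, amount_cents: int, channel: str) -> dict:
        # One payment implementation serves takeout, membership, and any
        # future business, instead of each re-implementing it.
        return self.rpc("payment.charge", {
            "order_id": order_id,
            "amount_cents": amount_cents,
            "channel": channel,  # e.g. "alipay", "wechat"
        })
```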

Splitting services looks simple and low-value, but it is exactly what we had to do from the start. During this period, almost all other architecture work can be postponed: skipping an architecture adjustment won't kill you, but not splitting the services really will.

Service splitting is necessarily a long process, and in fact a painful one, and it requires a great deal of supporting systems engineering.

3. Publishing system

Publishing is the biggest source of instability. Many companies place strict limits on the release window, for example:

  • Releases are allowed on only two days per week;
  • Releasing on weekends is absolutely forbidden;
  • Releasing during peak business hours is absolutely forbidden;
  • and so on...

We found that the biggest problem with publishing is that there is no simple, executable rollback operation afterwards. And who performs the rollback: the person who released, or a dedicated operator? If it is the releaser, they are not online 24 hours a day; what happens when a problem appears and no one can be reached? If a dedicated person performs rollbacks but there is no simple, unified rollback operation, that person has to be familiar with the releaser's code, which is basically infeasible.

So we need a publishing system. The publishing system defines a unified rollback operation, and all services must comply with it.
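One common way to make rollback a single, unified operation is a fixed release layout where deploy and rollback are both just re-pointing a `current` symlink; a sketch under that assumption (the paths and service name are hypothetical):

```python
import os
import subprocess

RELEASES_DIR = "/opt/app/releases"   # hypothetical layout
CURRENT_LINK = "/opt/app/current"

def deploy(version: str) -> None:
    """Point the `current` symlink at a release directory, then restart.
    Because every service follows the same layout, rollback is the same
    operation for all of them: re-point the symlink."""
    target = os.path.join(RELEASES_DIR, version)
    tmp = CURRENT_LINK + ".tmp"
    os.symlink(target, tmp)
    os.replace(tmp, CURRENT_LINK)  # atomic switch
    subprocess.run(["systemctl", "restart", "app"], check=True)

def rollback(previous_version: str) -> None:
    # A unified rollback: anyone on call can run it without knowing
    # the service's internals.
    deploy(previous_version)
```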

At Ele.me, connecting to the publishing system is mandatory: every system must go through it. The publishing system's framework matters enormously to the company and belongs in the first priority queue.

4. Service Framework

Next comes Ele.me's service framework. Splitting a big Repo into small Repos and a big service into small services makes our services as independent as possible, and this requires a distributed service framework to support it.

The distributed service framework includes service registration, discovery, load balancing, routing, rate limiting, circuit breaking, degradation, and other functions, which will not be covered one by one here. As mentioned before, Ele.me has a multi-language ecosystem, including Python and Java, so our service framework is multi-language too. This influenced our later choice of middleware, such as the DAL layer.
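A toy sketch of the registration/discovery core, with random selection standing in for a real load-balancing policy. A production framework would back this with a coordination store and add health checks, routing, rate limiting, circuit breaking, and degradation on top:

```python
import random

class Registry:
    """Toy service registry illustrating registration, discovery, and
    client-side load balancing. Names and addresses are invented."""

    def __init__(self):
        self.services: dict[str, list[str]] = {}

    def register(self, name: str, address: str) -> None:
        self.services.setdefault(name, []).append(address)

    def discover(self, name: str) -> str:
        # Random choice stands in for a smarter load-balancing policy.
        return random.choice(self.services[name])

registry = Registry()
registry.register("payment.service", "10.0.0.1:8000")
registry.register("payment.service", "10.0.0.2:8000")
addr = registry.discover("payment.service")  # one healthy instance
```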

5. DAL data access layer

When the business volume becomes larger and larger, the database will become a bottleneck.

In the early stage, database performance can be improved by upgrading hardware. For example:

  • Upgrade to machines with more CPUs;
  • Replace the hard drives with SSDs or something higher-end.

But hardware improvement eventually hits a ceiling. Moreover, many business colleagues operate the database directly in their code, and more than once a service went online and immediately knocked the database over. Once the database is down, there is no chance of recovering the business until the database itself is restored.

If the data in the database is intact, the business can actually be compensated afterwards. So when we built the DAL service layer, the first thing we did was rate limiting; everything else could wait. The next thing was connection reuse. Our Python framework uses a multi-process, single-thread plus coroutine model.

Connections cannot be shared across processes. For example: with 10 Python processes on a machine and 10 database connections per process, scaling out to 10 machines gives 1,000 database connections. For a database, connections are very expensive, so our DAL layer must multiplex them.

This is not connection reuse inside the service itself, but multiplexing at the DAL layer: the services may hold 1,000 connections to the DAL, while after multiplexing the database only needs to maintain a dozen or so connections. Once the DAL detects that a request is part of a transaction, it keeps the mapping between that session and its connection; when the transaction ends, the connection goes back into the shared pool for others to use.
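A minimal sketch of that multiplexing idea, assuming connection objects expose an `execute()` method and that transactions are detected from the statement text (a real DAL would inspect the wire protocol instead):

```python
import queue

class MultiplexingPool:
    """Many client sessions share a small pool of real database
    connections. A connection is pinned to a session only while a
    transaction is open, then returned to the shared pool."""

    def __init__(self, connections):
        self.idle = queue.Queue()
        for conn in connections:   # e.g. a dozen real connections
            self.idle.put(conn)
        self.pinned = {}           # session_id -> pinned connection

    def execute(self, session_id, sql):
        conn = self.pinned.get(session_id)
        keep = conn is not None
        if conn is None:
            conn = self.idle.get()  # blocks until a connection is free
        try:
            result = conn.execute(sql)
            head = sql.strip().upper()
            if head.startswith("BEGIN"):
                self.pinned[session_id] = conn     # transaction opens: pin
                keep = True
            elif head.startswith(("COMMIT", "ROLLBACK")):
                self.pinned.pop(session_id, None)  # transaction ends: unpin
                keep = False
            return result
        finally:
            if not keep:
                self.idle.put(conn)  # non-transactional: return at once
```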

Then comes circuit breaking. The database can be circuit-broken too: when the database starts to "smoke" (shows signs of overload), we kill part of the database requests to make sure the database itself does not go down.
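A toy load-shedding sketch along those lines: when the recent average latency crosses a threshold, a fraction of requests is rejected before reaching the database. All thresholds are illustrative:

```python
import random
import time

class DbBreaker:
    """Toy shedder: when recent database latency crosses a threshold
    (the database starts to 'smoke'), reject a fraction of requests so
    the database itself never falls over."""

    def __init__(self, latency_limit=0.5, shed_ratio=0.3):
        self.latency_limit = latency_limit  # seconds, illustrative
        self.shed_ratio = shed_ratio        # share of requests to kill
        self.recent = []                    # sliding window of latencies

    def run(self, query_fn):
        smoking = (len(self.recent) >= 20 and
                   sum(self.recent) / len(self.recent) > self.latency_limit)
        if smoking and random.random() < self.shed_ratio:
            raise RuntimeError("shed by DAL circuit breaker")
        start = time.monotonic()
        try:
            return query_fn()
        finally:
            self.recent.append(time.monotonic() - start)
            self.recent = self.recent[-100:]  # keep the last 100 samples
```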

6. Service Governance

After the service framework come the issues of service governance, which is a big topic in itself. The first step is instrumentation: you need to plant a great many monitoring points.

For example, for each request: did it succeed or fail, and what was its response time? All these indicators go into the monitoring system. We have a large monitoring wall with many indicators, and a dedicated team watches it around the clock, 7x24; if any curve fluctuates, someone is found to deal with it. The other part is the alerting system: a monitoring wall can only display a limited number of very important key indicators, so an alerting system is needed on top.
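A minimal sketch of that kind of instrumentation: a decorator that records success, failure, and latency for every call, tagged with the host name (which also points toward the per-machine breakdown discussed later). The `emit` function is a stand-in for a real metrics client:

```python
import functools
import socket
import time

def emit(metric: str, value: float, tags: dict) -> None:
    # Stand-in for shipping a data point to the monitoring system.
    print(metric, value, tags)

def instrumented(func):
    """Record success/failure and response time for every call, tagged
    by host so metrics can later be broken down per machine."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        tags = {"func": func.__name__, "host": socket.gethostname()}
        start = time.monotonic()
        try:
            result = func(*args, **kwargs)
            emit("request.success", 1, tags)
            return result
        except Exception:
            emit("request.failure", 1, tags)
            raise
        finally:
            emit("request.latency", time.monotonic() - start, tags)
    return wrapper

@instrumented
def get_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "ok"}
```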

Rome was not built in a day, and infrastructure is an evolutionary process. Our resources and time are always limited; as architects and CTOs, how do we produce the most important things with such limited resources?

We have built many systems and might feel we have done a very good job, but in fact we have not. It feels as if we are back in the Stone Age: the problems and the demands keep multiplying, it always seems something is missing from the system, and there is a long list of functions we would like to add.

Take the rate-limiting system: we still require users to configure a concurrency number. Does that number really need user configuration at all? Could the concurrency be controlled automatically, based on the state of the service itself?
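One way such self-tuning could work is an AIMD-style limiter that grows the concurrency limit while latency stays healthy and halves it when latency degrades; a single-threaded sketch with illustrative constants:

```python
class AdaptiveLimit:
    """AIMD-style sketch: grow the concurrency limit while observed
    latency stays healthy, halve it when latency degrades, so no one
    has to hand-configure a concurrency number."""

    def __init__(self, target_latency=0.2, initial=10):
        self.target = target_latency  # seconds, illustrative
        self.limit = initial
        self.in_flight = 0

    def try_acquire(self) -> bool:
        if self.in_flight >= self.limit:
            return False              # reject: over the current limit
        self.in_flight += 1
        return True

    def release(self, latency: float) -> None:
        self.in_flight -= 1
        if latency <= self.target:
            self.limit += 1                        # additive increase
        else:
            self.limit = max(1, self.limit // 2)   # multiplicative decrease
```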

Then there is the upgrade path. SDK upgrades are very painful: our service framework 2.0 shipped last December, yet some teams are still on 1.0. Can SDK upgrades be made lossless, so that we control the timing and rhythm of upgrades ourselves?

Also, our current monitoring only aggregates at the service level; it is not broken down by cluster or machine. Will future metrics be broken down that way? The simplest example: a service runs on 10 machines and only one of them has a problem, but its impact is diluted across the aggregate of all 10. You only see the overall service latency rise, while in fact a single machine may be dragging down the whole cluster. For now we cannot monitor at these finer dimensions.

There is also intelligent alerting. Alerts need to be fast, complete, and accurate. We can already be fast and fairly complete; how do we become more accurate? At peak hours, more than 1,000 alerts fire every minute. Are all thousand of them useful? Too many alerts are as good as none: everyone gets fatigued and stops looking. How do we distinguish alerts more accurately? Is there smarter link analysis? In the future, perhaps monitoring should hold not raw indicators but link analysis, so that we can know exactly which node a problem corresponds to.

These questions reflect one principle of our work: once the essentials are in place, we must plan ahead and prepare for a rainy day.

