How to practice the 'four kings' of self-revolution
The Operation and Maintenance Baijia Forum uses interviews and invited articles to bring in veterans of the operation and maintenance field, so that they can share their insights and exchange views, in the hope of forming some forward-looking consensus that helps the industry move ahead.
In this issue, we invited Wang Mingsong. Boss Wang put forward the "Four Rules" for cloud native application practice, which are widely recognized in the industry. Starting in 2019, all of the IDC business at Boss Wang's company was moved to the cloud. The scale is not small, but the SRE team is very small, a bit like Netflix. In this issue, let's take a look at how a senior cloud operation and maintenance practitioner works.
This is the 7th issue of the down-to-earth yet high-level "Operation and Maintenance Forum". Let's start!
Before we begin, could Boss Wang introduce himself? Please tell us about your work history, especially your experience using the cloud, to give readers some background.
Around 2005, I did BBS operation and maintenance at school, which counts as my introduction to the field. After graduation, I switched fields and joined a major Internet company that is now in decline (Editor's note: referring to Baidu), starting out in operation and maintenance at the P1 level. In 2010, I left and joined a mobile Internet start-up. There I did basically everything, from network cabling to computer room IT. The server procurement cycle was rather long even for a small company, so I started to consider using the cloud at that time.
Starting in 2011, I used Sugon Cloud for a while, which is based on VMware. The experience was very poor. From my personal point of view, neither the usability nor the economics were good. The only advantage was that provisioning a machine might be faster than in an IDC. The network was also odd and caused a lot of trouble. During the same period I also used Shanda Cloud for a while. The experience was better than Sugon, but it was really at the level of a VPS, and it felt as if the VPC layer had not been built. I did not dare to put anything too important on it, and after going back and forth several times I stopped using it (maybe I was not using it the right way, and it was not easy to monitor).
I started using UCloud in 2013, mainly for virtual machines and not much else. The VPC product should already have been available by then, and some important businesses were moved onto it.
In 2014, I started using AWS because I began doing overseas business. In 2019, all IDC businesses were migrated to the cloud.
I first got to know Boss Wang through a discussion in a WeChat group. Boss Wang proposed four cloud native application practices and argued that as long as these four are implemented, an application is basically cloud native. The group members strongly agreed and named them "Wang Si Tiao" (Wang's Four Rules). Could Boss Wang share the essence of "Wang Si Tiao" with SRETalk readers?
I have put the detailed version of the cloud native "Wang Si Tiao" in the Swedish Ma Gong repo (https://github.com/lipingtababa/cloud-native-best-practices). Everyone is welcome to raise issues, and I will update the rules from time to time.
The brief version is:
These four rules basically revolve around making the application stateless and keeping its data safe, while taking cost, performance, and reliability into account. They are not limited to cloud computing. Traditional IDC environments can also use them as a reference.
Editor's note: This brief version may not look like much, but it actually covers a lot. I suggest reading the detailed version. If you can't open the cloud native "Wang Si Tiao" link, go to the repo above and look for the cloud native "Wang Si Tiao" .md file there.
The "Four Rules" list some best practices that require the cooperation of R&D. Did you run into obstacles when implementing them inside the company? How did you resolve them?
I encountered almost no obstacles, but that was because of our particular circumstances.
On the one hand, we had no choice but to go to the cloud, and cost control was a hard target, so there was no gentler compromise route to choose from.
We are a new company spun off from a team, so we were given only one year to make the transition. The goal set by the management was to seamlessly migrate the profitable businesses running smoothly on the existing thousands of machines. Because we were only doing overseas business at the time, we did not consider non-cloud solutions at all. However, the management still required that cloud costs be lower than the previous IDC costs.
If the original architecture had been moved to the cloud as it was, the cost target set by the management would definitely not have been met (Boss Feng of Pigsty has written many articles along these lines, arguing for the cost advantage of traditional IDC over the cloud). So there was only one choice at the time: transform the existing architecture to fit the cloud, so that the goals for cost, performance, and stability could be met after migration.
On the other hand, we let R&D participate fully in solution selection and cost optimization, so that everyone could reach a consensus.
I started selecting a public cloud about a year in advance, attended training specifically to learn how to use the cloud better, and gradually formed my own methodology. Before the migration, I also took the key R&D members to the relevant training. After the training, they could see that many of my practices were correct. In addition, during the actual migration, AWS provided more professional solution designs. As a result, it was relatively easy to implement the content of the "Four Rules". For example:
1. Storing data in EBS is very expensive, so storing data in S3 is a very economical choice. Through training and comparison of various solutions, R&D came to understand this clearly, so they were more willing to modify their programs.
2. Using a Role is a security requirement. Because the AWS SDK supports it very well, there is no difference in difficulty between using a Role and using an AK/SK when you first get started. As long as it is enforced from the beginning, it is not a problem for R&D.
3. Regarding managed services, R&D does not actually care whether a service is run by operation and maintenance or is an existing managed service. As long as we in operation and maintenance can let go of our own attachment, that is enough.
4. Data should not be stored on the server. Here we did go through a fairly significant adjustment period.
Our migration this time was from an IDC environment with complete platform support to AWS. With the help of AWS partners, the new architecture was designed in accordance with AWS best practices while still satisfying previous usage habits and requirements.
But because of the redesign, there were still differences in usage. Because ASG is used, a server is terminated outright during scale-in or failover. Any persistent data stored on it is gone. After experiencing this, R&D could basically accept that online business data does not live on the server.
Because of this design, our server storage requirements can be kept as small as possible. Anything over 100G requires my approval. This saved a lot of EBS costs.
Later, when the R&D team was deploying K8s, they understood this even more deeply. After all, data in a container will be lost.
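The point about ASG and containers can be illustrated with a minimal Python sketch. It is not the interviewee's code: `ExternalStore` is a hypothetical in-memory stand-in for durable external storage such as an object store, and `Instance` is a hypothetical stand-in for a server in a scaling group. It only shows that state written to instance-local storage disappears when the instance is replaced, while state written to the external store survives.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExternalStore:
    """Hypothetical stand-in for durable external storage (e.g. an object store)."""
    objects: dict = field(default_factory=dict)

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def get(self, key: str) -> Optional[bytes]:
        return self.objects.get(key)


@dataclass
class Instance:
    """Hypothetical stand-in for a server that a scaling group may terminate at any time."""
    name: str
    store: ExternalStore
    local_disk: dict = field(default_factory=dict)

    def save_locally(self, key: str, data: bytes) -> None:
        # Data here lives and dies with this instance.
        self.local_disk[key] = data

    def save_externally(self, key: str, data: bytes) -> None:
        # Data here outlives the instance.
        self.store.put(key, data)


def replace_instance(old: Instance) -> Instance:
    """Simulate scale-in or failover: the old server and its local disk are discarded."""
    return Instance(name=old.name + "-replacement", store=old.store)


def demo():
    store = ExternalStore()
    server = Instance("web-1", store)
    server.save_locally("session", b"local state")
    server.save_externally("order", b"durable state")

    new_server = replace_instance(server)
    # The replacement starts with an empty local disk but sees the same external store.
    return new_server.local_disk.get("session"), new_server.store.get("order")
```

Calling `demo()` returns the locally saved value as missing and the externally saved value intact, which is the behavior R&D has to design for once servers can be terminated at any time.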
Recently, some articles have argued, after weighing ROI comprehensively, that moving off the cloud is more cost-effective, such as the article by the creator of RoR and the interview with Mr. Zou of Tuyou Game in the last issue of the Operation and Maintenance Baijia Forum. You seem more inclined to use the cloud in depth. Could you share your thinking with everyone?
I have always advocated "best practices", but I have also told everyone that "best practices are the art of rulership for investors or management". Adopting best practices may well mean sacrificing your own job and many other people's jobs to reach the optimum. If you can reach the optimum without destroying jobs, you have more options.
Whether to move to the cloud or off the cloud depends on your interests, the strength of your management's support, and your historical baggage. If I were in the position of Mr. Zou or DHH, I might not hold my current views. I can stick with the cloud for the following reasons.
On the one hand, the management agrees. The management has suffered losses from idle assets, and I spent a long time optimizing idle IDC resources. On top of that, building our own computer room overseas is not particularly easy. Going to the cloud is basically the only solution the management supports. On the other hand, as mentioned above, our architecture happened to be completely transformed, with the transformation cost supported by the management, so we can take full advantage of the cloud. Finally, our business does not yet include long-term, stable, high-load, stateless workloads, which are the kind better suited to traditional IDC.
I believe that the cost for Mr. Zou or DHH to transform their existing system architecture is too high. Even if it could reduce the labor cost of the operation and maintenance department, it might be hard to get support, because it also involves the interests of other departments.
But for a new company and a new project, I believe there is nothing more suitable than the cloud. Choose a suitable cloud vendor and use a cloud-native architecture to implement the business, so that the entire business is elastic in both performance and cost.
Many friends complain about cloud vendors fleecing existing customers, vendor lock-in, and the like. But from the perspective of investors or management, everything is a factor for achieving business profitability. People, cloud, and IDC are all such factors. To run a business, investors must not only pay for these factors but also be able to obtain factors that meet their needs in a timely manner (this matters more). Obtaining the cloud could not be easier. Product quality and price are relatively standardized, you can buy it with a few clicks, you pay on demand, and you can stop using it at any time. But what about people? People are hard to obtain, their quality is hard to judge and not standardized, and their price fluctuates (salary increases). You cannot lay people off casually, and when someone leaves you cannot replace them with an identical person. People can be very creative, but for standardized, mechanical, tedious work, people can never compete with machines, let alone SaaS services.
As for Mr. Zou's situation, if their business team is unwilling to transform the program architecture, his current choice is the best practice: for stable, high-load business, choose an IDC with a cost advantage and rent machines instead of purchasing them; migrate elastic business to the cloud.
For 37signals' Basecamp, the product's pricing model makes migrating to the cloud somewhat troublesome. Most SaaS services today charge by usage or number of users, but Basecamp mainly sells an unlimited plan for only $199 a month. This pricing model means they cannot fully use the elasticity of the cloud to make a profit and can only oversell low-priced resources. Unless this pricing model changes, no matter how the architecture is optimized, it may not be suitable for the cloud.
There was a recent article "The future of operation and maintenance is platform engineering". Do you agree with this view? What roles and boundaries does your team have in platform engineering? How do you plan for so-called platform engineering (especially in a multi-cloud environment)?
Is that the one by Ruan Yifeng or the one by Charity Majors? I had not read either article before and only skimmed them just now. I don't fully agree, and personally I would not try to do internal platform engineering.
First, what I disagree with: that article misunderstands some concepts.
To begin with, DevOps is not a position. I have tried to understand it for a long time, and my conclusion is that it is a development model. The core of this development model is R&D, and every element must serve efficient R&D iteration. The article first treats DevOps as a position and then claims that this position exists for business development. I think both are inappropriate understandings.
Secondly, operation and maintenance still has a lot to explore. Transformation is not a new topic. Everyone has long understood that operation and maintenance is a sunset industry. Over the past ten years or so, many operation and maintenance engineers have been trying to transform and find their next step. Some have moved into CI/CD, some into monitoring development, some into building automated operation and maintenance platforms, some into new fields (such as K8s, big data, AI, and cloud computing), and some into other specialties (such as DBA and network security).
It can be seen that many of these transformations serve the DevOps development model.
Platform engineering may be one way to implement it, but given the product sense and R&D level of an operation and maintenance team, I am afraid that doing platform engineering on our own would only amuse ourselves. Even stability could not be guaranteed, which would only add to the burden and perhaps to the blame. However, if a more professional product and R&D team is brought in to do it, then on the one hand it looks like a distraction from the proper business, unrelated to main business income, and it will be hard to get support. On the other hand, it is just a platform for internal use. It is not economical to hire so many people to build a product with no income, which makes support even harder to get. What's more, this approach gives the existing operation and maintenance staff no sense of participation, so it cannot count as a transformation.
So I think the right approach is to use mature platforms and tools (open source or paid, self-hosted or SaaS). You can do some customization and secondary development on top of these platforms, but don't reinvent the wheel.
Finally, my understanding of the platform also differs from that article's.
On the one hand, the platform itself can be delivered as SaaS, with no need for secondary integration. The main problem now is that the domestic SaaS environment is poor: software services pay little attention to integrating and being compatible with each other, and prefer to be large and all-encompassing. Looking overseas, there are many SaaS products and software in niche fields that are very good and integrate with other software. The ecosystem is healthy, so integration is mostly a matter of configuration, and there is little secondary development work.
On the other hand, the users of the platform should be R&D, and R&D should be able to use it directly, without operation and maintenance having to relay or approve requests.
So in the future we really do need a platform, but one built by professional product and R&D teams, not a toy we make ourselves, and one that the product and R&D teams use directly, without operation and maintenance acting as a relay in the middle.
So for platform engineering, I choose to actively use mature software or SaaS services and to make them directly available to the product and R&D teams as much as possible.
Operation and maintenance only sets up the necessary checkpoints around cost and security, enforcing them through policies, permissions, and audits to ensure that the product and R&D teams use the services correctly.
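The idea of operation and maintenance keeping only cost and security checkpoints might look like the following Python sketch. It is an illustration under assumptions, not the team's actual tooling: `ResourceRequest`, `review`, and the two rules are hypothetical. The 100 GB threshold comes from the approval rule mentioned earlier in this interview, and the credential rule mirrors the stated preference for a Role over an AK/SK.

```python
from dataclasses import dataclass

# Threshold quoted in the interview: anything larger needs explicit approval.
MAX_DISK_GB_WITHOUT_APPROVAL = 100


@dataclass(frozen=True)
class ResourceRequest:
    """Hypothetical self-service request made directly by an R&D team."""
    team: str
    disk_gb: int
    uses_static_access_key: bool


def review(request: ResourceRequest) -> list:
    """Return the checkpoints a request trips; an empty list means no approval is needed."""
    findings = []
    if request.disk_gb > MAX_DISK_GB_WITHOUT_APPROVAL:
        findings.append("cost: disk larger than 100 GB needs approval")
    if request.uses_static_access_key:
        findings.append("security: use a role instead of a static access key")
    return findings
```

A request such as `ResourceRequest("payments", 50, False)` passes with no findings, so the team proceeds without waiting for anyone. Only requests that cross a cost or security line are routed to a person.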
Under Boss Wang's working model, it seems that only very senior people are needed. Newcomers are too junior to act as coaches for R&D, but without newcomers this is not a long-term solution. Could you share how you build your talent pipeline?
This is a good question, because I haven't solved it either. But it is not a problem specific to this working model.
Many companies and many kinds of jobs need senior talent, and they all face the same problem I have now. What kind of work does not need senior talent? I think it is work whose content is already highly standardized, where the company's requirements are not high, and where anyone given clear instructions can do it well. Even machines can do it.
Mr. Zou has a saying that traditional operation and maintenance is similar to cleaning: the work is important but its value is not high. I quite agree with this statement. This is the dilemma operation and maintenance faces now. So should the cleaning team develop its own cleaning tools or buy them?
Because I use a large number of mature products and external services, I can deliver the cleaning work more consistently, like a cleaner using various automatic and semi-automatic cleaning tools. You no longer need to worry that someone's lack of skill leaves the floor poorly mopped, or that a lack of professionalism leads to a quick sweep before calling the job done. Although operating these tools well takes a bit more learning and is a bit harder than using traditional tools, there are fewer SOPs overall than before, because mature tools hide the details.
So we should not waste time on low-value work. This type of work should be handled with professional software or SaaS, which has economies of scale, good functionality, and SLAs. We should focus our work on the areas that business, management, and investors care about more.
We know that Boss Wang has always been an advocate of "operation and maintenance self-revolution", which goes against human nature. Can you talk about the thinking behind this?
The fact we see now is that operation and maintenance is not a booming industry. Most enterprises do not have a huge operation and maintenance department to support their systems. They may need only one person, who is also responsible for IT, network management, security, and other tasks. We have little room for advancement. There are very few operation and maintenance directors, and operation and maintenance manager is basically the ceiling. With the number of people I manage now, you could call me an operation and maintenance director.
The industry is in the same state. A large number of training courses turn out quickly trained operation and maintenance staff, who are plentiful and cheap. Mid-to-high-end operation and maintenance people are rare. Unlike network engineers or DBAs, our technology stack is very mixed, and there is no authoritative certification of our capabilities. This hinders our career planning and the formation of a healthy talent market. So the market actually positions operation and maintenance as odd-job work. "The technical people who do not write business code" may be our most accurate positioning.
According to the concept of DevOps, we should speed up business delivery and provide good services, rather than adding trouble for the product and R&D teams. But the meaning and work of operation and maintenance are not just about DevOps. This is where my views differ from many others.
On the one hand, operation and maintenance is the "watchdog" of the company's digital assets. From this perspective, operation and maintenance represents the interests of management and investors: it safeguards the company's digital assets, ensures they are used correctly, meets various regulatory requirements, and takes part in internal audits. It is the management's check and balance on the product and R&D teams. This is actually the original meaning of operation and maintenance.
On the other hand, the state is handing us work. Regulatory requirements are becoming increasingly strict. Whether it is network security, data security, or personal information protection, dedicated personnel are required to be responsible for the related work. In small enterprises, these tasks must be taken on by operation and maintenance alongside their regular work, especially data security, where the operation and maintenance staff who directly manage digital assets must be involved. This is what the new era requires of operation and maintenance.
So once you think this through, you will find that DevOps and platforms are only a small part of operation and maintenance work. We should free ourselves from these entanglements, unbind both ourselves and the product and R&D teams, and do our management and oversight role well.