Hello everyone, I am Bucai Chen~
YouTube is the second most popular website after Google. In May 2019, more than 500 hours of video content were uploaded to the platform every minute.
The video sharing platform has more than 2 billion users, with more than 1 billion hours of video played every day, generating billions of views. These are incredible numbers.
This article will provide an in-depth explanation of the database and back-end data infrastructure used by YouTube, which allows the video platform to store such huge amounts of data and scale to billions of users.
Then let’s get started.
The YouTube journey began in 2005. As the venture capital-funded technology startup continued to find success, it was acquired by Google in November 2006 for $1.65 billion.
Before being acquired by Google, their team consisted of the following people:
YouTube's backend microservices are written in Python, C, C++, Java (using the Guice framework), and Go. The user interface is written in JavaScript.
The main database is MySQL, backed by Vitess, a database clustering system for scaling MySQL horizontally. In addition, Memcache is used for caching and ZooKeeper for node coordination.
Popular videos are served through a CDN, while general, less-played videos are fetched from a database.
When a video is uploaded, it is given a unique identifier and processed by a batch job. This job runs multiple automated processes, such as generating thumbnails, metadata, and video transcripts, encoding the video, setting its monetization status, and more.
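The upload flow described above can be pictured as a small pipeline. This is only an illustrative sketch: the ID format, the step functions, and the field names are all hypothetical and are not taken from YouTube's actual implementation.

```python
import secrets


def new_video_id() -> str:
    # Hypothetical ID scheme: 8 random bytes encoded as 11 URL-safe characters.
    return secrets.token_urlsafe(8)


# Each step stands in for one automated process run by the batch job.
def generate_thumbnail(video: dict) -> None:
    video["thumbnail"] = f"{video['id']}.jpg"


def extract_metadata(video: dict) -> None:
    video["metadata"] = {"source_file": video["filename"]}


def set_monetization_status(video: dict) -> None:
    video["monetized"] = False  # placeholder default, not a real policy


PIPELINE = [generate_thumbnail, extract_metadata, set_monetization_status]


def process_upload(filename: str) -> dict:
    """Assign a unique identifier, then run every processing step in order."""
    video = {"id": new_video_id(), "filename": filename}
    for step in PIPELINE:
        step(video)
    return video


print(process_upload("cat.mp4"))
```

Keeping the steps in a list makes it easy to add further stages, such as transcoding, without touching the upload entry point.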
The VP9 and H.264/MPEG-4 AVC (Advanced Video Coding) codecs are used for video compression and are capable of encoding HD and 4K quality video using half the bandwidth of other codecs.
Video streaming uses Dynamic Adaptive Streaming over HTTP (DASH), an adaptive bitrate streaming technology that delivers high-quality video from conventional HTTP web servers. With this technology, content can be served to viewers at different bitrates. The YouTube client automatically adapts video rendering to the viewer's internet connection speed to minimize buffering times.
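To make the idea of adaptive bitrate selection concrete, here is a minimal sketch of how a client might choose a rendition from its measured throughput. The bitrate ladder and the safety factor are assumptions made up for illustration; real players use more elaborate heuristics, and nothing here reflects the actual YouTube client.

```python
# Hypothetical bitrate ladder in kbit/s; real ladders differ per video.
BITRATE_LADDER_KBPS = [300, 750, 1500, 3000, 6000]


def pick_bitrate(measured_kbps: float, safety_factor: float = 0.8) -> int:
    """Return the highest rendition that fits within a fraction of the
    measured throughput, falling back to the lowest rendition."""
    budget = measured_kbps * safety_factor
    candidates = [b for b in BITRATE_LADDER_KBPS if b <= budget]
    return max(candidates) if candidates else min(BITRATE_LADDER_KBPS)


for throughput in (200, 1000, 4000, 10000):
    print(throughput, "kbit/s ->", pick_bitrate(throughput), "kbit/s")
```

The safety factor leaves headroom so that a small drop in connection speed does not immediately cause buffering.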
I once discussed YouTube's video transcoding process in a dedicated article, see "How YouTube provides high-quality videos with low latency".
So, here is a quick introduction to the back-end technology of the platform. The main database used by YouTube is MySQL. Now, let's find out why YouTube's engineering team felt the need to write Vitess. What problems did they face with their original MySQL environment that led them to implement an additional framework on top of it?
The website initially has only one database instance. As the website grows, developers have to scale the database horizontally to meet the increasing QPS (queries per second) requirements.
Replicas are added to the master database instance. Read requests are routed to both the master and the replicas, which reduces the load on the master. Adding replicas helps alleviate bottlenecks, increase read throughput, and increase the durability of the system.
The master node handles write traffic, and the master node and replica node handle read traffic at the same time.
However, in this scenario, it is possible to read stale data from a replica. If a request reads from the replica before the master has propagated an update to it, the viewer will get stale data.
At this time, the data of the primary node and the replica node are inconsistent. In this case, the inconsistent data is the number of views of a specific video on the primary and replica nodes.
Actually, this is no problem at all. Viewers won’t mind a slight inconsistency in view counts, right? What's more, the video can be rendered in their browser.
The data between the master node and the replica node will eventually be consistent.
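The stale-read situation can be reproduced with a toy model of asynchronous replication. The classes below are purely illustrative (one master, one replica, an in-memory change log) and are not how MySQL replication is implemented.

```python
class Primary:
    def __init__(self):
        self.data = {}
        self.log = []  # changes not yet shipped to the replica

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, changes):
        for key, value in changes:
            self.data[key] = value


def replicate(primary, replica):
    """Ship all pending changes to the replica. In a real system this
    happens asynchronously, some time after the write."""
    replica.apply(primary.log)
    primary.log.clear()


primary, replica = Primary(), Replica()
primary.write("video:42:views", 100)
replicate(primary, replica)

primary.write("video:42:views", 101)      # new view lands on the master
print(replica.data["video:42:views"])     # replica still serves the old count
replicate(primary, replica)
print(replica.data["video:42:views"])     # now consistent with the master
```

Between the second write and the second call to `replicate`, a read from the replica returns the older view count; once replication catches up, both nodes agree again.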
So the engineers were very happy and the audience was also very happy. With the introduction of replicas, things are progressing smoothly.
The website continues to be popular and QPS continues to rise. The master-slave replication strategy now has difficulty keeping up with the growth of website traffic.
What should we do now?
The next strategy is to shard the database. Sharding is one of the ways to scale a relational database, alongside master-slave replication, master-master replication, federation, and denormalization.
Database sharding is not a simple process. It greatly increases the complexity of the system and makes management more difficult.
However, the database must be sharded to meet the growth of QPS. After developers shard the database, the data is spread across multiple machines. This increases the write throughput of the system. Now, instead of just one master instance handling writes, write operations can occur across multiple sharded machines.
At the same time, a separate replica is created for each shard for redundancy and throughput.
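As a rough illustration of how rows can be spread across machines, the sketch below routes each key to a shard with a stable hash. The shard count and the choice of hash-modulo routing are assumptions for this example; they are not a description of how YouTube or Vitess assigns data to shards.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(video_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard using a stable hash. Python's built-in hash()
    is randomized per process for strings, so it is unsuitable here."""
    digest = hashlib.sha256(video_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Writes for different videos land on different machines.
for vid in ("video-a", "video-b", "video-c", "video-d"):
    print(vid, "-> shard", shard_for(vid))
```

One known drawback of plain modulo routing is that changing the shard count moves most keys, which is part of why sharding adds so much operational complexity.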
The popularity of the platform continues to rise, with large amounts of data being added to the database by content creators.
In order to prevent data loss or service unavailability caused by machine failure or unknown external events, it is necessary to add disaster management functions to the system.
Disaster management refers to emergency measures in the face of power outages and natural disasters (such as earthquakes and fires). The system needs redundancy, with user data backed up to data centers in different geographical regions of the world. Loss of user data or service unavailability is not acceptable.
Having multiple data centers around the world also helps YouTube reduce system latency, as user requests are routed to the nearest data center instead of being routed to origin servers located on different continents.
Now, you can imagine how complex the infrastructure can become.
Unoptimized full table scans can often bring down the entire database. Databases must be protected from bad queries, and all servers need to be monitored to ensure efficient service.
Developers need a system that abstracts the complexity of the system, allows them to solve scalability challenges, and manages the system with minimal cost. All this led YouTube to develop Vitess.
Vitess is a database clustering system that runs on top of MySQL and enables MySQL to scale horizontally. It has built-in sharding features that allow developers to scale the database without having to add any sharding logic to the application. This is similar to what NoSQL databases do.
Vitess also handles failover and backup automatically. It manages servers and improves database performance by intelligently rewriting resource-intensive queries and implementing caching. In addition to YouTube, the framework is also used by other well-known players in the industry, such as GitHub, Slack, Square, New Relic, etc.
Vitess comes into play when you need support for ACID transactions and strong consistency, and at the same time want to quickly scale a relational database like a NoSQL database.
At YouTube, each MySQL connection has a 2MB overhead. Each connection also has a computational cost, and as the number of connections increases, additional RAM must be added.
Vitess is able to manage these connections at a very low cost through a connection pool built on the Go programming language’s concurrency support. It uses Zookeeper to manage the cluster and keep it up to date.
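The general idea of connection pooling can be sketched as follows. Vitess implements its pooling in Go; this Python version with a stand-in `FakeConnection` class is only a hypothetical illustration of the technique, not Vitess code.

```python
import queue
from contextlib import contextmanager


class FakeConnection:
    """Stand-in for a real MySQL connection (hypothetical)."""

    def __init__(self, conn_id):
        self.conn_id = conn_id

    def execute(self, sql):
        return f"conn {self.conn_id} ran: {sql}"


class ConnectionPool:
    def __init__(self, size):
        # A fixed number of connections is opened once and then reused.
        self._idle = queue.Queue(maxsize=size)
        for i in range(size):
            self._idle.put(FakeConnection(i))

    @contextmanager
    def connection(self, timeout=None):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back


pool = ConnectionPool(size=2)
with pool.connection() as conn:
    print(conn.execute("SELECT 1"))
```

At the 2MB-per-connection figure quoted above, 10,000 clients each holding their own connection would need roughly 20GB of RAM, whereas the same clients sharing a pool of 100 connections would need about 200MB. The client and pool counts here are made-up numbers for the sake of the arithmetic.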
Vitess is cloud native and is well suited for cloud deployment because, like the cloud model, capacity is gradually added to the database. It can run as a Kubernetes-aware, cloud-native distributed database.
At YouTube, Vitess runs in a containerized environment and uses Kubernetes as the container orchestration tool.
In today's computing era, every large-scale service runs in the cloud in a distributed environment. There are many benefits to running services in the cloud.
Google Cloud Platform is a set of cloud computing services based on the same infrastructure used by Google's internal end-user products such as Google Search and YouTube.
Every large-scale online service has a polyglot persistence architecture because no one data model, whether relational or NoSQL, can handle all usage scenarios of the service.
In research for this article, I was unable to find a list of specific Google Cloud databases used by YouTube, but I am pretty sure it uses GCP-specific products such as Google Cloud Spanner, Cloud SQL, Cloud Datastore, Memorystore, etc. to run different features of the service.
This article details the databases used by other Google services, such as Google Adwords, Google Finance, Google Trends, etc.
YouTube uses Google's global network for low-latency, low-cost content delivery. With globally distributed edge points of presence (POPs), it enables customers to obtain data faster without having to fetch it from the origin server.
So, so far, I have talked about the databases, frameworks and technologies used by YouTube. Now, it's time to talk about storage.
How does YouTube store such a huge amount of data (500 hours of video content uploaded every minute)?
Videos will be stored on hard drives in Google data centers. This data is managed by Google File System and BigTable.
Google File System (GFS) is a distributed file system developed by Google for managing large-scale data in distributed environments.
BigTable is a low-latency distributed data storage system built on the Google File System, used to process petabytes of data distributed across thousands of machines. It is used in more than 60 Google products.
Therefore, the video is stored on the hard drive. Relationships, metadata, user preferences, profile information, account settings, related data needed to get the video from storage, etc. are all stored in MySQL.
Google data centers use homogeneous hardware, and the software, built in-house, manages thousands of independent server clusters.
The servers Google deploys to expand the storage capacity of its data centers are all commodity servers, also known as commercial off-the-shelf servers. These servers are inexpensive, widely available, and purchased in large quantities, so identical hardware in the data center can be replaced or configured at minimal cost.
As the need for additional storage increases, new commodity servers will be plugged into the system.
When problems occur, commodity servers are usually replaced rather than repaired. They are not custom-made, and using them allows businesses to reduce infrastructure costs significantly compared to running custom-made servers.
YouTube requires over a petabyte of new storage every day. Spinning hard drives are the primary storage medium due to their low cost and high reliability.
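Combining the two figures quoted in this article gives a feel for the scale. The short calculation below treats "over a petabyte" as exactly one decimal petabyte and assumes both figures refer to the same period, which the article does not confirm, so the result is only a rough lower bound.

```python
HOURS_UPLOADED_PER_MINUTE = 500     # figure quoted in the article (May 2019)
NEW_STORAGE_PER_DAY_BYTES = 1e15    # "over a petabyte", taken as exactly 1 PB

hours_per_day = HOURS_UPLOADED_PER_MINUTE * 60 * 24
gb_per_uploaded_hour = NEW_STORAGE_PER_DAY_BYTES / hours_per_day / 1e9

print(f"{hours_per_day:,} hours of video uploaded per day")
print(f"about {gb_per_uploaded_hour:.2f} GB of new storage per uploaded hour")
```

That works out to 720,000 hours of video per day and roughly 1.39GB of new storage for each uploaded hour.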
Solid-state drives (SSDs) have higher performance than spinning disks because they are based on semiconductors, but using them on a large scale is not cost-effective.
They are quite expensive and prone to losing data over time. This makes them unsuitable for storage of archived data.
In addition, Google is developing a new series of disks suitable for large-scale data centers.
There are five key metrics that can be used to judge the quality of hardware built for data storage: