How does YouTube save huge video files?-Python Tutorial-php.cn

Home

Backend Development

Python Tutorial

How does YouTube save huge video files?

PHPz

Apr 10, 2023 am 11:21 AM

documentvideostorage

Hello everyone, I am Bucai Chen~

YouTube is the second most popular website after Google. In May 2019, more than 500 hours of video content were uploaded to the platform every minute.

The video sharing platform has more than 2 billion users, with more than 1 billion hours of video played every day, generating billions of views. These are incredible numbers.

This article will provide an in-depth explanation of the database and back-end data infrastructure used by YouTube, which allows the video platform to store such huge amounts of data and scale to billions of users.

Then let’s get started.

1. Introduction

The YouTube journey began in 2005. As the venture capital-funded technology startup continued to find success, it was acquired by Google in November 2006 for $1.65 billion.

Before being acquired by Google, their team consisted of the following people:

Two system administrators
Two scalability software architects
Two Feature Developers
Two Network Engineers
One DBA

2. Backend Infrastructure

YouTube The backend microservices are written in Python, database, hardware, Java (using the Guice framework) and Go. The user interface is written using JavaScript.

The main database is MySQL supported by Vitess. Vitess is a database cluster system used for horizontal expansion of MySQL. In addition, use Memcache for caching and Zookeeper for node coordination.

How does YouTube save huge video files?

Popular videos are served through a CDN, while general, less-played videos are fetched from a database.

When each video is uploaded, it will be given a unique identifier and will be processed by a batch job. This job will run multiple automated processes, such as generating thumbnails, metadata, and videos. Scripting, coding, setting monetization status, and more.

VP9 & H.264/MPEG-4 AVC Advanced Video Coding codecs are used for video compression and are capable of encoding HD and 4K quality video using half the bandwidth of other codecs.

Video streaming uses Dynamic Adaptive Streaming based on the HTTP protocol, which is an adaptive bitrate streaming technology that can achieve high-quality video streaming from a traditional HTTP web server. Video streaming. With this technology, content can be served to viewers at different bitrates. The YouTube client automatically adapts video rendering to the viewer's internet connection speed to minimize buffering times.

I once discussed YouTube's video transcoding process in a dedicated article, see "How YouTube provides high-quality videos with low latency".

So, here is a quick introduction to the back-end technology of the platform. The main database used by YouTube is MySQL. Now, let’s find out why YouTube’s engineering team felt the need to write Vitess? What problems did they face with their original MySQL environment that led them to implement an additional framework on top of it?

3. Why you need Vitess

The website initially has only one database instance. As the website grows, developers have to horizontally expand the database in order to meet the increasing QPS (queries per second) requirements.

3.1 Master-Slave Replica

The replica will be added to the master database instance. Read requests are routed to the primary database and replicas to reduce the load on the primary database. Adding replicas helps alleviate bottlenecks, increase read throughput, and increase the durability of the system.

The master node handles write traffic, and the master node and replica node handle read traffic at the same time.

How does YouTube save huge video files?

However, in this scenario, it is possible to read stale data from the replica. If a request reads the replica's data before the master updates the information to the replica, the viewer will get stale data.

At this time, the data of the primary node and the replica node are inconsistent. In this case, the inconsistent data is the number of views of a specific video on the primary and replica nodes.

Actually, this is no problem at all. Viewers won’t mind a slight inconsistency in view counts, right? What's more, the video can be rendered in their browser.

The data between the master node and the replica node will eventually be consistent.

So the engineers were very happy and the audience was also very happy. With the introduction of replicas, things are progressing smoothly.

The website continues to be popular and QPS continues to rise. The master-slave replica strategy is now having difficulty keeping up with the growth of website traffic.

What should we do now?

3.2 Sharding

The next strategy is to shard the database. Sharding is one of the ways to extend relational databases in addition to master-slave replicas, master-master replicas, federations, and de-normalization.

Database sharding is not a simple process. It greatly increases the complexity of the system and makes management more difficult.

However, the database must be sharded to meet the growth of QPS. After developers shard the database, the data is spread across multiple machines. This increases the write throughput of the system. Now, instead of just one master instance handling writes, write operations can occur across multiple sharded machines.

At the same time, a separate copy is created for each machine for redundancy and throughput.

The popularity of the platform continues to rise, with large amounts of data being added to the database by content creators.

In order to prevent data loss or service unavailability caused by machine failure or unknown external events, it is necessary to add disaster management functions to the system.

3.3 Disaster Management

Disaster management refers to emergency measures in the face of power outages and natural disasters (such as earthquakes and fires). It needs to be redundant and back up user data to data centers in different geographical areas of the world. Loss of user data or service unavailability is not permitted.

Having multiple data centers around the world also helps YouTube reduce system latency, as user requests are routed to the nearest data center instead of being routed to origin servers located on different continents.

Now, you can imagine how complex the infrastructure can become.

Often unoptimized full table scans cause the entire database to crash. Databases must be protected from bad queries. All servers need to be tracked to ensure efficient service.

Developers need a system that abstracts the complexity of the system, allows them to solve scalability challenges, and manages the system with minimal cost. All this led YouTube to develop Vitess.

4.Vitess: A system for horizontal expansion of MySQL database cluster

Vitess is a database cluster system running on MySQL that enables MySQL to expand horizontally. It has built-in sharding features that allow developers to scale the database without having to add any sharding logic to the application. This is similar to what NoSQL does.

How does YouTube save huge video files?

#Vitess also handles failover and backup automatically. It manages servers and improves database performance by intelligently rewriting resource-intensive queries and implementing caching. In addition to YouTube, the framework is also used by other well-known players in the industry, such as GitHub, Slack, Square, New Relic, etc.

Vitess comes into play when you need support for ACID transactions and strong consistency, and at the same time want to quickly scale a relational database like a NoSQL database.

At YouTube, each MySQL connection has a 2MB overhead. Each connection has a calculated cost, and as the number of connections increases, additional RAM must be added.

Vitess is able to manage these connections at a very low cost through a connection pool built on the Go programming language’s concurrency support. It uses Zookeeper to manage the cluster and keep it up to date.

5. Deploy to the cloud

Vitess is cloud native and is well suited for cloud deployment because, like the cloud model, capacity is gradually added to the database. It can run as a Kubernetes-aware, cloud-native distributed database.

At YouTube, Vitess runs in a containerized environment and uses Kubernetes as the container orchestration tool.

In today's computing era, every large-scale service runs in the cloud in a distributed environment. There are many benefits to running services in the cloud.

Google Cloud Platform is a set of cloud computing services based on the same infrastructure used by Google's internal end-user products such as Google Search and YouTube.

Every large-scale online service has a polyglot persistence architecture because no one data model, whether relational or NoSQL, can handle all usage scenarios of the service.

In research for this article, I was unable to find a list of specific Google Cloud databases used by YouTube, but I am pretty sure it uses GCP-specific products such as Google Cloud Spanner, Cloud SQL, Cloud Datastore , Memorystore, etc. to run different features of the service.

This article details the databases used by other Google services, such as Google Adwords, Google Finance, Google Trends, etc.

6.CDN

YouTube uses Google’s global network for low-latency, low-cost content delivery. With globally distributed POP edge points, it enables customers to obtain data faster without having to fetch it from the origin server.

So, so far, I have talked about the databases, frameworks and technologies used by YouTube. Now, it's time to talk about storage.

How does YouTube store such a huge amount of data (500 hours of video content uploaded every minute)?

7. Data storage: How does YouTube store such a huge amount of data?

Videos will be stored on hard drives in Google data centers. This data is managed by Google File System and BigTable.

GFS Google File System is a distributed file system developed by Google for managing large-scale data in distributed environments.

BigTable is a low-latency distributed data storage system built on the Google File System, used to process petabytes of data distributed across thousands of machines. It is used in more than 60 Google products.

Therefore, the video is stored on the hard drive. Relationships, metadata, user preferences, profile information, account settings, related data needed to get the video from storage, etc. are all stored in MySQL.

How does YouTube save huge video files?

7.1 Plug-and-play commercial servers

Google data centers have homogeneous hardware and software is built in-house , managing thousands of independent server clusters.

The servers deployed by Google can enhance the storage capabilities of the data center. They are all commercial servers (commodity servers), also known as commercial off-the-shelf servers (commercial off-the-shelf servers). These servers are low-priced, widely available and purchased in large quantities, and can replace or configure the same hardware in the data center at minimal cost and expense.

As the need for additional storage increases, new commodity servers will be plugged into the system.

After problems occur, commercial servers are often replaced instead of repaired. They are not custom-made, and using them allows businesses to reduce infrastructure costs to a significant extent compared to running custom-made servers.

7.2 Storage Disks Designed for Data Centers

YouTube requires over a petabyte of new storage every day. Spinning hard drives are the primary storage medium due to their low cost and high reliability.

SSD Solid-state drives have higher performance than spinning disks because they are based on semiconductors, but using them on a large scale is not cost-effective.

They are quite expensive and prone to losing data over time. This makes them unsuitable for storage of archived data.

In addition, Google is developing a new series of disks suitable for large-scale data centers.

There are five key metrics that can be used to judge the quality of hardware built for data storage:

The hardware should be capable of supporting high-speed input and output operations on the order of seconds.
It should comply with the security standards specified by the organization.
It should have higher storage capacity compared to ordinary storage hardware.
Hardware purchase costs, electricity costs and maintenance costs should all be acceptable.
The disk should be reliable and latency stable.

The above is the detailed content of How does YouTube save huge video files?. For more information, please follow other related articles on the PHP Chinese website!

Statement

This article is reproduced at:51CTO.COM. If there is any infringement, please contact admin@php.cn delete

Python vs. C : Learning Curves and Ease of UseApr 19, 2025 am 12:20 AM

Python is easier to learn and use, while C is more powerful but complex. 1. Python syntax is concise and suitable for beginners. Dynamic typing and automatic memory management make it easy to use, but may cause runtime errors. 2.C provides low-level control and advanced features, suitable for high-performance applications, but has a high learning threshold and requires manual memory and type safety management.

Python vs. C : Memory Management and ControlApr 19, 2025 am 12:17 AM

Python and C have significant differences in memory management and control. 1. Python uses automatic memory management, based on reference counting and garbage collection, simplifying the work of programmers. 2.C requires manual management of memory, providing more control but increasing complexity and error risk. Which language to choose should be based on project requirements and team technology stack.

Python for Scientific Computing: A Detailed LookApr 19, 2025 am 12:15 AM

Python's applications in scientific computing include data analysis, machine learning, numerical simulation and visualization. 1.Numpy provides efficient multi-dimensional arrays and mathematical functions. 2. SciPy extends Numpy functionality and provides optimization and linear algebra tools. 3. Pandas is used for data processing and analysis. 4.Matplotlib is used to generate various graphs and visual results.

Python and C : Finding the Right ToolApr 19, 2025 am 12:04 AM

Whether to choose Python or C depends on project requirements: 1) Python is suitable for rapid development, data science, and scripting because of its concise syntax and rich libraries; 2) C is suitable for scenarios that require high performance and underlying control, such as system programming and game development, because of its compilation and manual memory management.

Python for Data Science and Machine LearningApr 19, 2025 am 12:02 AM

Python is widely used in data science and machine learning, mainly relying on its simplicity and a powerful library ecosystem. 1) Pandas is used for data processing and analysis, 2) Numpy provides efficient numerical calculations, and 3) Scikit-learn is used for machine learning model construction and optimization, these libraries make Python an ideal tool for data science and machine learning.

Learning Python: Is 2 Hours of Daily Study Sufficient?Apr 18, 2025 am 12:22 AM

Is it enough to learn Python for two hours a day? It depends on your goals and learning methods. 1) Develop a clear learning plan, 2) Select appropriate learning resources and methods, 3) Practice and review and consolidate hands-on practice and review and consolidate, and you can gradually master the basic knowledge and advanced functions of Python during this period.

Python for Web Development: Key ApplicationsApr 18, 2025 am 12:20 AM

Key applications of Python in web development include the use of Django and Flask frameworks, API development, data analysis and visualization, machine learning and AI, and performance optimization. 1. Django and Flask framework: Django is suitable for rapid development of complex applications, and Flask is suitable for small or highly customized projects. 2. API development: Use Flask or DjangoRESTFramework to build RESTfulAPI. 3. Data analysis and visualization: Use Python to process data and display it through the web interface. 4. Machine Learning and AI: Python is used to build intelligent web applications. 5. Performance optimization: optimized through asynchronous programming, caching and code

Python vs. C : Exploring Performance and EfficiencyApr 18, 2025 am 12:20 AM

Python is better than C in development efficiency, but C is higher in execution performance. 1. Python's concise syntax and rich libraries improve development efficiency. 2.C's compilation-type characteristics and hardware control improve execution performance. When making a choice, you need to weigh the development speed and execution efficiency based on project needs.

See all articles