
This is a guest post by Jon Dokulil, VP of Engineering at Hudl. Hudl’s CTO, Brian Kaiser, will be speaking at MongoDB World about migrating from SQL Server to MongoDB

Hudl helps coaches win. We give sports teams from peewee to the pros online tools to make working with and analyzing video easy. Today we store well over 600 million video clips in MongoDB spread across seven shards. Our clips dataset has grown to over 350GB of data with over 70GB of indexes. From our first year of a dozen beta high schools we’ve grown to service the video needs of over 50,000 sports teams worldwide.

Why MongoDB

When we began hacking away on Hudl we chose SQL Server as our database. Our backend is written primarily in C#, so it was a natural choice. After a few years and solid company growth we realized SQL Server was quickly becoming a bottleneck. Because we run in EC2, vertically scaling our DB was not a great option. That’s when we began to look at NoSQL seriously and specifically MongoDB. We wanted something that was fast, flexible and developer-friendly.

After comparing a few alternative NoSQL databases and running our own benchmarks, we settled on MongoDB. Then came the task of moving our existing data from SQL Server to MongoDB. Video clips were not only our biggest dataset but also our most frequently accessed data. During our busy season we average 75 clip views per second but peak at over 800 per second. We wanted to migrate the dataset with zero downtime and zero data loss. We also wanted to have fail-safes ready during each step of the process so we could recover immediately from any unanticipated problems during the migration.

In this post we’ll take a look at our schema design choices, our migration plan and the performance we’ve seen with MongoDB.

Schema Design

In SQL Server we normalized our data model. Pulling together data from multiple tables is SQL’s bread-and-butter. In the NoSQL world joins are not an option and we knew that simply moving the SQL tables directly over to MongoDB and doing joins in code was a bad idea. So, we looked at how our application interacted with SQL and created an optimized schema in MongoDB.

Before I get into the schema we chose, I’ll provide some context on Hudl’s product. The screenshot below shows our ‘Library’ page. This is where coaches spend much of their time reviewing and analyzing video.

The screenshot shows a video playing with a kind of spreadsheet underneath. The video represents one angle of one clip (many of our teams film two or three angles each game). The spreadsheet contains rows of clips and columns of breakdown data. The breakdown data gives context to what happened in the clip. For example, the second clip was a defensive play from the 30 yard line. It was first and ten and was a run play to the left. This breakdown data is incredibly important for coaches to spot patterns and trends in their opponents’ play (as well as make sure they don’t have any obvious patterns that could be used against them).

When we translated this schema to MongoDB we wanted to optimize for the most-common operations. Watching video clips and editing clip metadata are our two highest frequency operations. To maximize performance we made a few important decisions.

  1. We chose to encapsulate an entire clip per document. Watching a clip would involve a single document lookup. Because MongoDB stores each document contiguously on disk, it would minimize the number of disk seeks when fetching a clip not in memory, which means faster clip loads.
  2. We denormalized our column names to speed up both writes and reads. Writes are faster because we no longer have to look up or track Column IDs. A write operation is as simple as:
    db.clips.update({teamId:205, _id:123}, 
    {$set: {'data.PLAY TYPE':'Pass'}}) 
    Reads are also faster because we no longer have to join on the ClipDataColumn table to get the column names. This comes at the cost of greater storage and memory requirements, as we store the same column names in multiple documents. Despite that, we felt the performance benefits were worth it. A sketch of what such a clip document might look like follows this list.
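To make the embedded, denormalized design concrete, here is a rough sketch of what a clip document might look like. The field names and values (angles, videoUrl, the breakdown column labels) are illustrative guesses extrapolated from the update example above, not Hudl’s actual schema:

    db.clips.insert({
        _id: 123,                      // clip id
        teamId: 205,                   // team that owns the clip (also our shard key, see below)
        angles: [                      // one entry per camera angle filmed
            { angle: 1, videoUrl: 'https://example.com/clips/123-1.mp4' },
            { angle: 2, videoUrl: 'https://example.com/clips/123-2.mp4' }
        ],
        data: {                        // denormalized breakdown columns, keyed by column name
            'PLAY TYPE': 'Run',
            'DN': 1,
            'DIST': 10,
            'YARD LN': 30
        }
    })

Because the whole clip, its angles and its breakdown data live in one document, the “watch a clip” path is a single lookup with no joins.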

One of the most important considerations when designing a schema in MongoDB is choosing a shard key. Having a good shard key is critical for effective horizontal scaling. Data is stored in shards (each shard is a replica set) and we can add new shards easily as our dataset grows. Replica sets don’t need to know about each other; they are only concerned with their own data. The MongoDB Router (mongos) is the piece that sees the whole picture. It knows which shard houses each document.

When you perform a query against a sharded collection, the shard key is not required. However, there is a cost penalty for not providing it. The key is what tells mongos which shard contains the answer to your query. Without it, the query has to be sent to every shard in your cluster. To illustrate this, imagine a four-shard cluster. The shard key is TeamId (the property is named ‘t’), and clips belonging to teams 1-100 live on Shard 1, 101-200 live on Shard 2, etc. Given a query to find clip ‘123’, only Shard 3 will respond with results, but Shards 1, 2 and 4 must also process and execute the query. This is known as a scatter/gather query. At low volume this is OK, but you won’t see the benefits of horizontal scalability if every query has to be sent to all shards. Only when the shard key is provided can the query be sent directly to Shard 3. This is known as a targeted query.
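The difference looks like this in the shell (an illustrative sketch; for readability I use the full ‘teamId’ field name from the earlier update example rather than the abbreviated ‘t’):

    // Targeted query: the shard key is in the filter, so mongos routes the
    // request to the single shard that owns teamId 205.
    db.clips.find({teamId: 205, _id: 123})

    // Scatter/gather query: no shard key, so mongos must broadcast the query
    // to every shard and merge the results.
    db.clips.find({_id: 123})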

For our Clips collection, we chose TeamId as our shard key. We looked at a few different possible shard keys:

  1. We considered sharding by clipId (_id) but decided against it because we let coaches organize clips into playlists (similar to a song playlist in iTunes or Spotify). While queries for all clips in a playlist are less common than grabbing an individual clip, they are common enough that we wanted them to be targeted queries.
  2. We also considered sharding by playlist Id, but clips can belong to multiple playlists and can be added to or removed from a playlist at any time, while a shard key, once set, is immutable.
  3. We finally settled on TeamId. TeamId is easily available to us when making the vast majority of our queries to the Clips collection. Only a few infrequent operations need to fall back to scatter/gather queries. The command that declares this choice is sketched just below this list.
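For reference, enabling sharding on TeamId looks roughly like the following (a minimal sketch; the database name ‘hudl’ and the exact field name are assumptions on my part):

    // Enable sharding for the database, then shard the clips collection on
    // the team id (range-based sharding on a single field).
    sh.enableSharding('hudl')
    sh.shardCollection('hudl.clips', {teamId: 1})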

The Transition

As I mentioned, we needed to transition from SQL Server to MongoDB with zero downtime. In case anything went wrong, we needed fallbacks and fail-safes along the way. Our approach was two-fold. In the background we ran a process that ‘fork-lifted’ data from SQL Server to MongoDB. While that ran in the background, we created a multiplexed DAO (data access object, our db abstraction layer) that would only read from SQL but would write to both SQL and MongoDB. That allowed us to batch-move all clips without having to worry about stale data. Once the two databases were completely synced up, we switched over to perform all reads from MongoDB. We continued to dual-write so we could easily switch back to SQL Server if problems arose. After we felt confident in our MongoDB solution, we pulled the plug on SQL Server.
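Our real data access layer is written in C#, but the idea behind the multiplexed DAO is easy to sketch. The snippet below is illustrative JavaScript only, with the two underlying stores abstracted away as injected objects; none of the names here come from Hudl’s codebase:

    // Writes go to both stores; reads come from whichever store is currently
    // authoritative. Flipping readFromMongo is the cut-over (and the fallback).
    class MultiplexedClipDao {
        constructor(sqlDao, mongoDao, readFromMongo) {
            this.sqlDao = sqlDao;
            this.mongoDao = mongoDao;
            this.readFromMongo = readFromMongo;
        }

        async saveClip(clip) {
            // Dual-write keeps SQL Server and MongoDB in step during the migration.
            await this.sqlDao.saveClip(clip);
            await this.mongoDao.saveClip(clip);
        }

        async getClip(teamId, clipId) {
            return this.readFromMongo
                ? this.mongoDao.getClip(teamId, clipId)
                : this.sqlDao.getClip(teamId, clipId);
        }
    }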

In step one we took a look at how we read and wrote clip data. That let us design an optimal MongoDB schema. We then refactored our existing database abstraction layer to use data structures that matched the MongoDB schema. This gave us a chance to prove out the schema ahead of time.

Next we began sending write operations to both SQL and MongoDB. This was an important step because it allowed our data fork-lifting process to work through all clips one after another while protecting us from data corruption.
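A batch copy like this is typically written so that each clip is an upsert keyed on its id: re-copying a clip that has already been dual-written simply rewrites it with the current source-of-truth values from SQL Server. This is a hedged sketch of a single fork-lift write, not Hudl’s actual migration script:

    // For each clip row read out of SQL Server, write the translated document
    // into MongoDB; upsert makes the copy idempotent and safe to repeat.
    db.clips.update(
        {_id: 123, teamId: 205},
        {$set: {'data.PLAY TYPE': 'Pass', 'data.DN': 1, 'data.DIST': 10}},
        {upsert: true}
    )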

The data fork-lifting process took about a week to complete. The time was due to both the large size of the dataset and our own throttling logic. We throttled the rate of data migration to minimize the impact on normal operations. We didn’t want coaches to feel any pain during this migration.

After the data fork-lift was complete we began the process of reading from MongoDB. We built in the ability to progressively send more and more read traffic to MongoDB. That allowed us to gain confidence in our code and the MongoDB cluster without having to switch all-at-once. After a while with dual writes but all MongoDB reads, we turned off dual writes and dropped the tables in SQL Server. It was both a scary moment (sure, we had backups… but still!) and very satisfying. Our SQL database size was reduced by over 80GB. Of that total amount, 20GB was index data, which means our memory footprint was also greatly reduced.
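The progressive cut-over can be as simple as a percentage check in the read path. Again, this is only an illustrative sketch; the function name and the simple random roll are my own, not Hudl’s:

    // Route a configurable percentage of clip reads to MongoDB; the rest keep
    // hitting SQL Server until we trust the new cluster enough to go to 100.
    function shouldReadFromMongo(mongoReadPercent) {
        return Math.random() * 100 < mongoReadPercent;
    }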

Performance

We have been thrilled with the performance of MongoDB. It beat our goal of a 100ms average response time and, just as important, is consistently performant. While it’s good to keep an eye on average times, it’s more important to watch the 90th and 99th percentile performance metrics. With MongoDB our average clip load time is around 18ms and our 99th percentile times are typically at or under 100ms.

[Chart: clip load times over the same in-season period]

Conclusion

Our transition from SQL Server to MongoDB started with our largest and most critical dataset. After having gone through it, we are very happy with the performance and scalability of MongoDB and appreciate how developer-friendly it is to work with. Moving from a relational to a NoSQL database naturally has a learning curve; now that we are past it, we feel very good about our ability to scale well into the future. Perhaps most telling of all, most new feature development at Hudl is now done in MongoDB. We feel MongoDB lets us spend more time writing features that help coaches win and less time crafting database scripts.

