Betting the Farm on MongoDB

This is a guest post by Jon Dokulil, VP of Engineering at Hudl. Hudl’s CTO, Brian Kaiser, will be speaking at MongoDB World about migrating from SQL Server to MongoDB.

Hudl helps coaches win. We give sports teams from peewee to the pros online tools to make working with and analyzing video easy. Today we store well over 600 million video clips in MongoDB spread across seven shards. Our clips dataset has grown to over 350GB of data with over 70GB of indexes. From our first year of a dozen beta high schools we’ve grown to service the video needs of over 50,000 sports teams worldwide.

Why MongoDB

When we began hacking away on Hudl we chose SQL Server as our database. Our backend is written primarily in C#, so it was a natural choice. After a few years and solid company growth we realized SQL Server was quickly becoming a bottleneck. Because we run in EC2, vertically scaling our DB was not a great option. That’s when we began to look at NoSQL seriously and specifically MongoDB. We wanted something that was fast, flexible and developer-friendly.

After comparing a few alternative NoSQL databases and running our own benchmarks, we settled on MongoDB. Then came the task of moving our existing data from SQL Server to MongoDB. Video clips were not only our biggest dataset but also our most frequently accessed data. During our busy season we average 75 clip views per second but peak at over 800 per second. We wanted to migrate the dataset with zero downtime and zero data loss. We also wanted fail-safes ready during each step of the process so we could recover immediately from any unanticipated problems during the migration.

In this post we’ll take a look at our schema design choices, our migration plan and the performance we’ve seen with MongoDB.

Schema Design

In SQL Server we normalized our data model. Pulling together data from multiple tables is SQL’s bread-and-butter. In the NoSQL world joins are not an option and we knew that simply moving the SQL tables directly over to MongoDB and doing joins in code was a bad idea. So, we looked at how our application interacted with SQL and created an optimized schema in MongoDB.

Before I get into the schema we chose, I’ll try to provide context to Hudl’s product. Below is a screenshot of our ‘Library’ page. This is where coaches spend much of their time reviewing and analyzing video.

Above you see a video playing with a kind of spreadsheet underneath. The video represents one angle of one clip (many of our teams film two or three angles each game). The spreadsheet contains rows of clips and columns of breakdown data. The breakdown data gives context to what happened in the clip. For example, the second clip was a defensive play from the 30 yard line. It was first and ten and was a run play to the left. This breakdown data is incredibly important for coaches to spot patterns and trends in their opponents’ play (as well as to make sure they don’t have obvious patterns that could be used against them).

When we translated this schema to MongoDB we wanted to optimize for the most-common operations. Watching video clips and editing clip metadata are our two highest frequency operations. To maximize performance we made a few important decisions.

  1. We chose to encapsulate an entire clip per document. Watching a clip involves a single document lookup. Because MongoDB stores each document contiguously on disk, this minimizes the number of disk seeks when fetching a clip that isn’t in memory, which means faster clip loads. (A sketch of the resulting document shape follows this list.)
  2. We denormalized our column names to speed up both writes and reads. Writes are faster because we no longer have to look up or track Column IDs. A write operation is as simple as:

        db.clips.update({teamId: 205, _id: 123},
                        {$set: {'data.PLAY TYPE': 'Pass'}})

    Reads are also faster because we no longer have to join on the ClipDataColumn table to get the column names. This comes at the cost of greater storage and memory requirements, as we store the same column names in multiple documents. Despite that, we felt the performance benefits were worth the cost.
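
To make that shape concrete, here is a simplified sketch of what a single clip document might look like. Only teamId, _id and 'data.PLAY TYPE' appear in the examples above; the remaining field names are hypothetical and abbreviated for illustration:

    // A hypothetical, simplified clip document. Breakdown columns live
    // directly in 'data', keyed by their denormalized column names.
    {
        _id: 123,
        teamId: 205,
        angles: [                          // one entry per camera angle filmed
            { angleId: 1, fileUrl: '...' }
        ],
        data: {
            'PLAY TYPE': 'Pass',
            'DN': 1,                       // down
            'DIST': 10                     // distance
        }
    }

Because the whole clip, breakdown data included, lives in one document, loading a clip is a single find() by teamId and _id rather than a multi-table join.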

One of the most important considerations when designing a schema in MongoDB is choosing a shard key. Having a good shard key is critical for effective horizontal scaling. Data is stored in shards (each shard is a replica set) and we can add new shards easily as our dataset grows. Replica sets don’t need to know about each other; they are only concerned with their own data. The MongoDB router (mongos) is the piece that sees the whole picture. It knows which shard houses each document.

When you perform a query against a sharded collection, the shard key is not required. However, there is a cost penalty for not providing it: the shard key is what tells mongos which shard contains the answer to your query, and without it the query has to be sent to every shard in the cluster. To illustrate this, imagine a four-shard cluster where the shard key is TeamId (the property is named ‘t’): clips belonging to teams 1-100 live on Shard 1, teams 101-200 live on Shard 2, and so on. Given a query to find clip ‘123’ without the shard key, only Shard 3 will respond with results, but Shards 1, 2 and 4 must also process and execute the query. This is known as a scatter/gather query. At low volume this is OK, but you won’t see the benefits of horizontal scalability if every query has to be sent to all shards. Only when the shard key is provided can the query be sent directly to Shard 3; that is known as a targeted query.
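
In shell terms, using the ‘t’ property from the example above (the ‘hudl.clips’ namespace is assumed for illustration):

    // Enable sharding and shard the clips collection on the team id
    sh.enableSharding('hudl')
    sh.shardCollection('hudl.clips', { t: 1 })

    // Scatter/gather: no shard key, so mongos must ask every shard
    db.clips.find({ _id: 123 })

    // Targeted: the shard key lets mongos route straight to Shard 3
    db.clips.find({ t: 205, _id: 123 })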

For our Clips collection, we chose TeamId as our shard key. We looked at a few different possible shard keys:

  1. We considered sharding by clipId (_id) but decided against it because we let coaches organize clips into playlists (similar to a song playlist in iTunes or Spotify). While queries for all clips in a playlist are less common than grabbing an individual clip, they are common enough that we wanted them to use targeted queries.
  2. We also considered sharding by playlist ID, but we wanted the ability for clips to be part of multiple playlists, and clips can be added to or removed from playlists at any time, while a shard key, once set, is immutable.
  3. We finally settled on TeamId. TeamId is easily available to us when making the vast majority of our queries to the Clips collection. Only for a few infrequent operations would we need to use scatter/gather queries.

The Transition

As I mentioned, we needed to transition from SQL Server to MongoDB with zero downtime. In case anything went wrong, we needed fallbacks and fail-safes along the way. Our approach was two-fold. In the background we ran a process that ‘fork-lifted’ data from SQL Server to MongoDB. While that ran in the background, we created a multiplexed DAO (data access object, our db abstraction layer) that would only read from SQL but would write to both SQL and MongoDB. That allowed us to batch-move all clips without having to worry about stale data. Once the two databases were completely synced up, we switched over to perform all reads from MongoDB. We continued to dual-write so we could easily switch back to SQL Server if problems arose. After we felt confident in our MongoDB solution, we pulled the plug on SQL Server.
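
Below is a minimal sketch of that multiplexing idea. Our real DAO is written in C#; this JavaScript version is illustrative only, and the class and method names are hypothetical:

    // Hypothetical multiplexed DAO: reads come from whichever store is
    // currently the source of truth, while writes always go to both.
    class MultiplexedClipDao {
        constructor(sqlDao, mongoDao, readFromMongo) {
            this.sqlDao = sqlDao;
            this.mongoDao = mongoDao;
            this.readFromMongo = readFromMongo; // flip once the stores are in sync
        }

        async getClip(teamId, clipId) {
            const dao = this.readFromMongo ? this.mongoDao : this.sqlDao;
            return dao.getClip(teamId, clipId);
        }

        async saveClip(clip) {
            await this.sqlDao.saveClip(clip);   // existing source of truth
            await this.mongoDao.saveClip(clip); // kept in sync for the cutover
        }
    }

Because every write lands in both stores, either database can serve as the fallback at any point in the migration.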

In step one we took a look at how we read and wrote clip data. That let us design an optimal MongoDB schema. We then refactored our existing database abstraction layer to use data-structures that matched the MongoDB schema. This gave us a chance to prove out the schema ahead of time.

Next we began sending write operations to both SQL and MongoDB. This was an important step because it allowed our data fork-lifting process to work through all clips one after another while protecting us from data corruption.

The data fork-lifting process took about a week to complete. The time was due to both the large size of the dataset and our own throttling logic. We throttled the rate of data migration to minimize the impact on normal operations. We didn’t want coaches to feel any pain during this migration.
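
The forklift itself can be as simple as a batched copy loop with a pause between batches. This is a hedged sketch using the same hypothetical DAO names as above, not our production code:

    // Hypothetical throttled forklift: copy clips in id order, in small
    // batches, sleeping between batches so live traffic isn't starved.
    const BATCH_SIZE = 500;
    const PAUSE_MS = 200;

    async function forkliftClips(sqlDao, mongoDao) {
        let lastId = 0;
        for (;;) {
            const batch = await sqlDao.getClipsAfter(lastId, BATCH_SIZE);
            if (batch.length === 0) break;  // caught up; dual writes cover the rest
            await mongoDao.insertClips(batch);
            lastId = batch[batch.length - 1].id;
            await new Promise(resolve => setTimeout(resolve, PAUSE_MS)); // throttle
        }
    }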

After the data fork-lift was complete we began the process of reading from MongoDB. We built in the ability to progressively send more and more read traffic to MongoDB. That allowed us to gain confidence in our code and the MongoDB cluster without having to switch all at once. After a while with dual writes but all reads going to MongoDB, we turned off dual writes and dropped the tables in SQL Server. It was both a scary moment (sure, we had backups… but still!) and very satisfying. Our SQL database size was reduced by over 80GB. Of that total, 20GB was index data, which means our memory footprint was also greatly reduced.
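
One hedged sketch of that progressive cutover: treat the MongoDB read percentage as a config value and roll a die per request (the function name and the way the percentage is sourced are assumptions):

    // Hypothetical percentage-based rollout: send a configurable slice of
    // read traffic to MongoDB while the rest still hits SQL Server.
    function shouldReadFromMongo(mongoReadPercent) {
        return Math.random() * 100 < mongoReadPercent; // e.g. 1 -> 10 -> 50 -> 100
    }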

Performance

We have been thrilled with the performance of MongoDB. It beat our average performance goal of 100ms and, just as important, is consistently performant. While it’s good to keep an eye on average times, it’s more important to watch the 90th and 99th percentile performance metrics. With MongoDB, our average clip load time is around 18ms and our 99th percentile times are typically at or under 100ms.

[Chart: clip load times over the same in-season period]

Conclusion

Our transition from SQL Server to MongoDB started with our largest and most critical dataset. Having gone through it, we are very happy with the performance and scalability of MongoDB, and we appreciate how developer-friendly it is to work with. Moving from a relational to a NoSQL database naturally has a learning curve. Now that we are over it, we feel very good about our ability to scale well into the future. Perhaps most telling of all, most new feature development at Hudl is done in MongoDB. We feel MongoDB lets us spend more time writing features to help coaches win and less time crafting database scripts.

