A Data Pipeline for a Million Movies and Million Streaming Links
Feb 2023: I wanted to see all scores for movies and TV shows plus where to stream them on one page, but couldn't find an aggregator that included all the sources that were relevant for me.
Mar 2023: So, I built an MVP that grabbed scores on the fly and put the site online. It worked, but was slow (10 seconds to display scores).
Oct 2023: Realizing that storing the data on my side was a necessity, I discovered windmill.dev. It easily eclipses similar orchestration engines - at least for my needs.
Fast forward to today: after 12 months of continuous data munching, I want to share how the pipeline works in detail. You'll learn how to build a complex system that grabs data from many different sources, normalizes it and combines it into an optimized format for querying.
This is the Runs view. Every dot represents a flow run. A flow can be anything, for example a simple one-step script:
The block in the center contains a script like this (simplified):
def main():
    return tmdb_extract_daily_dump_data()

def tmdb_extract_daily_dump_data():
    print("Checking TMDB for latest daily dumps")
    init_mongodb()
    daily_dump_infos = get_daily_dump_infos()
    for daily_dump_info in daily_dump_infos:
        download_zip_and_store_in_db(daily_dump_info)
    close_mongodb()
    return [info.to_mongo() for info in daily_dump_infos]

[...]
The following beast is also a flow (remember, this is only one of the green dots):
(higher resolution image: https://i.imgur.com/LGhTUGG.png)
Let's break this one down:
Each of those steps is more or less complex and involves asynchronous processes.
To determine which titles to pick next, there are two lanes that are processed in parallel. This is another area where Windmill shines: parallelization and orchestration work flawlessly with their architecture.
The two lanes to pick the next item are:
First of all, titles that don't have any data attached yet are selected for each data source. That means if the Metacritic pipeline has a movie that hasn't been scraped yet, it will be selected next. This ensures that every title is processed at least once, including new ones.
Once every title has data attached, the pipeline selects those with the least recent data.
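As a rough sketch of how this selection could look (not the actual GoodWatch code; the connection string, collection layout and field names like fetched_at are assumptions for illustration):

from pymongo import ASCENDING, MongoClient

def select_next_titles(collection_name: str, batch_size: int = 50) -> list[dict]:
    # Hypothetical connection and collection names, for illustration only.
    client = MongoClient("mongodb://localhost:27017")
    collection = client["goodwatch"][collection_name]

    # 1) Titles that were never fetched for this data source come first.
    missing = list(collection.find({"fetched_at": None}).limit(batch_size))
    if len(missing) >= batch_size:
        return missing

    # 2) Fill the rest of the batch with the least recently fetched titles.
    stale = list(
        collection.find({"fetched_at": {"$ne": None}})
        .sort("fetched_at", ASCENDING)
        .limit(batch_size - len(missing))
    )
    return missing + stale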
Here is an example of such a flow run, this one with an error because the rate limit was hit:
Windmill allows you to easily define retries for each step in the flow. In this case, the logic is to retry three times on error - unless the rate limit was hit (usually recognizable by a different status code or error message), in which case we stop immediately.
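Windmill configures retries per step in the flow editor, so the following is only a language-level sketch of the same logic - retry up to three times, but give up immediately when a hypothetical RateLimitError signals that the rate limit was hit:

import time

class RateLimitError(Exception):
    """Hypothetical exception raised when the source responds with e.g. HTTP 429."""

def fetch_with_retries(fetch, max_attempts: int = 3, delay_seconds: float = 5.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except RateLimitError:
            # Retrying won't help against a rate limit: stop immediately.
            raise
        except Exception:
            if attempt == max_attempts:
                raise
            # Transient error: wait a bit and try again.
            time.sleep(delay_seconds)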
This first lane works, but has a serious issue: recent releases are not updated quickly enough. It can take weeks or even months until every data aspect has been fetched successfully. For example, a movie might have a recent IMDb score while the other scores are outdated and the streaming links are missing completely. Especially for scores and streaming availability I wanted to achieve much better accuracy.
To solve this problem, the second lane uses a different prioritization strategy: the most popular and trending movies/shows are selected for a complete data refresh across all data sources. This is the flow I showed before - the one I referred to as the beast earlier.
Titles that are shown more often in the app get a priority boost as well: every time a movie or show comes up in the top search results or its details view is opened, it will likely be refreshed soon.
Every title can only be refreshed once per week using the priority lane to ensure that we don't fetch data that likely hasn't changed in the meantime.
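A simplified sketch of that priority selection, assuming hypothetical popularity and refreshed_at fields on each title document (again, not the actual GoodWatch code):

from datetime import datetime, timedelta, timezone
from pymongo import DESCENDING, MongoClient

def select_priority_titles(batch_size: int = 20) -> list[dict]:
    # Hypothetical collection layout; field names are assumptions for illustration.
    client = MongoClient("mongodb://localhost:27017")
    titles = client["goodwatch"]["titles"]

    one_week_ago = datetime.now(timezone.utc) - timedelta(days=7)
    return list(
        titles.find({
            # Skip anything refreshed within the last week (priority-lane cooldown).
            "$or": [
                {"refreshed_at": None},
                {"refreshed_at": {"$lt": one_week_ago}},
            ],
        })
        # Popularity already includes the boost from search results and detail views.
        .sort("popularity", DESCENDING)
        .limit(batch_size)
    )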
You might ask: is scraping legal? The act of grabbing the data is normally fine. What you do with the data needs careful consideration, though. As soon as you make a profit from a service that uses scraped data, you are probably violating their terms and conditions (see The Legal Landscape of Web Scraping and ‘Scraping’ Is Just Automated Access, and Everyone Does It).
Scraping and the laws around it are new and often untested, and there is a lot of legal gray area. I'm determined to cite every source accordingly, respect rate limits and avoid unnecessary requests to minimize the impact on their services.
Fact is, the data will not be used to make a profit. GoodWatch will be free to use for everyone, forever.
Windmill uses workers to distribute code execution across multiple processes. Each step in a flow is sent to a worker, which keeps the workers independent of the actual business logic. Only the main app orchestrates the jobs, whereas workers just receive input data and the code to execute, and return the result.
It's an efficient architecture that scales nicely. Currently, there are 12 workers splitting the work. They're all hosted on Hetzner.
Each worker has a maximum resource consumption of 1 vCPU and 2 GB of RAM. Here is an overview:
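Conceptually - and this is only a toy illustration, not Windmill's actual implementation - the division of labor looks roughly like this:

import queue
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Job:
    # A job bundles the code to execute with its input data - nothing more.
    func: Callable[[Any], Any]
    payload: Any

def worker_loop(jobs: "queue.Queue[Job | None]", results: "queue.Queue[Any]") -> None:
    # The worker knows nothing about the business logic: it just runs what it receives
    # and reports the result back. In Windmill, workers are separate processes pulling
    # jobs from the orchestrator, not threads sharing an in-memory queue.
    while True:
        job = jobs.get()
        if job is None:  # sentinel from the orchestrator: no more work
            break
        results.put(job.func(job.payload))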
Windmill offers an in-browser IDE-like editor experience with linting, auto-formatting, an AI assistant and even collaborative editing (last one is a paid feature). The best thing is this button though:
It allows me to quickly iterate and test scripts before deploying them. I usually edit and test files in the browser and push them to git when I'm finished.
The only thing missing for an optimal coding environment is debugging tooling (breakpoints and variable context). Currently, I debug scripts in my local IDE to work around this weakness.
Me too!
Currently GoodWatch requires around 100 GB of persistent data storage:
Every day, 6,500 flows run through Windmill's orchestration engine. This results in a daily volume of:
These numbers are fundamentally different because of different rate limit policies.
Once per day, data is cleaned up and combined into the final data format. Currently the database that powers the GoodWatch webapp stores:
Imagine you could only distinguish movies by their genre. Extremely limiting, right?
That's why I started the DNA project. It allows categorizing movies and shows by other attributes like Mood, Plot Elements, Character Types, Dialog or Key Props.
Here are the top 10 DNA values across all items:
It allows two things:
Examples:
There will be a dedicated blog post about the DNA with many more details in the future.
To fully understand how the data pipeline works, here is a breakdown of what happens for each data source:
For each data source, there is an init flow that prepares a MongoDB collection with all required data. For IMDb, that's just the imdb_id. For Rotten Tomatoes, the title and release_year are required, because the ID is unknown and we need to guess the correct URL based on the name.
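A minimal sketch of such an init step for the Rotten Tomatoes collection (the connection string, collection and field names are assumptions, not the actual GoodWatch flow):

from pymongo import MongoClient, UpdateOne

def init_rotten_tomatoes(titles: list[dict]) -> None:
    # Seed the per-source collection with just enough data to guess the URL later.
    client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
    collection = client["goodwatch"]["rotten_tomatoes"]
    collection.create_index("tmdb_id", unique=True)

    operations = [
        UpdateOne(
            {"tmdb_id": title["tmdb_id"]},
            {"$setOnInsert": {
                "title": title["title"],
                "release_year": title["release_year"],
                "fetched_at": None,  # marks the entry as "never scraped"
            }},
            upsert=True,
        )
        for title in titles
    ]
    if operations:
        collection.bulk_write(operations, ordered=False)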
Based on the priority selection explained above, items in the prepared collections are updated with the data that is fetched. Each data source has its own collection, which gets more and more complete over time.
There is a flow for movies, one for TV shows and another one for streaming links. They collect all necessary data from the various collections and store it in their respective Postgres tables, which are then queried by the web application.
Here is an excerpt of the copy movies flow and script:
Some of these flows take a long time to execute, sometimes even longer than 6 hours. This can be optimized by flagging all items that were updated and only copying those instead of batch-processing the whole data set. One of many TODO items on my list.
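A sketch of how that incremental copy could work, assuming an updated_at flag on the source documents and using psycopg2 for the Postgres side (table and column names are made up for illustration):

from datetime import datetime
from pymongo import MongoClient
import psycopg2
from psycopg2.extras import execute_values

def copy_updated_movies(last_run: datetime) -> None:
    # Only copy documents that changed since the previous run instead of the whole data set.
    mongo = MongoClient("mongodb://localhost:27017")  # placeholder connection string
    changed = mongo["goodwatch"]["movies"].find({"updated_at": {"$gt": last_run}})

    rows = [(m["tmdb_id"], m["title"], m.get("release_year")) for m in changed]
    if not rows:
        return

    with psycopg2.connect("dbname=goodwatch") as conn, conn.cursor() as cur:
        # One upsert statement per batch instead of one query per movie.
        execute_values(
            cur,
            """
            INSERT INTO movies (tmdb_id, title, release_year)
            VALUES %s
            ON CONFLICT (tmdb_id) DO UPDATE
            SET title = EXCLUDED.title,
                release_year = EXCLUDED.release_year
            """,
            rows,
        )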
Scheduling is as easy as defining cron expressions for each flow or script that needs to be executed automatically:
Here is an excerpt of all schedules that are defined for GoodWatch:
In total there are around 50 schedules defined.
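For illustration (flow paths and timings are made up, not the actual GoodWatch schedules), such a mapping of flows to cron expressions could look like this:

# Illustrative only - paths and timings are assumptions.
SCHEDULES = {
    "f/tmdb/extract_daily_dump_data": "0 4 * * *",     # once per day at 04:00
    "f/priority/refresh_trending":    "*/30 * * * *",  # every 30 minutes
    "f/postgres/copy_movies":         "0 2 * * *",     # nightly copy at 02:00
}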
With great data comes great responsibility. Lots can go wrong. And it did.
Early versions of my scripts took ages to update all entries in a collection or table because I upserted every item individually. That causes a lot of overhead and slows down the process significantly.
A much better approach is to collect the data to be upserted and batch the database queries.
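A minimal sketch of this pattern for MongoDB with pymongo's bulk_write (collection and key names are assumptions, not the exact GoodWatch script):

from pymongo import MongoClient, UpdateOne

def upsert_movies(movies: list[dict]) -> None:
    client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
    collection = client["goodwatch"]["tmdb_movies"]    # hypothetical collection name

    # Collect one operation per item instead of round-tripping each upsert individually.
    operations = [
        UpdateOne({"tmdb_id": movie["tmdb_id"]}, {"$set": movie}, upsert=True)
        for movie in movies
    ]
    if operations:
        # A single bulk_write sends the whole batch in one go.
        result = collection.bulk_write(operations, ordered=False)
        print(f"upserted={result.upserted_count} modified={result.modified_count}")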
Even with batch processing, some scripts consumed so much memory that the workers crashed. The solution was to carefully fine-tune the batch size for every use case.
Some batches are fine running in steps of 5,000; others hold much more data in memory and run better with 500.
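A simple chunking helper makes that batch size easy to tune per use case; a sketch:

from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def in_batches(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    # Yield fixed-size chunks so only one batch is held in memory at a time.
    batch: list[T] = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Memory-heavy documents get a smaller batch size, lightweight ones a larger one, e.g.:
# for chunk in in_batches(collection.find(), batch_size=500):
#     upsert_movies(chunk)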
Windmill has a great feature to observe the memory while a script is running:
Windmill is a great asset in any developer's toolkit for automating tasks. It's been an invaluable productivity booster for me, allowing me to focus on the flow structure and business logic while outsourcing the heavy lifting of task orchestration, error handling, retries and caching.
Handling large volumes of data is still challenging, and optimizing the pipeline is an ongoing process - but I'm really happy with how everything has turned out so far.
Thought so. Just let me link a few resources and we're finished:
Did you know that GoodWatch is open-source? You can take a look at all scripts and flow definitions in this repository: https://github.com/alp82/goodwatch-monorepo/tree/main/goodwatch-flows/windmill/f
Let me know if you have any questions.