How to scrape Crunchbase using Python (easy guide)

Python developers know the drill: you need reliable company data, and Crunchbase has it. This guide shows you how to build an effective Crunchbase scraper in Python that gets you the data you need.

Crunchbase tracks details that matter: locations, business focus, founders, and investment histories. Manual extraction from such a large dataset isn't practical - automation is essential for transforming this information into an analyzable format.

In this blog, we'll explore three different ways to extract data from Crunchbase using Crawlee for Python. We'll fully implement two of them and discuss the specifics and challenges of the third. This will help us understand why choosing the right data source matters so much.

Note: This guide comes from a developer in our growing community. Have you built interesting projects with Crawlee? Join us on Discord to share your experiences and blog ideas - we value these contributions from developers like you.

Key steps we'll cover:

  1. Project setup
  2. Choosing the data source
  3. Implementing sitemap-based crawler
  4. Analysis of search-based approach and its limitations
  5. Implementing the official API crawler
  6. Conclusion and repository access

Prerequisites

  • Python 3.9 or higher
  • Familiarity with web scraping concepts
  • Crawlee for Python v0.5.0
  • poetry v2.0 or higher

Project setup

Before we start scraping, we need to set up our project. In this guide, we won't be using crawler templates (Playwright and BeautifulSoup), so we'll set up the project manually.

  1. Install Poetry

    pipx install poetry
    
  2. Create and navigate to the project folder.

    mkdir crunchbase-crawlee && cd crunchbase-crawlee
    
  3. Initialize the project using Poetry, leaving all fields empty.

    poetry init
    

    When prompted:

    • For "Compatible Python versions", enter: >={your Python version}, for example >=3.10
    • Leave all other fields empty by pressing Enter
    • Confirm the generation by typing "yes"
  4. Add and install Crawlee with necessary dependencies to your project using Poetry.

    poetry add crawlee[parsel,curl-impersonate]
    
  5. Complete the project setup by creating the standard file structure for Crawlee for Python projects.

    mkdir crunchbase-crawlee && touch crunchbase-crawlee/{__init__.py,__main__.py,main.py,routes.py}
    

After setting up the basic project structure, we can explore different methods of obtaining data from Crunchbase.

Choosing the data source

While we can extract target data directly from the company page, we need to choose the best way to navigate the site.

A careful examination of Crunchbase's structure shows that we have three main options for obtaining data:

  1. Sitemap - for complete site traversal.
  2. Search - for targeted data collection.
  3. Official API - recommended method.

Let's examine each of these approaches in detail.

Scraping Crunchbase using sitemap and Crawlee for Python

Sitemaps are a standard way for sites to expose their structure to crawlers like Googlebot, AhrefsBot, and other search engine bots. All crawlers must follow the rules described in robots.txt.

Let's look at the structure of Crunchbase's Sitemap:

[Image: Crunchbase first-level sitemap structure]

As you can see, links to organization pages are located inside second-level Sitemap files, which are compressed using gzip.

The structure of one of these files looks like this:

[Image: second-level sitemap file structure]

The lastmod field is particularly important here. It allows tracking which companies have updated their information since the previous data collection. This is especially useful for regular data updates.
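
For example, on a repeat run you could keep only the entries whose lastmod is newer than your previous collection date. Here's a minimal sketch, assuming lastmod carries a timezone-aware ISO 8601 timestamp:

from datetime import datetime, timezone

# Timestamp of the previous collection, loaded from wherever you persist it
last_run = datetime(2025, 1, 1, tzinfo=timezone.utc)


def is_updated(lastmod: str) -> bool:
    """Return True if a sitemap entry changed since the previous run."""
    return datetime.fromisoformat(lastmod) > last_run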

1. Configuring the crawler for scraping

To work with the site, we'll use CurlImpersonateHttpClient, which impersonates a Safari browser. While this choice might seem unexpected for working with a sitemap, it's necessitated by Crunchbase's protection features.

The reason is that Crunchbase uses Cloudflare to protect against automated access. This is clearly visible when analyzing traffic on a company page:

[Image: challenges.cloudflare request in the company page traffic]

An interesting feature is that challenges.cloudflare is executed after loading the document with data. This means we receive the data first, and only then JavaScript checks if we're a bot. If our HTTP client's fingerprint is sufficiently similar to a real browser, we'll successfully receive the data.

Cloudflare also analyzes traffic at the sitemap level. If our crawler doesn't look legitimate, access will be blocked. That's why we impersonate a real browser.

To prevent blocks due to overly aggressive crawling, we'll configure ConcurrencySettings.

When scaling this approach, you'll likely need proxies. Detailed information about proxy setup can be found in the documentation.

We'll save our scraping results in JSON format. Here's how the basic crawler configuration looks:

# main.py

from crawlee import ConcurrencySettings, HttpHeaders
from crawlee.crawlers import ParselCrawler
from crawlee.http_clients import CurlImpersonateHttpClient

from .routes import router


async def main() -> None:
    """The crawler entry point."""
    concurrency_settings = ConcurrencySettings(max_concurrency=1, max_tasks_per_minute=50)

    http_client = CurlImpersonateHttpClient(
        impersonate='safari17_0',
        headers=HttpHeaders(
            {
                'accept-language': 'en',
                'accept-encoding': 'gzip, deflate, br, zstd',
            }
        ),
    )
    crawler = ParselCrawler(
        request_handler=router,
        max_request_retries=1,
        concurrency_settings=concurrency_settings,
        http_client=http_client,
        max_requests_per_crawl=30,
    )

    await crawler.run(['https://www.crunchbase.com/www-sitemaps/sitemap-index.xml'])

    await crawler.export_data_json('crunchbase_data.json')

2. Implementing sitemap navigation

Sitemap navigation happens in two stages. In the first stage, we need to get a list of all files containing organization information:

# routes.py

from crawlee.crawlers import ParselCrawlingContext
from crawlee.router import Router
from crawlee import Request

router = Router[ParselCrawlingContext]()


@router.default_handler
async def default_handler(context: ParselCrawlingContext) -> None:
    """Default request handler."""
    context.log.info(f'default_handler processing {context.request} ...')

    requests = [
        Request.from_url(url, label='sitemap')
        for url in context.selector.xpath('//loc[contains(., "sitemap-organizations")]/text()').getall()
    ]

    # Since this is a tutorial, I don't want to enqueue more than one sitemap link
    await context.add_requests(requests, limit=1)

In the second stage, we process second-level sitemap files stored in gzip format. This requires a special approach as the data needs to be decompressed first:

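Here's a minimal sketch of such a handler. It assumes the raw response body is exposed via context.http_response.read(), decompresses it with Python's gzip module, and enqueues company pages; the handler name and the 'company' label are just conventions used in this sketch:

# routes.py (continued)

import gzip

from parsel import Selector


@router.handler('sitemap')
async def sitemap_handler(context: ParselCrawlingContext) -> None:
    """Process a second-level, gzip-compressed sitemap file."""
    context.log.info(f'sitemap_handler processing {context.request.url} ...')

    # The .xml.gz body may arrive compressed; decompress it before parsing
    body = context.http_response.read()
    if body[:2] == b'\x1f\x8b':  # gzip magic bytes
        body = gzip.decompress(body)

    selector = Selector(body=body)
    requests = [
        Request.from_url(url, label='company')
        for url in selector.xpath('//loc/text()').getall()
    ]

    await context.add_requests(requests)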

3. Extracting and saving data

Each company page contains a large amount of information. For demonstration purposes, we'll focus on the main fields: Company Name, Short Description, Website, and Location.

One of Crunchbase's advantages is that all data is stored in JSON format within the page:

[Image: JSON data embedded in the company page]

This significantly simplifies data extraction - we only need to use one XPath selector to get the JSON, and then apply jmespath to extract the needed fields:

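A sketch of the company handler might look like this; the XPath of the embedded JSON script tag and the jmespath expressions below are illustrative placeholders - inspect a company page to find the exact paths:

# routes.py (continued)

import json

import jmespath


@router.handler('company')
async def company_handler(context: ParselCrawlingContext) -> None:
    """Extract the target fields from the JSON embedded in the company page."""
    context.log.info(f'company_handler processing {context.request.url} ...')

    # The page ships its data as a JSON blob inside a script tag;
    # the selector below is a placeholder - check the real page markup
    json_text = context.selector.xpath('//script[@type="application/json"]/text()').get()
    if not json_text:
        return

    data = json.loads(json_text)

    await context.push_data(
        {
            # The jmespath expressions are placeholders for the real field paths
            'Company Name': jmespath.search('properties.title', data),
            'Short Description': jmespath.search('properties.short_description', data),
            'Website': jmespath.search('cards.company_about_fields.website.value', data),
            'Location': jmespath.search('cards.company_about_fields.location_identifiers[].value', data),
        }
    )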

The collected data is saved in Crawlee for Python's internal storage using the context.push_data method. When the crawler finishes, we export all collected data to a JSON file with the export_data_json call shown in main.py above.

4. Running the project

With all components in place, we need to create an entry point for our crawler:

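The __main__.py created during project setup only needs to run main(); a minimal version looks like this:

# __main__.py

import asyncio

from .main import main

if __name__ == '__main__':
    asyncio.run(main())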

Execute the crawler from the project root using Poetry (the module name matches the package directory created during project setup):

    poetry run python -m crunchbase-crawlee

5. Characteristics of the sitemap crawler

The sitemap approach has its distinct advantages and limitations. It's ideal in the following cases:

  • When you need to collect data about all companies on the platform
  • When there are no specific company selection criteria
  • If you have sufficient time and computational resources

However, there are significant limitations to consider:

  • Almost no ability to filter data during collection
  • Requires constant monitoring of Cloudflare blocks
  • Scaling the solution requires proxy servers, which increases project costs

Using search for scraping Crunchbase

The limitations of the sitemap approach might point to search as the next solution. However, Crunchbase applies tighter security measures to its search functionality compared to its public pages.

The key difference lies in how Cloudflare protection works. While we receive data before the challenges.cloudflare check when accessing a company page, the search API requires valid cookies that have passed this check.

Let's verify this in practice by opening a Crunchbase search URL in an Incognito window.

When analyzing the traffic, we'll see the following pattern:

[Image: Cloudflare blocking and then allowing the search request]

The sequence of events here is:

  1. First, the page is blocked with code 403
  2. Then the challenges.cloudflare check is performed
  3. Only after successfully passing the check do we receive data with code 200

Automating this process would require a headless browser capable of bypassing Cloudflare Turnstile. The current version of Crawlee for Python (v0.5.0) doesn't provide this functionality, although it's planned for future development.

You can extend the capabilities of Crawlee for Python by integrating Camoufox following this example.

Working with the official Crunchbase API

Crunchbase provides a free API with basic functionality. Paid subscription users get expanded data access. Complete documentation for available endpoints can be found in the official API specification.

1. Setting up API access

To start working with the API, follow these steps:

  1. Create a Crunchbase account
  2. Go to the Integrations section
  3. Create a Crunchbase Basic API key

Although the documentation states that key activation may take up to an hour, it usually starts working immediately after creation.

2. Configuring the crawler for API work

An important API feature is the rate limit: no more than 200 requests per minute, and in the free version this number is significantly lower. Taking this into account, let's configure ConcurrencySettings accordingly. Since we're working with the official API, we don't need to mask our HTTP client. We'll use the standard HttpxHttpClient with preset headers.

First, let's save the API key in an environment variable:

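For example (the variable name is just a convention - use whatever name your crawler reads):

    export CRUNCHBASE_API_KEY="your_api_key"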

Here's how the crawler configuration for working with the API looks:

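Here's a sketch of that configuration. The X-cb-user-key header comes from the Crunchbase v4 API documentation; the starting autocompletes URL and the query term are placeholders, and the rest mirrors the sitemap crawler:

# main.py

import os

from crawlee import ConcurrencySettings, HttpHeaders
from crawlee.crawlers import ParselCrawler
from crawlee.http_clients import HttpxHttpClient

from .routes import router

# Key created in the Integrations section, exported as an environment variable earlier
API_KEY = os.environ['CRUNCHBASE_API_KEY']


async def main() -> None:
    """The API crawler entry point."""
    # Stay well below the documented 200 requests/minute limit
    concurrency_settings = ConcurrencySettings(max_concurrency=1, max_tasks_per_minute=60)

    http_client = HttpxHttpClient(
        headers=HttpHeaders(
            {
                'accept': 'application/json',
                'X-cb-user-key': API_KEY,
            }
        ),
    )

    crawler = ParselCrawler(
        request_handler=router,
        concurrency_settings=concurrency_settings,
        http_client=http_client,
    )

    # Start from an autocomplete search - the query term is just an example
    await crawler.run(
        ['https://api.crunchbase.com/api/v4/autocompletes?query=apify&collection_ids=organizations']
    )

    await crawler.export_data_json('crunchbase_api_data.json')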

3. Processing search results

For working with the API, we'll need two main endpoints:

  1. get_autocompletes - for searching
  2. get_entities_organizations__entity_id - for getting data

First, let's implement search results processing:

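A sketch of the search handler follows. The response layout (entities[].identifier.permalink) and the field_ids parameter follow the v4 API documentation, but treat the exact paths as assumptions to verify against a live response:

# routes.py

import json

from crawlee import Request
from crawlee.crawlers import ParselCrawlingContext
from crawlee.router import Router

router = Router[ParselCrawlingContext]()

API_BASE = 'https://api.crunchbase.com/api/v4'
FIELD_IDS = 'identifier,short_description,website_url,location_identifiers'


@router.default_handler
async def autocomplete_handler(context: ParselCrawlingContext) -> None:
    """Turn autocomplete search results into entity detail requests."""
    context.log.info(f'autocomplete_handler processing {context.request.url} ...')

    data = json.loads(context.http_response.read())

    requests = [
        Request.from_url(
            f'{API_BASE}/entities/organizations/{entity["identifier"]["permalink"]}?field_ids={FIELD_IDS}',
            label='organization',
        )
        for entity in data.get('entities', [])
    ]

    await context.add_requests(requests)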

4. Extracting company data

After getting the list of companies, we extract detailed information about each one:

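A sketch of the entity handler; the field names (short_description, website_url, location_identifiers) exist in the v4 schema, though the free tier may not return all of them:

# routes.py (continued)


@router.handler('organization')
async def organization_handler(context: ParselCrawlingContext) -> None:
    """Extract the fields we care about from the entity response."""
    data = json.loads(context.http_response.read())
    properties = data.get('properties', {})

    await context.push_data(
        {
            'Company Name': properties.get('identifier', {}).get('value'),
            'Short Description': properties.get('short_description'),
            'Website': properties.get('website_url'),
            'Location': [
                loc.get('value') for loc in properties.get('location_identifiers', [])
            ],
        }
    )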

5. Advanced location-based search

If you need more flexible search capabilities, the API provides a special search endpoint. Here's an example of searching for all companies in Prague:

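A sketch of such a request against the searches/organizations endpoint. The predicate structure follows the v4 documentation, and the location UUID is a placeholder you would first resolve via the autocompletes endpoint:

# main.py (fragment)

import json

from crawlee import HttpHeaders, Request

# Placeholder - resolve the real UUID for Prague via the autocompletes endpoint
PRAGUE_LOCATION_UUID = '<prague-location-uuid>'

SEARCH_PAYLOAD = {
    'field_ids': ['identifier', 'short_description', 'website_url', 'location_identifiers'],
    'query': [
        {
            'type': 'predicate',
            'field_id': 'location_identifiers',
            'operator_id': 'includes',
            'values': [PRAGUE_LOCATION_UUID],
        }
    ],
    'limit': 50,
}

search_request = Request.from_url(
    'https://api.crunchbase.com/api/v4/searches/organizations',
    method='POST',
    payload=json.dumps(SEARCH_PAYLOAD).encode(),
    headers=HttpHeaders({'Content-Type': 'application/json'}),
    label='search',
)

# Pass the request object to the crawler instead of a plain URL:
# await crawler.run([search_request])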

For processing search results and pagination, we use the following handler:

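A sketch of such a handler follows. It pages through results with the after_id parameter of the searches endpoint; the response layout and the way the original payload is reused here are assumptions to verify against real data:

# routes.py (continued)

from crawlee import HttpHeaders


@router.handler('search')
async def search_handler(context: ParselCrawlingContext) -> None:
    """Save one page of search results and request the next page."""
    data = json.loads(context.http_response.read())
    entities = data.get('entities', [])

    for entity in entities:
        properties = entity.get('properties', {})
        await context.push_data(
            {
                'Company Name': properties.get('identifier', {}).get('value'),
                'Short Description': properties.get('short_description'),
                'Website': properties.get('website_url'),
                'Location': [
                    loc.get('value') for loc in properties.get('location_identifiers', [])
                ],
            }
        )

    # Paginate by re-sending the same query with after_id set to the last uuid
    if entities:
        payload = json.loads(context.request.payload or b'{}')
        payload['after_id'] = entities[-1]['uuid']

        await context.add_requests(
            [
                Request.from_url(
                    context.request.url,
                    method='POST',
                    payload=json.dumps(payload).encode(),
                    headers=HttpHeaders({'Content-Type': 'application/json'}),
                    label='search',
                    unique_key=f'search-{payload["after_id"]}',
                )
            ]
        )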

6. Free API limitations

The free version of the API has significant limitations:

  • Limited set of available endpoints
  • Autocompletes function only works for company searches
  • Not all data fields are accessible
  • Limited search filtering capabilities

Consider a paid subscription for production-level work. The API provides the most reliable way to access Crunchbase data, even with its rate constraints.

What’s your best path forward?

We've explored three different approaches to obtaining data from Crunchbase:

  1. Sitemap - for large-scale data collection
  2. Search - difficult to automate due to Cloudflare protection
  3. Official API - the most reliable solution for commercial projects

Each method has its advantages, but for most projects, I recommend using the official API despite its limitations in the free version.

The complete source code is available in my repository. Have questions or want to discuss implementation details? Join our Discord - our community of developers is there to help.
