Abandon ElasticSearch, GitHub builds a search engine from scratch! How to search the 200 million code warehouse?-AI-php.cn

Home

Technology peripherals

Abandon ElasticSearch, GitHub builds a search engine from scratch! How to search the 200 million code warehouse?

王林

Apr 12, 2023 pm 03:55 PM

githubsearch

In December 2021, GitHub released a technology preview and conducted a comprehensive optimization to address the problem of "nothing can be found" in GitHub code search.

Last November, at the GitHub Universe Developer Conference, the official released the public beta version, which mainly solves the problem of developers finding, reading and navigating. Code problem.

At the conference, someone asked an important question, What is the principle behind the improvement of "code search"?

Recently, GitHub published a blog explaining in detail the technical principles and system architecture behind the new model.

Build GitHub search engine from scratch

Simply put, behind the new search engine is a wheel that researchers rewrote in Rust. Specially optimized for code search, codename Blackbird.

At first glance, building a search engine from scratch may seem like a puzzling decision: Why start from scratch? Aren’t there already many open source solutions out there? Why waste any more energy building a new thing?

In fact, GitHub has been trying to use existing solutions to solve the search problem, but unfortunately, products for general text search are difficult to adapt to Search for "code". Because the indexing speed is too slow, the user experience is poor, and the number of servers required is large and the running costs are too high.

Although there are some newer open source projects specifically adapted to code search, they are still not suitable for a code base as large as GitHub.

Based on the above observations, GitHub developers have set three main goals and conclusions:

1. Users are searching You can get a new experience during the process. You can get answers by asking some code questions and iteratively searching, browsing, navigating (navigate) and reading the code.

#2. There are many differences between code search and general text search.

Developers write code to be understood by machines, so the code search process should take advantage of the structure and relevance of the code; and users may search for punctuation (e.g., operators such as periods, open brackets, etc. in the code); do not stem words in the code; do not remove stop words from the query; or use regular expressions to search.

3. GitHub’s scale does present a unique challenge.

The old version of the search engine used Elasticsearch, and when it was first deployed it took several months to index all the code on GitHub (there were about 800 at the time Ten thousand code libraries), but now the number of code repositories has exceeded 200 million, and these codes are not static: developers are constantly submitting, and the code is constantly changing, which is very challenging for building a search engine.

Currently in beta, users can search approximately 45 million code bases, containing 115TB of code and 15.5 billion documents.

To sum up, ready-made things cannot meet the needs, so build one from scratch.

Try Grep?

When searching, a commonly used tool is "grep". By entering an expression, you can match in the text, so why not just use grep to brute force the search problem?

To answer this question, you can first calculate how long it takes to match 115TB of code with ripgrep.

Abandon ElasticSearch, GitHub builds a search engine from scratch! How to search the 200 million code warehouse?

On a machine with an 8-core Intel CPU, ripgrep can be run in 2.769 seconds (about 0.6 GB/ sec/core) runs a regular expression query against a 13 GB file cached in memory.

After simple calculation, we can find that this method is not feasible for the current massive data: Assume that the code search program is running on a cluster with 32 servers, and each machine has 64 Core, even if all 115TB of code is put into memory and everything runs smoothly, it will take about 96 seconds for the 2,048 CPU cores to run "one" query, and there can only be one. Other users have to queue up, which means that only QPS is 0.01 can only be used with grep

, so brute force will not work, so we can only build an index first.

Search index (serach index)

Only when relevant information is pre-calculated in the form of an index can the search engine respond quickly when querying. To put it simply, The index is a key-value mapping. In the case of an inverted index, the key is a keyword and the value is the ordered list of document IDs in which the word appears.

In the code search task, the researchers used a special type of inverted index, the ngram index.

An ngram is a character sequence of length n. For example, n = 3 (trigams) means that the maximum length of the key can only be 3. For longer keys, it is necessary Cut according to length 3. For example, limits are divided into lim, imi, mit and its

When performing a search, the query results of multiple keys are combined and the string is obtained after merging. List of documents that appear

The next question is how to complete the construction of the index in a relatively reasonable time.

The researchers observed that Git uses content-addressed hashing and that there is actually quite a lot of duplicate content on GitHub, so the researchers proposed the following two methods to build indexes.

1. shard by Git blob object ID provides a good way to evenly distribute documents among shards, and can avoid duplication while being able to use it as needed Expand the number of shards at any time.

2. Model the index as a tree, and use differential encoding (delta encoding) to reduce the number of crawling and optimize the metadata in the index, where metadata The data includes a list of where the document appears (which path, branch, and repository) as well as information about these objects (repository name, owner, visibility, etc.). For some popular repositories, the amount of metadata can be quite large.

GitHub has also designed a system to keep query results consistent with the submitted code.

If a user searches the code base while a team member is pushing code, then the newly submitted documents should not be included in the search results until the system has fully processed the newly submitted documents, and Blackbird will make the commit query consistent sex as a core part of its design.

Blackbird

The following is the architecture diagram of the Blackbird search engine.

Abandon ElasticSearch, GitHub builds a search engine from scratch! How to search the 200 million code warehouse?

First, Kafka will provide events to specify the content of the index, and then there will be a large number of crawler programs. Interact with Git, which also has a service to extract symbols from the code; again use Kafka to index each shard and obtain the target document.

Although the system only responds to events like "git push" to fetch changes, there is some additional work required to ingest all code bases for the first time.

A key feature of this system is the optimization of the initial ingest order to take full advantage of incremental encoding.

GitHub uses a new probabilistic data structure to represent the similarity of code bases, and is calculated from a horizontal sequential traversal of a minimum spanning tree of the code base similarity graph. The order of ingest.

Based on the optimized ingest order, the construction process of the delta tree is to differentiate each code base from its parent code base, which also means that the system only needs to crawl the current code base Unique to blobs, crawling involves retrieving blob content from Git, parsing it to extract symbols, and creating documents that will serve as input to the index.

The files are then published to another Kafka topic, which is also where the data is partitioned between shards. Each shard uses a Kafka partition in the topic.

Using Kafka can decouple the index from crawl, and the sorting of messages in Kafka can also make the query results consistent.

The indexer shards then find these documents and build the index: tokenizing content, symbols and paths to construct ngram indices and other useful indices (language, owner, code base, etc.), and flushing the results to disk .

Finally, compact the shards and collapse the smaller index into a larger index, so that the query is more efficient and the movement is easier.

The life cycle of query

With the index, tracking the query through the system becomes simple. Each query is a regular expression, which can be written as " /arguments?/ org:rails lang:Ruby", which searches for a code written in Ruby language by the Rails organization.

Abandon ElasticSearch, GitHub builds a search engine from scratch! How to search the 200 million code warehouse?

##There is also a service between GitHub.com and shards, which is responsible for coordinating the reception of user queries and dispersing the requests to each host in the search cluster, and finally use Redis to manage disk space (quotas) and cache some access control data.

The front end accepts a user query and passes it to Blackbird, which then parses the query into an abstract syntax tree, rewrites it into the canonical language ID, and adds Mark the sentences with permissions and scope.

Abandon ElasticSearch, GitHub builds a search engine from scratch! How to search the 200 million code warehouse?

The next step will be to send n concurrent requests: one to each shard in the search cluster, and the system will A certain sharding strategy is to send query requests to each shard in the cluster.

Then, the query is transformed somewhat on each individual shard to find the information in the index.

Abandon ElasticSearch, GitHub builds a search engine from scratch! How to search the 200 million code warehouse?

Finally aggregate the results returned by all shards, reorder by score, filter (check permissions again), and return to top 100, then the front end of GitHub.com performs syntax highlighting, term highlighting, pagination, and finally we can present the results to the page.

In actual use, the p99 response time of a single shard is about 100ms. However, due to reasons such as aggregating responses, checking permissions, and syntax highlighting, the total response time is longer.

One query occupies one CPU core on the index server for 100 milliseconds, so the upper limit of a 64-core host is approximately 640 queries per second. Compared with the grep method (0.01 QPS), this method can be said to be quite fast.

Summary

After the complete system architecture is introduced, you can re-examine the scale of the problem.

GitHub's ingest pipeline can publish about 120,000 documents per second, so it will take about 36 hours to process all 15.5 billion documents; however, delta indexing can reduce More than 50% of the number of documents needed to be crawled allowed the entire process to re-index the entire corpus in about 18 hours.

Some breakthroughs have been made in terms of index scale. The initial content volume was 115TB. After deleting duplicate content and using incremental indexing, the amount of content was reduced to about 28TB.

And the index itself is only 25TB, which includes not only all indexes (including ngrams), but also compressed copies of all unique content, which also means that the total index size including content is approximately Only a quarter of the original data size!

The above is the detailed content of Abandon ElasticSearch, GitHub builds a search engine from scratch! How to search the 200 million code warehouse?. For more information, please follow other related articles on the PHP Chinese website!

Statement

This article is reproduced at:51CTO.COM. If there is any infringement, please contact admin@php.cn delete

How to Run LLM Locally Using LM Studio? - Analytics VidhyaApr 19, 2025 am 11:38 AM

Running large language models at home with ease: LM Studio User Guide In recent years, advances in software and hardware have made it possible to run large language models (LLMs) on personal computers. LM Studio is an excellent tool to make this process easy and convenient. This article will dive into how to run LLM locally using LM Studio, covering key steps, potential challenges, and the benefits of having LLM locally. Whether you are a tech enthusiast or are curious about the latest AI technologies, this guide will provide valuable insights and practical tips. Let's get started! Overview Understand the basic requirements for running LLM locally. Set up LM Studi on your computer

Guy Peri Helps Flavor McCormick's Future Through Data TransformationApr 19, 2025 am 11:35 AM

Guy Peri is McCormick’s Chief Information and Digital Officer. Though only seven months into his role, Peri is rapidly advancing a comprehensive transformation of the company’s digital capabilities. His career-long focus on data and analytics informs

What is the Chain of Emotion in Prompt Engineering? - Analytics VidhyaApr 19, 2025 am 11:33 AM

Introduction Artificial intelligence (AI) is evolving to understand not just words, but also emotions, responding with a human touch. This sophisticated interaction is crucial in the rapidly advancing field of AI and natural language processing. Th

12 Best AI Tools for Data Science Workflow - Analytics VidhyaApr 19, 2025 am 11:31 AM

Introduction In today's data-centric world, leveraging advanced AI technologies is crucial for businesses seeking a competitive edge and enhanced efficiency. A range of powerful tools empowers data scientists, analysts, and developers to build, depl

AV Byte: OpenAI's GPT-4o Mini and Other AI InnovationsApr 19, 2025 am 11:30 AM

This week's AI landscape exploded with groundbreaking releases from industry giants like OpenAI, Mistral AI, NVIDIA, DeepSeek, and Hugging Face. These new models promise increased power, affordability, and accessibility, fueled by advancements in tr

Perplexity's Android App Is Infested With Security Flaws, Report FindsApr 19, 2025 am 11:24 AM

But the company’s Android app, which offers not only search capabilities but also acts as an AI assistant, is riddled with a host of security issues that could expose its users to data theft, account takeovers and impersonation attacks from malicious

Everyone's Getting Better At Using AI: Thoughts On Vibe CodingApr 19, 2025 am 11:17 AM

You can look at what’s happening in conferences and at trade shows. You can ask engineers what they’re doing, or consult with a CEO. Everywhere you look, things are changing at breakneck speed. Engineers, and Non-Engineers What’s the difference be

Rocket Launch Simulation and Analysis using RocketPy - Analytics VidhyaApr 19, 2025 am 11:12 AM

Simulate Rocket Launches with RocketPy: A Comprehensive Guide This article guides you through simulating high-power rocket launches using RocketPy, a powerful Python library. We'll cover everything from defining rocket components to analyzing simula

See all articles

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Assassin's Creed Shadows: Seashell Riddle Solution

3 weeks agoByDDD

What's New in Windows 11 KB5054979 & How to Fix Update Issues

2 weeks agoByDDD

Where to find the Crane Control Keycard in Atomfall

3 weeks agoByDDD

Assassin's Creed Shadows - How To Find The Blacksmith And Unlock Weapon And Armour Customisation

4 weeks agoByDDD

Roblox: Dead Rails - How To Complete Every Challenge

3 weeks agoByDDD

Hot Tools

DVWA

Damn Vulnerable Web App (DVWA) is a PHP/MySQL web application that is very vulnerable. Its main goals are to be an aid for security professionals to test their skills and tools in a legal environment, to help web developers better understand the process of securing web applications, and to help teachers/students teach/learn in a classroom environment Web application security. The goal of DVWA is to practice some of the most common web vulnerabilities through a simple and straightforward interface, with varying degrees of difficulty. Please note that this software

Zend Studio 13.0.1

Powerful PHP integrated development environment

SublimeText3 Mac version

God-level code editing software (SublimeText3)

mPDF

mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),

PhpStorm Mac version

The latest (2018.2.1) professional PHP integrated development tool

Hot Topics

Where is the login entrance for gmail email?

7582

CakePHP Tutorial

1386

What is the format of the account name of steam

win11 activation key permanent

nyt connections hints and answers

117