Abandon Elasticsearch: GitHub builds a search engine from scratch! How do you search 200 million repositories?
In December 2021, GitHub released a technology preview of a comprehensively optimized code search experience, aimed at the long-standing complaint that "nothing can be found" with GitHub code search.
Last November, at the GitHub Universe developer conference, GitHub released the public beta, which mainly addresses the problems developers face when finding, reading, and navigating code.
At the conference, one important question kept coming up: what is the principle behind the improved "code search"?
Recently, GitHub published a blog post explaining in detail the technical principles and system architecture behind the new search engine.
Simply put, the new search engine is a wheel the engineers reinvented in Rust, optimized specifically for code search and codenamed Blackbird.
At first glance, building a search engine from scratch may seem like a puzzling decision: why start over? Aren't there already plenty of open source solutions? Why spend more energy building something new?
In fact, GitHub has repeatedly tried to solve its search problem with existing solutions, but unfortunately, products built for general text search adapt poorly to searching code: indexing is too slow, the user experience is poor, and the number of servers required is large and expensive to run.
Although some newer open source projects are purpose-built for code search, they still do not scale to a codebase as large as GitHub's.
Based on these observations, GitHub's developers settled on three main goals and conclusions:
1. Users should get a whole new experience while searching: they should be able to ask questions about code and get answers by iteratively searching, browsing, navigating, and reading code.
2. Code search differs from general text search in many ways.
Developers write code for machines to understand, so the code search process should take advantage of the structure and relevance of code. Users may also want to search for punctuation (for example, operators such as periods and open parentheses); they do not want words in code to be stemmed; they do not want stop words stripped from their queries; and they may want to search with regular expressions.
3. GitHub's scale presents a truly unique challenge.
The old version of the search engine used Elasticsearch, and when it was first deployed it took several months to index all the code on GitHub (at the time there were about 8 million repositories). Today the number of repositories exceeds 200 million, and that code is not static: developers push constantly and the code keeps changing, which makes building a search engine very challenging.
Currently in beta, users can search approximately 45 million repositories, comprising 115 TB of code and 15.5 billion documents.
To sum up: off-the-shelf solutions could not meet the need, so GitHub built one from scratch.
A commonly used search tool is grep: you enter an expression, and it matches against text. So why not just brute-force the search problem with grep?
To answer that question, first calculate how long it would take to match against 115 TB of code with ripgrep.
On a machine with an 8-core Intel CPU, ripgrep can run a regular-expression query against a 13 GB file cached in memory in 2.769 seconds, or about 0.6 GB/sec/core.
A simple calculation shows that this approach is infeasible for today's volume of data. Suppose the code search program runs on a cluster of 32 servers, each with 64 cores. Even with all 115 TB of code in memory and everything running smoothly, those 2,048 CPU cores would need about 96 seconds to run a single query, and only one query can run at a time; everyone else has to queue, which works out to a QPS of 0.01 for grep.
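A minimal sketch of that back-of-the-envelope estimate, using only the numbers quoted above (the ~0.6 GB/sec/core figure comes from the ripgrep measurement):

```rust
fn main() {
    // Measured above: ripgrep scans roughly 0.6 GB per second per core.
    let gb_per_sec_per_core = 0.6;
    // Hypothetical cluster from the article: 32 servers x 64 cores.
    let cores = 32.0 * 64.0; // 2,048 cores
    let corpus_gb = 115_000.0; // 115 TB of code, all cached in memory

    let seconds_per_query = corpus_gb / (gb_per_sec_per_core * cores);
    // Prints ~94 s; the article rounds this to about 96 seconds.
    println!("one query: ~{seconds_per_query:.0} s");
    // Only one query can run at a time, so throughput is its reciprocal.
    println!("throughput ceiling: ~{:.2} QPS", 1.0 / seconds_per_query);
}
```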
So brute force will not work; the only option is to build an index first.
Only when the relevant information is precomputed in the form of an index can a search engine respond quickly at query time. Put simply, an index is a key-value mapping; in the case of an inverted index, the key is a keyword and the value is the ordered list of IDs of the documents in which that word appears.
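A minimal sketch of an inverted index as a key-value mapping (the document IDs and contents are made up for illustration, and documents are assumed to be added in increasing ID order):

```rust
use std::collections::HashMap;

type DocId = u32;

/// Inverted index: keyword -> ordered list of documents containing it.
#[derive(Default)]
struct InvertedIndex {
    postings: HashMap<String, Vec<DocId>>,
}

impl InvertedIndex {
    /// Assumes documents arrive in increasing ID order,
    /// which keeps each posting list sorted and deduplicated.
    fn add_document(&mut self, id: DocId, content: &str) {
        for term in content.split_whitespace() {
            let list = self.postings.entry(term.to_string()).or_default();
            if list.last() != Some(&id) {
                list.push(id);
            }
        }
    }

    fn lookup(&self, term: &str) -> &[DocId] {
        self.postings.get(term).map_or(&[], Vec::as_slice)
    }
}

fn main() {
    let mut idx = InvertedIndex::default();
    idx.add_document(1, "fn main limits");
    idx.add_document(2, "limits of grep");
    assert_eq!(idx.lookup("limits"), &[1, 2]);
}
```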
For the code search task, the researchers used a special kind of inverted index: an ngram index.
An ngram is a character sequence of length n. For example, n = 3 (trigrams) means the maximum key length is 3; longer keys must be cut into overlapping pieces of length 3. For example, "limits" is split into lim, imi, mit, and its.
When performing a search, the query results for the individual keys are combined, and after merging we obtain the list of documents in which the string appears.
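A minimal sketch of trigram extraction and of the merge step as a posting-list intersection (for simplicity, the input is assumed to be ASCII; this is illustrative, not Blackbird's actual code):

```rust
/// Split a string into overlapping trigrams: "limits" -> lim, imi, mit, its.
/// Byte-indexed slicing assumes ASCII input; real code must handle UTF-8.
fn trigrams(s: &str) -> Vec<&str> {
    (0..s.len().saturating_sub(2)).map(|i| &s[i..i + 3]).collect()
}

/// Intersect two sorted posting lists: a document is a candidate match
/// only if it appears in the posting list of every trigram of the query.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j, mut out) = (0, 0, Vec::new());
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}

fn main() {
    assert_eq!(trigrams("limits"), ["lim", "imi", "mit", "its"]);
    // Documents 3 and 8 contain both trigrams, so only they survive the merge.
    assert_eq!(intersect(&[1, 3, 5, 8], &[2, 3, 8, 9]), [3, 8]);
}
```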
The next question is how to build the index in a reasonable amount of time.
The researchers observed that Git uses content-addressable hashing, and that there is in fact a great deal of duplicate content on GitHub. These two observations led to the following two strategies for building the index.
1. Sharding by Git blob object ID distributes documents evenly across shards while avoiding duplication, and the number of shards can be expanded whenever needed (see the sketch after this list).
2. Model the index as a tree and use delta encoding (differential encoding) to reduce the amount of crawling and to optimize the metadata in the index. Here the metadata includes the list of places where a document appears (which path, branch, and repository) as well as information about those objects (repository name, owner, visibility, etc.). For popular repositories, the amount of metadata can be quite large.
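A minimal sketch of the first strategy. Because Git object IDs are content-addressed hashes, identical blobs share one OID, so they land on the same shard and are indexed only once; the shard count and the use of the leading 8 hex digits are illustrative assumptions, not GitHub's actual parameters:

```rust
const NUM_SHARDS: u64 = 32; // illustrative; not GitHub's real shard count

/// Map a Git blob object ID (a hex SHA) to a shard.
/// Content addressing means identical file contents produce the same OID,
/// so duplicates across repositories always map to the same shard.
fn shard_for(blob_oid: &str) -> u64 {
    // Hash output is uniform, so the leading 8 hex digits spread evenly.
    let prefix = u64::from_str_radix(&blob_oid[..8], 16).expect("hex OID");
    prefix % NUM_SHARDS
}

fn main() {
    let oid = "5716ca5987cbf97d6bb54920bea6adde242d87e6";
    println!("blob {oid} -> shard {}", shard_for(oid));
}
```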
GitHub also designed the system to keep query results consistent with committed code.
If a user searches a repository while a teammate is pushing code, the newly pushed documents should not appear in the results until the system has fully processed them; Blackbird makes commit-level query consistency a core part of its design.
The following is the architecture diagram of the Blackbird search engine.
First, Kafka provides events that specify what should be indexed. A large number of crawler processes then interact with Git, with a dedicated service to extract symbols from the code; Kafka is used once more to deliver documents to each shard for indexing.
Although the system mostly just responds to events like "git push" to fetch changes, some extra work is needed to ingest all repositories for the first time.
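A minimal sketch of what those two kinds of work might look like as events; the event shape and field names are assumptions for illustration, not GitHub's actual schema:

```rust
/// Hypothetical shape of the events the ingest pipeline consumes from Kafka.
enum IndexEvent {
    /// A push: crawl what changed in the repository at the new commit.
    Push { repo_id: u64, commit: String },
    /// First-time ingest: crawl an entire repository.
    FullIngest { repo_id: u64 },
}

fn handle(event: IndexEvent) {
    match event {
        IndexEvent::Push { repo_id, commit } => {
            println!("crawl repo {repo_id} at commit {commit}");
        }
        IndexEvent::FullIngest { repo_id } => {
            println!("crawl every blob of repo {repo_id}");
        }
    }
}

fn main() {
    handle(IndexEvent::Push { repo_id: 42, commit: "abc123".into() });
    handle(IndexEvent::FullIngest { repo_id: 7 });
}
```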
A key property of the system is that the initial ingest order is optimized to take full advantage of delta encoding.
GitHub uses a novel probabilistic data structure to represent repository similarity, and derives the ingest order from a level-order traversal of a minimum spanning tree of the repository-similarity graph.
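The article does not name the exact probabilistic structure, but MinHash is a standard sketch for estimating set similarity, so here is an illustrative MinHash-style estimate of how similar two repositories' blob sets are (all names and data are made up):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// MinHash signature of a set: for each of `num_hashes` seeded hash
/// functions, keep the minimum hash value over all items in the set.
fn minhash(items: &[&str], num_hashes: u64) -> Vec<u64> {
    (0..num_hashes)
        .map(|seed| {
            items
                .iter()
                .map(|item| {
                    let mut h = DefaultHasher::new();
                    (seed, item).hash(&mut h);
                    h.finish()
                })
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// The fraction of matching signature slots estimates Jaccard similarity.
fn similarity(a: &[u64], b: &[u64]) -> f64 {
    let matching = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matching as f64 / a.len() as f64
}

fn main() {
    // Blob OIDs of a repository and of a fork that shares most of its blobs.
    let repo = ["blob_a", "blob_b", "blob_c", "blob_d"];
    let fork = ["blob_a", "blob_b", "blob_c", "blob_e"];
    let (sa, sb) = (minhash(&repo, 128), minhash(&fork, 128));
    println!("estimated similarity: {:.2}", similarity(&sa, &sb)); // ~0.6
}
```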
Given the optimized ingest order, the delta tree is built by diffing each repository against its parent, which means the system only needs to crawl the blobs unique to that repository. Crawling involves retrieving the blob content from Git, parsing it to extract symbols, and creating the documents that will serve as input to the index.
These documents are then published to another Kafka topic, which is also where the data is partitioned between shards: each shard corresponds to one Kafka partition of the topic.
Using Kafka decouples indexing from crawling, and the ordering of messages within Kafka is also what makes query results consistent.
The indexer shards then consume these documents and build the index: tokenizing content, symbols, and paths to construct ngram indices and other useful indices (language, owner, repository, etc.), then flushing the results to disk.
Finally, the shards are compacted, collapsing the smaller indices into a larger one so that queries are more efficient and the index is easier to move around.
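A minimal sketch of the core of compaction: folding the sorted posting lists of a small index segment into a larger one (the segment layout is an assumption for illustration):

```rust
use std::collections::BTreeMap;

/// An on-disk index segment, simplified: term -> sorted document IDs.
type Segment = BTreeMap<String, Vec<u32>>;

/// Collapse a smaller segment into a larger one,
/// keeping every posting list sorted and deduplicated.
fn compact(mut big: Segment, small: Segment) -> Segment {
    for (term, mut docs) in small {
        let list = big.entry(term).or_default();
        list.append(&mut docs);
        list.sort_unstable();
        list.dedup();
    }
    big
}

fn main() {
    let big = Segment::from([("lim".to_string(), vec![1, 4])]);
    let small = Segment::from([
        ("lim".to_string(), vec![2]),
        ("its".to_string(), vec![2]),
    ]);
    let merged = compact(big, small);
    assert_eq!(merged["lim"], [1, 2, 4]);
    assert_eq!(merged["its"], [2]);
}
```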
With the index in place, tracing a query through the system is straightforward. Each query is a regular expression; one might write "/arguments?/ org:rails lang:Ruby" to search for Ruby code in the Rails organization.
Between GitHub.com and the shards sits a service responsible for coordinating user queries: it fans each request out to the hosts in the search cluster, and uses Redis to manage disk-space quotas and to cache some access-control data.
The front end accepts a user query and passes it to Blackbird, which parses the query into an abstract syntax tree, rewrites it with canonical language IDs, and tags clauses with permissions and scope.
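A minimal sketch of what such a query AST could look like for "/arguments?/ org:rails lang:Ruby"; the node and field names are assumptions for illustration:

```rust
/// Hypothetical AST for a Blackbird-style query.
#[derive(Debug)]
enum Query {
    /// A regular-expression term matched against file content.
    Regex(String),
    /// A qualifier such as org: or lang:.
    Qualifier { key: String, value: String },
    /// The implicit conjunction of all clauses in the query.
    And(Vec<Query>),
}

fn main() {
    let ast = Query::And(vec![
        Query::Regex("arguments?".into()),
        Query::Qualifier { key: "org".into(), value: "rails".into() },
        // "lang:Ruby" rewritten to a canonical language ID (value assumed).
        Query::Qualifier { key: "lang".into(), value: "ruby".into() },
    ]);
    println!("{ast:?}");
}
```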
The next step is to send n concurrent requests, one to each shard in the search cluster; that is, the system's sharding strategy is to send the query to every shard.
On each individual shard, the query is then transformed further so it can look information up in the index.
Finally, the results from all shards are aggregated, re-sorted by score, filtered (permissions are checked again), and trimmed to the top 100; the GitHub.com front end then performs syntax highlighting, term highlighting, and pagination before the results are rendered on the page.
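A minimal sketch of this scatter-gather step: fan the query out to every shard, merge the hits, re-sort by score, and keep the top 100 (the per-shard search is stubbed out, and the scores are invented):

```rust
use std::thread;

#[derive(Clone, Debug)]
struct Hit {
    doc_id: u32,
    score: f32,
}

/// Stub for a per-shard search; a real shard would consult its ngram indices.
fn search_shard(shard: usize, _query: &str) -> Vec<Hit> {
    vec![Hit { doc_id: shard as u32, score: 1.0 / (shard as f32 + 1.0) }]
}

fn scatter_gather(query: &str, num_shards: usize) -> Vec<Hit> {
    // Scatter: one concurrent request per shard.
    let handles: Vec<_> = (0..num_shards)
        .map(|shard| {
            let q = query.to_string();
            thread::spawn(move || search_shard(shard, &q))
        })
        .collect();

    // Gather: aggregate all shard results, re-sort by score, keep the top 100.
    let mut hits: Vec<Hit> = handles
        .into_iter()
        .flat_map(|h| h.join().expect("shard thread panicked"))
        .collect();
    hits.sort_by(|a, b| b.score.total_cmp(&a.score));
    hits.truncate(100);
    hits
}

fn main() {
    let top = scatter_gather("/arguments?/ org:rails lang:Ruby", 4);
    println!("{} hits, best: {:?}", top.len(), top.first());
}
```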
In practice, the p99 response time of a single shard is about 100 ms, but the total response time is longer because of aggregating responses, checking permissions, syntax highlighting, and so on.
One query occupies one CPU core on the index server for 100 milliseconds, so the ceiling for a 64-core host is roughly 640 queries per second. Compared with the grep approach (0.01 QPS), this method can fairly be called fast.
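The same throughput arithmetic, spelled out:

```rust
fn main() {
    let seconds_per_query = 0.1; // ~100 ms of one core per query
    let cores_per_host = 64.0;
    // Each core handles ~10 queries/sec, so a 64-core host peaks near 640 QPS.
    let qps_ceiling = cores_per_host / seconds_per_query;
    println!("~{qps_ceiling} QPS per host (vs ~0.01 QPS for brute-force grep)");
}
```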
Summary
With the complete system architecture laid out, we can re-examine the scale of the problem.
GitHub's ingest pipeline can publish about 120,000 documents per second, so working through all 15.5 billion documents would take about 36 hours. However, delta indexing reduces the number of documents that must be crawled by more than 50%, which lets the whole process re-index the entire corpus in about 18 hours.
There were also breakthroughs in index size. The initial content volume was 115 TB; after deduplicating content and applying delta indexing, that shrinks to roughly 28 TB.
And the index itself is only 25 TB, which includes not only all the indices (including the ngrams) but also compressed copies of all unique content. This means that the total index size including the content is only about a quarter of the original data size!