The simplest implementation of database

伊谢尔伦
2016-11-24 11:04:59

Among all application software, the database may be the most complex.

MySQL's manual has more than 3,000 pages, PostgreSQL's manual has more than 2,000 pages, and Oracle's manual is longer than both of them combined.

However, writing a very simple database yourself is not difficult. A post on Reddit explains the principle clearly in just a few hundred words. What follows is my summary of that post.

1. Save data in text form

The first step is to write the data you want to save into a text file. This text file is your database.

To make the data easy to read back, divide it into records of equal length. For example, if each record is 800 bytes long, the fifth record starts at byte offset 3200 (4 × 800), so it can be read directly without scanning the records before it.
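The fixed-length layout can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the sample names, and the choice to pad records with zero bytes are not from the original post.

```python
import os
import tempfile

RECORD_SIZE = 800  # bytes per record, as in the example above


def write_record(path, index, payload: bytes):
    """Store payload as record number `index` (0-based), padded to RECORD_SIZE."""
    if len(payload) > RECORD_SIZE:
        raise ValueError("payload does not fit in one record")
    mode = "r+b" if os.path.exists(path) else "w+b"
    with open(path, mode) as f:
        f.seek(index * RECORD_SIZE)  # record i starts at byte i * RECORD_SIZE
        f.write(payload.ljust(RECORD_SIZE, b"\x00"))


def read_record(path, index) -> bytes:
    """Jump straight to record `index` and return it without the zero padding."""
    with open(path, "rb") as f:
        f.seek(index * RECORD_SIZE)
        # Assumption: payloads never end in zero bytes, so stripping is safe.
        return f.read(RECORD_SIZE).rstrip(b"\x00")


db_path = os.path.join(tempfile.mkdtemp(), "people.db")
for i, name in enumerate([b"alice", b"bob", b"carol", b"dave", b"erin"]):
    write_record(db_path, i, name)

fifth = read_record(db_path, 4)  # the fifth record starts at byte 4 * 800 = 3200
```

Because every record has the same length, finding record number i needs one multiplication and one seek, however large the file is.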

Most of the time, we do not know the position of a record; we only know the value of its primary key. To read the data, we could compare the records one by one, but that is too inefficient. In practice, databases often store data in a B-tree format.

2. What is a B-tree?

To understand the B-tree, we must start from the binary search tree.

A binary search tree is a data structure with very high search efficiency. It has three characteristics.

(1) Each node has at most two subtrees.

(2) Every value in the left subtree is less than the value of the parent node, and every value in the right subtree is greater.

(3) When the tree is balanced, finding a target value among n nodes generally takes only about log2(n) comparisons.

The binary search tree is not well suited to databases, because its search efficiency depends on the number of levels. The deeper a value sits, the more comparisons are needed. In the extreme case, where every node has only one child, finding a value among n values takes n comparisons. For a database, every step down to a new level means reading data from the hard disk. This is very costly, because a disk read takes far longer than processing the data. The fewer times the database reads the hard disk, the better.
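The extreme case is easy to reproduce. The sketch below is a plain binary search tree without any rebalancing; the class and function names are illustrative, not from the original post. It counts how many nodes are visited during a search, which is the number that would translate into disk reads.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key without rebalancing and return the root of the tree."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right


def nodes_visited(root, key):
    """Count the nodes visited until key is found (None if it is absent)."""
    count, node = 0, root
    while node is not None:
        count += 1
        if key == node.key:
            return count
        node = node.left if key < node.key else node.right
    return None


balanced = None
for k in [4, 2, 6, 1, 3, 5, 7]:   # insertion order that yields a balanced tree
    balanced = insert(balanced, k)

degenerate = None
for k in [1, 2, 3, 4, 5, 6, 7]:   # sorted input: every node has only a right child
    degenerate = insert(degenerate, k)

print(nodes_visited(balanced, 7))    # 3
print(nodes_visited(degenerate, 7))  # 7
```

The same seven values need three visits in the balanced tree and seven in the degenerate one, which has effectively become a linked list.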

The B-tree is an improvement on the binary search tree. Its design idea is to keep related data together as much as possible, so that several values can be read at once and the number of hard disk operations is reduced.

[Figure: an example B-tree. The parent node holds the values 7 and 16 and has three child nodes; the largest node holds 4 values.]

B-tree also has three characteristics.

(1) A node can hold multiple values. For example, in the figure above, the largest node holds 4 values.

(2) A new layer is added only when the existing layers are full. In other words, a B-tree aims for as few "layers" as possible.

(3) The values in a child node have a strict size relationship with the values in the parent node. Generally speaking, if the parent node holds a values, it has a+1 child nodes. For example, in the figure above, the parent node holds two values (7 and 16), which correspond to three child nodes. The first child node holds values less than 7, the last child node holds values greater than 16, and the middle child node holds values between 7 and 16.
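The three characteristics above can be sketched as a search routine. This is a minimal, read-only illustration: the class name is made up, and the values in the child nodes are invented for the example (only the parent values 7 and 16 come from the text). Insertion and node splitting are left out.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted values held in this node
        self.children = children or []  # empty for a leaf, else len(keys) + 1 nodes


def search(node, target):
    """Return (found, nodes_read) while descending from node towards target."""
    nodes_read = 0
    while node is not None:
        nodes_read += 1
        # Position of the first key that is >= target within this node.
        i = bisect_left(node.keys, target)
        if i < len(node.keys) and node.keys[i] == target:
            return True, nodes_read
        if not node.children:
            return False, nodes_read
        # Child i holds the values between keys[i - 1] and keys[i].
        node = node.children[i]
    return False, nodes_read


# Parent node with two values and three child nodes (child values are hypothetical).
root = BTreeNode(
    [7, 16],
    [BTreeNode([1, 2, 5, 6]), BTreeNode([9, 12]), BTreeNode([18, 21])],
)

print(search(root, 12))  # (True, 2)
print(search(root, 10))  # (False, 2)
```

Each loop iteration corresponds to reading one node, so `nodes_read` never exceeds the number of layers in the tree.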

This data structure is very helpful in reducing the number of reads from the hard disk. If a node can hold 100 values, a 3-layer B-tree can hold about 1 million values. A binary search tree would need 20 layers to hold the same amount! If the operating system reads one node at a time and the root node stays in memory, the B-tree needs to read the hard disk only twice to find a target value among 1 million values.
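The arithmetic behind these figures can be checked with a short calculation. The function names are illustrative, and the sketch assumes every node is completely full.

```python
def btree_capacity(values_per_node, layers):
    """Maximum number of values in a full B-tree with the given number of layers."""
    children_per_node = values_per_node + 1
    # Layer 0 has 1 node, layer 1 has 101 nodes, layer 2 has 101 * 101 nodes, ...
    nodes = sum(children_per_node ** level for level in range(layers))
    return nodes * values_per_node


def binary_tree_layers(n):
    """Minimum number of layers a binary tree needs to hold n values (n >= 1)."""
    # A full binary tree with h layers holds 2**h - 1 values.
    return n.bit_length()


print(btree_capacity(100, 3))          # 1030300
print(binary_tree_layers(1_000_000))   # 20
```

With the root node kept in memory, a lookup in the 3-layer tree touches at most the two lower layers, which is where the two disk reads come from.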

3. Index

Storing the database in B-tree format only solves the problem of finding data by its "primary key". To search by any other field, you need to create an index.

An index is a B-tree file that uses a particular field as its key. Suppose there is an "employee table" with two fields: employee number (primary key) and name. An index file can be created for the names. This file stores the names in B-tree format, and each name is followed by its position in the database (that is, its record number). To search by name, first find the record number in the index, and then read that record from the table.
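The two-step lookup can be sketched as follows. To keep the example short, a sorted list searched with binary search stands in for the B-tree index file; that substitution, the sample employees, and all names in the code are assumptions made for illustration.

```python
from bisect import bisect_left

# Hypothetical employee table: the position in the list is the record number.
table = [
    (1001, "Wang"),
    (1002, "Alice"),
    (1003, "Zhao"),
    (1004, "Bob"),
]

# The index: (name, record number) pairs kept sorted by name.
name_index = sorted((name, recno) for recno, (_, name) in enumerate(table))


def find_by_name(name):
    """Look the name up in the index, then fetch that record from the table."""
    # (name, -1) sorts before every real entry with the same name.
    i = bisect_left(name_index, (name, -1))
    if i < len(name_index) and name_index[i][0] == name:
        recno = name_index[i][1]
        return table[recno]
    return None


print(find_by_name("Bob"))     # (1004, 'Bob')
print(find_by_name("Nobody"))  # None
```

The table itself stays in primary-key order; only the small index has to be kept sorted by name, and it must be updated whenever a record is added or a name changes.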

This index-based lookup is called the "Indexed Sequential Access Method", abbreviated as ISAM. Several implementations already exist (such as the C-ISAM and D-ISAM libraries). With one of these libraries, you can write a very simple database yourself.

4. Advanced functions

Once the most basic data access (including indexing) is in place, some advanced functions can be implemented as well.

(1) SQL is the common language for operating databases, so a SQL parser is needed to translate SQL commands into the corresponding ISAM operations.

(2) A join connects two tables in the database through "foreign keys". This operation needs to be optimized.

(3) A transaction is a series of database operations performed as one batch. If any step fails, the whole operation fails. An "operation log" is therefore needed so that the completed steps can be rolled back on failure.

(4) Backup mechanism: save a copy of the database.

(5) Remote operation: allow users to operate the database from other machines over TCP/IP.
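As an illustration of item (3), the sketch below keeps an operation log of previous values and replays it backwards on failure. It is an in-memory toy under assumptions: the class, its methods, and the sample accounts are hypothetical, and a real database would write the log to disk before changing the data.

```python
class TinyStore:
    """In-memory key-value store with an undo log (hypothetical, for illustration)."""

    _MISSING = object()  # marks "the key did not exist before this write"

    def __init__(self):
        self.data = {}
        self.log = None  # list of (key, previous value) while a transaction is open

    def begin(self):
        self.log = []

    def put(self, key, value):
        if self.log is not None:
            # Record the old value before overwriting it.
            self.log.append((key, self.data.get(key, self._MISSING)))
        self.data[key] = value

    def commit(self):
        self.log = None  # the changes stay; the undo information is discarded

    def rollback(self):
        # Undo in reverse order so the oldest saved value is restored last.
        for key, previous in reversed(self.log or []):
            if previous is self._MISSING:
                self.data.pop(key, None)
            else:
                self.data[key] = previous
        self.log = None


store = TinyStore()
store.put("alice", 100)
store.put("bob", 50)

store.begin()
try:
    store.put("alice", store.data["alice"] - 30)  # step 1 succeeds
    raise RuntimeError("simulated failure before bob is credited")
except RuntimeError:
    store.rollback()  # step 1 is undone, so no money disappears

print(store.data)  # {'alice': 100, 'bob': 50}
```

The log only needs to remember what each key held before the transaction touched it; committing simply throws that information away.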

