
Limitations before a table can be sharded or partitioned

I am new to database system design. After reading many articles, I am really confused about the limit up to which we should keep one table without sharding or partitioning. I know it's really difficult to provide a general answer; things depend on many factors.

But when someone asks this question anyway:

If the number of rows is under a million and grows only by the thousands, the choice is simple. But things get trickier when queries involve millions or billions of rows.

Note: I didn't specify a latency number in the question. Please answer based on whatever latency you are comfortable with. Also, we are talking about structured data.

I'm not sure how to pin this down, but I can add 3 specific questions:

NOTE: Throughout this question, it is assumed that we will choose a SQL solution. Also, if a use case I give doesn't make logical sense, ignore it. The aim is to get a feel for the numbers.

Can anyone help me understand what the benchmark is? Any real numbers from a project you're currently working on would help, for example: "for a database this large, with this many queries, this is the latency we observe." Anything that helps me justify the size of a table for a given query volume at a given latency.

P粉190883225 (asked 277 days ago)


  • P粉401901266

    P粉401901266, 2024-01-17 09:55:18

    Some answers for MySQL. Since all databases are subject to disk speed, network latency, etc., other engines may be similar.

    • Regardless of the number of rows, a "point query" (getting a row using a suitable index) takes several milliseconds.
    • It is possible to write a SELECT that takes hours or even days to run. So you need to understand if the query is pathological like this. (I think this is an example of high "latency".)
    • "Sharding" is required when you cannot sustain the required number of writes on a single server.
    • Large reads can be scaled "infinitely" by using replication and sending reads to replicas.
    • PARTITIONing (especially in MySQL) has few uses. More details: Partition
    • INDEX is very important for performance.
    • For data warehouse applications, building and maintaining "summary tables" is critical for large-scale performance. (Some other engines have some built-in tools.)
    • Inserting a million rows per day is not a problem. (Of course, some schema designs can make it one.) Rule of thumb: 100/sec is probably not a problem; 1000/sec may be possible; after that it gets harder. More about: High-speed ingest
    • Network latency depends mainly on the distance between client and server: more than 200 milliseconds to reach the other side of the Earth, but under 1 millisecond if client and server are in the same building. If instead you mean how long it takes to run a query, some rules of thumb: about 10 ms for a simple query that must hit an HDD; about 1 ms for an SSD.
    • UUIDs and hashes are very detrimental to performance if the data is too large to be cached in RAM.
    • I didn't mention the read/write ratio because I prefer to judge reading and writing independently.
    • "Ten thousand reads per second" is difficult to achieve; I think few applications really need this. Or they can find a better way to achieve the same goal. How quickly can a user issue a query? Maybe one per second? How many users can be connected and active at the same time? Hundreds.
    • (My opinion) Most benchmarks are useless. Some benchmarks can show that one system is twice as fast as another. So what? Some benchmarks show that when you have more than a few hundred active connections, throughput stalls and latency tends toward infinity. So what? Capturing actual queries after the application has been running for a while is probably the best benchmark, but even its uses are limited.
    • Almost always a single table is better than a split table (multiple tables; partition; shard). If you have specific examples, we can discuss the pros and cons of table design.
    • Row size and data types - Large columns (TEXT/BLOB/JSON) are stored "off-record", thus [potentially] causing additional disk hits. Disk hits are the most expensive part of any query.
    • Active queries - After a few dozen, queries start to conflict with each other. (Think of a grocery store with lots of shoppers pushing carts: with "too many" shoppers, everyone takes a long time to finish.)
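A minimal, runnable sketch of the "point query" bullet above, using SQLite's EXPLAIN QUERY PLAN from Python as a stand-in for MySQL (the table, column, and index names are invented for illustration):

```python
import sqlite3

# Build a small table with a suitable index, then confirm that a point
# query is served by the index rather than a full table scan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_users_email ON users(email)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"user{i}") for i in range(10_000)],
)

# A "point query": one row, fetched via the index. Its cost stays roughly
# flat as the table grows, which is why row count alone rarely forces a shard.
detail = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM users WHERE email = ?",
    ("user42@example.com",),
).fetchall()[0][3]
print(detail)  # the plan text mentions idx_users_email, i.e. an index search
```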
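The high-speed ingest rule of thumb can also be sketched: committing rows in one batched transaction, rather than one commit per row, is what keeps thousands of inserts per second feasible (again SQLite in Python for a self-contained example; the sensor_log schema is made up):

```python
import sqlite3
import time

# Insert 100k rows as a single batched transaction; per-row commits are
# typically what kill ingest rates, not the inserts themselves.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_log (ts INTEGER, reading REAL)")

batch = [(i, i * 0.5) for i in range(100_000)]
t0 = time.perf_counter()
with conn:  # one transaction wraps the whole executemany()
    conn.executemany("INSERT INTO sensor_log VALUES (?, ?)", batch)
elapsed = time.perf_counter() - t0

count = conn.execute("SELECT COUNT(*) FROM sensor_log").fetchone()[0]
print(count, f"{elapsed:.3f}s")
```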

    When you get into large databases, they come in a few different types; each has some different characteristics.

    • Data warehouse (sensors, logs, etc.) - rows are appended to the "end" of the table; summary tables for efficient "reporting"; a huge "fact" table (with optional archiving of old chunks); a few "dimension" tables.
    • Search (products, web pages, etc.) - EAV schemas are a problem here; full-text indexes are often useful.
    • Banking, order processing - ACID guarantees and transaction handling are critical here.
    • Media (Images and Videos) -- How to store huge objects while making searches (etc.) reasonably fast.
    • "Find nearest" - needs a 2D index; SPATIAL or one of a few other techniques can work here.
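The "summary tables" idea from the data-warehouse bullet can be sketched like this (SQLite in Python for a runnable example; fact_sales and daily_sales are invented names):

```python
import sqlite3

# Roll raw "fact" rows up by day, so reports scan a handful of summary
# rows instead of the whole fact table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (day TEXT, amount REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", [
    ("2024-01-01", 10.0),
    ("2024-01-01", 5.0),
    ("2024-01-02", 7.5),
])

conn.execute("""
    CREATE TABLE daily_sales AS
    SELECT day, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM fact_sales
    GROUP BY day
""")

rollup = conn.execute("SELECT * FROM daily_sales ORDER BY day").fetchall()
print(rollup)  # [('2024-01-01', 2, 15.0), ('2024-01-02', 1, 7.5)]
```

In MySQL one would typically keep such a summary table up to date incrementally (e.g. with INSERT ... ON DUPLICATE KEY UPDATE) rather than rebuilding it from scratch.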
