search

Home  >  Q&A  >  body text

Limitations before a table can be sharded or partitioned

I am new to database system design. After reading many articles, I am really confused what is the limit we should have 1 table without sharding or partitioning. I know it's really difficult to provide a general answer, things depend on factors like

But when someone asks this question

If the number of rows is less than a million, and the row size increases by thousands, the choice is simple. But things get trickier when the selection involves millions or billions of rows.

Note: I didn't mention the delay number in the question. please Answer based on the number of delays you are comfortable with. Also, we are talking about structured data.

I'm not sure, but I can add 3 specific questions:

NOTE: Throughout this question, it is assumed that we will choose SQL solution. Also, if the use case provided doesn't make logical sense, ignore it. The aim is to acquire numerical knowledge.

Can anyone help me understand what the benchmark is? Any real numbers from the project you're currently working on will show that for a large database with so many queries, this is the latency that is observed. Anything that can help me justify the number of select tables for a certain number of queries for a specific latency.

P粉190883225P粉190883225317 days ago425

reply all(1)I'll reply

  • P粉401901266

    P粉4019012662024-01-17 09:55:18

    Some answers for MySQL. Since all databases are subject to disk space, network latency, etc. other engines may be similar.

    • Regardless of the number of rows, a "point query" (getting a row using a suitable index) takes several milliseconds.
    • It is possible to write a SELECT that takes hours or even days to run. So you need to understand if the query is pathological like this. (I think this is an example of high "latency".)
    • "Sharding" is required when you cannot sustain the required number of writes on a single server.
    • Large reads can be scaled "infinitely" by using replication and sending reads to replicas.
    • PARTITIONing (especially in MySQL) has few uses. More details: Partition
    • INDEX is very important for performance.
    • For data warehouse applications, building and maintaining "summary tables" is critical for large-scale performance. (Some other engines have some built-in tools.)
    • Inserting a million rows per day is not a problem. (Of course, some schema designs may cause this problem.) Rule of thumb: 100/sec may not be a problem; 1000/sec may be possible; after that it gets harder. More aboutHigh-speed ingest
    • Network latency mainly depends on the distance between the client and the server. It takes more than 200 milliseconds to reach the other side of the Earth. On the other hand, if the client and server are in the same building, the latency will be less than 1 millisecond. If on the other hand you are referring to how long it takes to run a query, then here are some rules of thumb: 10ms for a simple query that needs to hit the HDD disk; 1ms for an SSD.
    • UUIDs and hashes are very detrimental to performance if the data is too large to be cached in RAM.
    • I didn't mention the read/write ratio because I prefer to judge reading and writing independently.
    • "Ten thousand reads per second" is difficult to achieve; I think few applications really need this. Or they can find a better way to achieve the same goal. How quickly can a user issue a query? Maybe one per second? How many users can be connected and active at the same time? Hundreds.
    • (My opinion) Most benchmarks are useless. Some benchmarks can show that one system is twice as fast as another. so what? Some benchmarks show that when you have more than a few hundred active connections, throughput stalls and latency tends to infinity. so what. Capturing actual queries after the application has been running for a while is probably the best baseline. But its uses are still limited.
    • Almost always a single table is better than a split table (multiple tables; partition; shard). If you have specific examples, we can discuss the pros and cons of table design.
    • Row size and data type - Large columns (TEXT/BLOB/JSON) are stored "unlogged", thus [potentially] causing additional disk hits. Disk hits are the most expensive part of any query.
    • Active Queries - After a few dozen times, queries will conflict with each other. (Imagine a grocery store with lots of shoppers pushing shopping carts - "too many" shoppers and everyone taking a long time to finish.)

    When you get into large databases, they come in a few different types; each has some different characteristics.

    • Data warehouse (sensors, logs, etc.) - appended to the "end" of tables; summary tables for efficient "reporting"; huge "fact" tables (with optional chunked archive); certain "dimension tables".
    • Search (products, web pages, etc.) - EAV in question; full text is often useful.
    • Banking, Order Processing - This is very important for ACID functionality and the need to process transactions.
    • Media (Images and Videos) -- How to store huge objects while making searches (etc.) reasonably fast.
    • 'Find nearest' - requires a 2D index, SPATIAL or some technique here

    reply
    0
  • Cancelreply