
Select a subset of coordinates from a large set so that they appear evenly distributed over an area

I have a MySQL table with area and latitude/longitude columns. Each area has many locations, e.g. 20,000. Is there a way to select just a few of them, say 100, so that they appear roughly evenly distributed on the map?

The distribution does not have to be perfect; query speed is more important. If this is not possible directly in MySQL, a very fast algorithm for picking evenly distributed locations would also work.

Thanks in advance.

EDIT: Answering some requests from the comments. There isn't much else in the data to work with; it's just the area and the location coordinates, for example:

+-------+--------------+----------+-----------+------------+--------+--------+
| id    | area         | postcode | lat       | lon        | colour | size   |
+-------+--------------+----------+-----------+------------+--------+--------+
| 16895 | Athens       |    10431 | 37.983917 | 23.7293599 | red    | big    |
| 16995 | Athens       |    11523 | 37.883917 | 23.8293599 | green  | medium |
| 16996 | Athens       |    10432 | 37.783917 | 23.7293599 | yellow | small  |
| 17000 | Thessaloniki |    54453 | 40.783917 | 22.7293599 | green  | small  |
+-------+--------------+----------+-----------+------------+--------+--------+

There are also some columns with characteristics, but these are only used for filtering.

In the meantime, I did try selecting every nth row, and it seemed to work, albeit a bit slow:

SET @a = 0;
SELECT * FROM `locations` WHERE (@a := @a + 1) % 200 = 0;

Using ORDER BY RAND() also works, but it is likewise a bit slow.
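
A minimal sketch of that RAND() approach, assuming only the `locations` table and `area` column from the example above:

-- Random sample of ~100 rows for one area.
-- Slow on large tables, since ORDER BY RAND() has to sort every matching row.
SELECT *
  FROM locations
 WHERE area = 'Athens'
 ORDER BY RAND()
 LIMIT 100;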

Edit 2: It turns out that bringing the postcodes in is easy. With them, grouping by postcode seems to give a pleasing result. The only problem is that some areas are very large, with around 3,000 different postcodes, and picking just 100 of those can still leave many points clustered in one place, so it might require further processing in PHP.
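
For reference, a rough sketch of the postcode grouping, assuming the columns from the example table (MIN(id) is just one arbitrary way to pick a representative row per postcode):

-- One representative location per postcode within an area.
SELECT MIN(id) AS id_to_keep
  FROM locations
 WHERE area = 'Athens'
 GROUP BY postcode;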

Edit 3: Answering @RickJames' questions from the comments, so they are all in one place:

  1. Please define "evenly distributed": evenly spaced in latitude? No two points "close" to each other? Etc.
    • "Evenly distributed" was a poor choice of words. We just want to show some of the locations in an area, but not all clustered in one place.
  2. Is an "area" a rectangle? A hexagon? Or gerrymandered?
    • They can be thought of roughly as rectangles, but that doesn't really matter. I left out something important: we may also need to show the locations of several areas at once. Areas may be far apart from each other or adjacent (but not overlapping). In that case we want the ~100 samples distributed across those areas.
  3. Is "100 per area" fixed? Or could it be "about 100"?
    • It's not fixed, it's about 100, but if it doesn't look good we can change it.
  4. Is there an AUTO_INCREMENT id on the table? Are there gaps in the numbering?
    • Yes, there is an AUTO_INCREMENT id, and there can be gaps.
  5. Did the question change from "100 per area" to "1 per postcode"?
    • No, the problem is still the same: "show ~100 per area, but not all in the same place". It doesn't matter how that is achieved.
  6. What is the total number of rows, and how many rows are needed in the output?
    • The total number of rows depends on the area and the filter criteria, up to about 40k for an area. If the total exceeds 1000, we want to display only a random ~100; if it is 1000 or less, we can just show everything (see the sketch after this list).
  7. Do you need a different sample each time the query is run?
    • The same sample or a different sample (even with the same criteria) are both fine.
  8. Would you be willing to add a column to the table?
    • It's not up to me, but with a good argument we can probably add a new column.
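
As a sketch of the rule in point 6 above (table and column names are taken from the example; the RAND() < 100 / cnt condition gives roughly, not exactly, 100 rows for large areas):

-- Keep every row of a small area, but only about 100 random rows of a large one.
SELECT l.*
  FROM locations AS l
  JOIN (SELECT area, COUNT(*) AS cnt
          FROM locations
         GROUP BY area) AS c USING (area)
 WHERE l.area = 'Athens'
   AND (c.cnt <= 1000 OR RAND() < 100 / c.cnt);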

P粉262113569 · 258 days ago

All replies (1)

  • P粉982054449 (2024-02-27 00:20:37)

    Here is a method that should satisfy the goal:

    1. Preprocess the table, creating a new table with "duplicate" items removed.
    2. If the new table is small enough, a full scan of it may be fast enough.

    As for "duplicates", think of this as a rough way of detecting that two items fall in essentially the same position:

    SELECT  ROUND(latitude * 5),
            ROUND(longitude * 3),
            MIN(id) AS id_to_keep
        FROM tbl
        GROUP BY 1, 2;

    "5" and "3" can be adjusted up (or down) to retain more (or fewer) ids. Because of the latitude/longitude arrangement, "5" and "3" are different; this ratio is probably true for most temperate regions. (Use the same amounts near the equator, use larger rations at higher latitudes.)

    There is a small flaw: two very close items may still end up on opposite sides of a boundary created by the ROUNDs.

    How many rows does the original table have? How many rows did the above query generate? ( SELECT COUNT(*) FROM ( ... ) x; )
