Crossposted on my blog
You can read it here
Our schema
create table "post" ( id SERIAL PRIMARY KEY, title VARCHAR(255) NOT NULL, content TEXT NOT NULL ); create table "user" ( id SERIAL PRIMARY KEY, name VARCHAR(255) NOT NULL ) create table "post_like" ( id SERIAL PRIMARY KEY, post_id INTEGER NOT NULL REFERENCES post(id), user_id INTEGER NOT NULL REFERENCES user(id) )
Now we want to ensure that each user cannot like the same post more than once.
This can be prevented by:
- using a unique constraint on the pair post_id user_id columns of the post_like table.
- or removing the id column of the post_like table and use a composite primary key on post_id user_id
But, assuming we are at a point where duplicates are already there, we need to remove them.
Check if there are duplicates
select post_id, user_id, count(*) from post_like group by post_id, user_id having count(*) > 2 ;
| post_id | user_id | count | | ------- | ------- | ----- | | 3 | 2 | 2 |
This output tells us that user 2 has liked post 3 more than one time, specifically 2 times.
Remove duplicates
Now that we know that there are duplicates, we can remove them.
We split this process in two step:
- read duplicates
- remove duplicates (dry run)
- remove duplicates (real run)
Read duplicates
Transaction rollback
To test our queries without removing real data, until we are sure the query is correct, we use the transaction rollback feature.
By doing this our query will never be committed, is similar to the
"dry run" concept that you can find on other applications (like
rsync).
CTE
We use CTE because it provides a good DX.
With CTE, we can run a query, store the results in a temporary table, and then use the same table for subsequent queries.
This mental model is similar to what we usually do in coding by creating a temporary variable.The CTE syntax is
with <cte_name> as ( <query> ), <cte_name_2> as ( <query_2> -- here we can refernce <cte_name> ) <final_query> -- here we can refernce <cte_name> and <cte_name_2> </cte_name_2></cte_name></final_query></cte_name></query_2></cte_name_2></query></cte_name>
With both transaction and CTE, we can do the following:
begin; -- start transaction with duplicates_info as ( select row_number() over ( partition by post_id, user_id order by user_id ) as group_index, id, post_id, user_id from post_like ) select * from duplicates_info ; rollback; -- ends transaction discarding every changes to the database
| group_index | id | post_id | user_id | | ----------- | -- | ------- | ------- | | 1 | 1 | 1 | 1 | | 1 | 2 | 2 | 2 | | 1 | 3 | 3 | 2 | | 2 | 4 | 3 | 2 |
The latest row of results, where group_index is 2, means that this row is the second one in the group with post_id = 3 and user_id = 2.
What happens here with the syntax?
row_number() over (partition by ...) as group_index is a window function that, first group rows by the columns in the partition by clause, and then assigns a number to each row, based on the index of the row in the group.
partition is similar to group by, because it groups the rows by a common column, but if group by return only 1 row for each group, partition let us add new columns to the source table based on groups.
group_index is a column name alias, regular sql syntax.
Filter only duplicates
Now let's keep only items with group_index > 1, which means that the row is not the first one in the group, or in other words, it is a duplicate.
create table "post" ( id SERIAL PRIMARY KEY, title VARCHAR(255) NOT NULL, content TEXT NOT NULL ); create table "user" ( id SERIAL PRIMARY KEY, name VARCHAR(255) NOT NULL ) create table "post_like" ( id SERIAL PRIMARY KEY, post_id INTEGER NOT NULL REFERENCES post(id), user_id INTEGER NOT NULL REFERENCES user(id) )
select post_id, user_id, count(*) from post_like group by post_id, user_id having count(*) > 2 ;
We need to remove only this row, with id 4.
Remove duplicates - dry run
Now rewite the final query so that we read from post_like table and not anymore from the cte duplicates_info.
We still use the cte duplicates_info to get the id of the duplicates.
| post_id | user_id | count | | ------- | ------- | ----- | | 3 | 2 | 2 |
We will see the records that we want to remove.
After we checked that they are correct, we swap select with delete.
with <cte_name> as ( <query> ), <cte_name_2> as ( <query_2> -- here we can refernce <cte_name> ) <final_query> -- here we can refernce <cte_name> and <cte_name_2> </cte_name_2></cte_name></final_query></cte_name></query_2></cte_name_2></query></cte_name>
This last query is what we finally want to execute.
But becuase we still have rollback statement, these chhanges are simulated, and not applied to the database.
Remove duplicates - real run
Finally we can remove the duplicates for real.
Here we use commit instead of rollback, so that the changes are applied to the database.
begin; -- start transaction with duplicates_info as ( select row_number() over ( partition by post_id, user_id order by user_id ) as group_index, id, post_id, user_id from post_like ) select * from duplicates_info ; rollback; -- ends transaction discarding every changes to the database
Final Code
| group_index | id | post_id | user_id | | ----------- | -- | ------- | ------- | | 1 | 1 | 1 | 1 | | 1 | 2 | 2 | 2 | | 1 | 3 | 3 | 2 | | 2 | 4 | 3 | 2 |
Conclusion
I write articles mainly to help future myself or to help the growth of tools I use in my work.
If this article was helpful to you leave a like.
Would you like me to talk about a particular topic?
Tell me in the comments !
The above is the detailed content of How to remove duplicates in Postgres SQL. For more information, please follow other related articles on the PHP Chinese website!

JavaScript core data types are consistent in browsers and Node.js, but are handled differently from the extra types. 1) The global object is window in the browser and global in Node.js. 2) Node.js' unique Buffer object, used to process binary data. 3) There are also differences in performance and time processing, and the code needs to be adjusted according to the environment.

JavaScriptusestwotypesofcomments:single-line(//)andmulti-line(//).1)Use//forquicknotesorsingle-lineexplanations.2)Use//forlongerexplanationsorcommentingoutblocksofcode.Commentsshouldexplainthe'why',notthe'what',andbeplacedabovetherelevantcodeforclari

The main difference between Python and JavaScript is the type system and application scenarios. 1. Python uses dynamic types, suitable for scientific computing and data analysis. 2. JavaScript adopts weak types and is widely used in front-end and full-stack development. The two have their own advantages in asynchronous programming and performance optimization, and should be decided according to project requirements when choosing.

Whether to choose Python or JavaScript depends on the project type: 1) Choose Python for data science and automation tasks; 2) Choose JavaScript for front-end and full-stack development. Python is favored for its powerful library in data processing and automation, while JavaScript is indispensable for its advantages in web interaction and full-stack development.

Python and JavaScript each have their own advantages, and the choice depends on project needs and personal preferences. 1. Python is easy to learn, with concise syntax, suitable for data science and back-end development, but has a slow execution speed. 2. JavaScript is everywhere in front-end development and has strong asynchronous programming capabilities. Node.js makes it suitable for full-stack development, but the syntax may be complex and error-prone.

JavaScriptisnotbuiltonCorC ;it'saninterpretedlanguagethatrunsonenginesoftenwritteninC .1)JavaScriptwasdesignedasalightweight,interpretedlanguageforwebbrowsers.2)EnginesevolvedfromsimpleinterpreterstoJITcompilers,typicallyinC ,improvingperformance.

JavaScript can be used for front-end and back-end development. The front-end enhances the user experience through DOM operations, and the back-end handles server tasks through Node.js. 1. Front-end example: Change the content of the web page text. 2. Backend example: Create a Node.js server.

Choosing Python or JavaScript should be based on career development, learning curve and ecosystem: 1) Career development: Python is suitable for data science and back-end development, while JavaScript is suitable for front-end and full-stack development. 2) Learning curve: Python syntax is concise and suitable for beginners; JavaScript syntax is flexible. 3) Ecosystem: Python has rich scientific computing libraries, and JavaScript has a powerful front-end framework.


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

SublimeText3 Chinese version
Chinese version, very easy to use

WebStorm Mac version
Useful JavaScript development tools

Zend Studio 13.0.1
Powerful PHP integrated development environment

SublimeText3 Linux new version
SublimeText3 Linux latest version

Dreamweaver CS6
Visual web development tools
