
oracle data deduplication

WBOY
2023-05-18 09:32:07

As enterprise data continues to grow, duplicate data becomes an important issue in database management. In an Oracle database, duplicate rows can lead to inaccurate query results, waste storage space, and degrade performance, so removing them is often necessary.

This article introduces several methods for deleting duplicate data in an Oracle database.

Method 1: Using subqueries and grouping

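Before deleting duplicate data, we first need to define what counts as a duplicate. In this article, two or more records are duplicates if they have the same values in every column that describes the entity. In the sample table below these columns are first_name, last_name and dept_id; the emp_id column may differ between duplicate rows.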

The following is a sample table containing duplicate data:

CREATE TABLE employee(
emp_id NUMBER(6),
first_name VARCHAR2(50),
last_name VARCHAR2(50),
dept_id NUMBER(4)
);

INSERT INTO employee(emp_id, first_name, last_name, dept_id) 
VALUES(1, 'John', 'Doe', 101);

INSERT INTO employee(emp_id, first_name, last_name, dept_id) 
VALUES(2, 'Jane', 'Doe', 102);

INSERT INTO employee(emp_id, first_name, last_name, dept_id) 
VALUES(3, 'John', 'Doe', 101);

INSERT INTO employee(emp_id, first_name, last_name, dept_id) 
VALUES(4, 'Bob', 'Smith', 103);
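
Before deleting anything, it can help to list which groups actually contain duplicates. The query below is a minimal sketch that assumes the employee table and sample rows above; the column alias copies is only an illustrative name.

```sql
-- List every (first_name, last_name, dept_id) combination that occurs more than once.
SELECT first_name, last_name, dept_id, COUNT(*) AS copies
FROM employee
GROUP BY first_name, last_name, dept_id
HAVING COUNT(*) > 1;
```

On the sample data this returns a single row: John, Doe, 101 with copies = 2.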

If we want to remove the duplicates and retain only one record for each employee, we can use the following DELETE statement:

DELETE FROM employee
WHERE emp_id IN 
  (SELECT emp_id
   FROM (SELECT emp_id, 
                ROW_NUMBER() OVER (PARTITION BY first_name, last_name, dept_id ORDER BY emp_id) rn
         FROM employee)
   WHERE rn <> 1);

This statement uses a subquery in which the ROW_NUMBER function numbers the rows within each group of identical employees. The outer DELETE then removes every row whose number is not 1, that is, every row except the first one in its group.

The PARTITION BY clause groups rows that share the same first_name, last_name and dept_id, and the ORDER BY clause sorts the rows within each group by emp_id. Evaluating the ROW_NUMBER function on the sample data gives the following results:

EMP_ID | FIRST_NAME | LAST_NAME | DEPT_ID | RN
-------|------------|-----------|---------|-----
     1 | John       | Doe       |     101 |  1
     2 | Jane       | Doe       |     102 |  1
     3 | John       | Doe       |     101 |  2
     4 | Bob        | Smith     |     103 |  1

Here we can see that the two John Doe rows in department 101 (emp_id 1 and 3) fall into the same partition and receive the rn values 1 and 2. By deleting all rows where rn is not equal to 1, we remove the duplicate (emp_id 3) and keep one row for each employee.
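
If the table has no unique column such as emp_id to select rows by, a commonly used Oracle variant relies on the ROWID pseudocolumn instead. The following is a sketch under the same assumptions as above (the employee table, with duplicates defined by first_name, last_name and dept_id), not a replacement for the statement shown earlier.

```sql
-- Keep the row with the lowest ROWID in each group of duplicates
-- and delete every other row in that group.
DELETE FROM employee
WHERE ROWID NOT IN
  (SELECT MIN(ROWID)
   FROM employee
   GROUP BY first_name, last_name, dept_id);
```

Note that the surviving row is chosen by its physical address, so it is not necessarily the one with the lowest emp_id.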

Method 2: Use a temporary table

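Another method is to use a temporary table that stores the rows we want to retain. Note that SELECT DISTINCT over all four columns would not help here, because the duplicate rows differ in emp_id. Instead, we group by the columns that define a duplicate and keep the lowest emp_id in each group:

CREATE TABLE temp_employee AS 
SELECT MIN(emp_id) AS emp_id, first_name, last_name, dept_id
FROM employee
GROUP BY first_name, last_name, dept_id;

This statement selects one row for each unique combination of first_name, last_name and dept_id from the employee table and stores the result in a new table called temp_employee.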

Now we can delete all the rows in the employee table and move the rows in the temp_employee table back to the employee table using the following SQL statement:

DELETE FROM employee;

INSERT INTO employee(emp_id, first_name, last_name, dept_id) 
SELECT emp_id, first_name, last_name, dept_id
FROM temp_employee;

This deletes all rows from the employee table and then inserts the rows from the temp_employee table back into it. We have now removed the duplicate records and retained one row for each employee.
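
The DELETE and INSERT above run inside a transaction, so they can still be undone with ROLLBACK until they are committed. A possible way to finish the procedure, assuming the result has been checked and the helper table is no longer needed:

```sql
-- Make the DELETE and INSERT permanent once the result looks right.
COMMIT;

-- Remove the helper table created for this method.
DROP TABLE temp_employee;
```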

Method 3: Using CTE and ROW_NUMBER function

This is another method based on the ROW_NUMBER function, but it uses a common table expression (CTE). Oracle does not allow a DELETE statement to target a CTE directly, so the WITH clause is placed inside the subquery that selects the rows to delete:

DELETE FROM employee
WHERE emp_id IN (
  WITH emp AS (
    SELECT emp_id,
           ROW_NUMBER() OVER (PARTITION BY first_name, last_name, dept_id ORDER BY emp_id) rn
    FROM employee
  )
  SELECT emp_id
  FROM emp
  WHERE rn > 1
);

The CTE emp numbers the rows within each group of duplicates, so the first record in each group has rn = 1. The subquery returns the emp_id of every row with rn greater than 1, and the DELETE statement removes those rows from the employee table.

Conclusion

In an Oracle database, it is important to remove duplicate data. Duplicate data affects database performance, wastes storage space, and leads to inaccurate query results. This article explained several ways to remove duplicate data, including using subqueries and grouping, using a temporary table, and using a CTE with the ROW_NUMBER function. No matter which method you choose, be sure to back up your data before deleting records.
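
One simple way to take such a backup is to copy the table before running any DELETE. This is only a sketch: employee_backup is a hypothetical name, and a copy made this way holds the data but not the original table's indexes or most of its constraints.

```sql
-- Copy all current rows into a backup table (hypothetical name).
CREATE TABLE employee_backup AS
SELECT * FROM employee;

-- If a deduplication turns out to be wrong, the rows can be restored:
-- DELETE FROM employee;
-- INSERT INTO employee SELECT * FROM employee_backup;
```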
