Duplicates in Excel datasets are often a result of data entry errors, merging datasets, or importing data from multiple sources. The duplicates cause inaccurate results in data analysis and reporting.
You can use Power Query to remove duplicates in your dataset, ensuring it is clean, accurate, and ready for analysis and reporting.
I will show you how to remove duplicates from a single-column and multiple-column dataset using Power Query.
Use Power Query to Remove Duplicates from a One-Column Dataset
Consider the single-column dataset below with 10 records having duplicates: ‘John Smith’ and ‘Alice Johnson’ each appear twice, while ‘Sarah Taylor’ appears thrice.
You want to use Power Query to remove duplicates from the dataset, ensuring it has unique values.
Here’s how to do it:
- Select the dataset.
- Open the Data tab and click the From Table/Range option on the Get & Transform Data group.
The above step opens the Create Table dialog box with the dataset’s reference filled in and the ‘My table has headers’ checkbox selected.
- Click OK on the Create Table dialog box.
The above step converts the data range to a table (if it wasn’t a table already) and loads the data onto the Power Query Editor.
- You can use either of the options below to remove duplicates from the dataset.
Option #1: Use the Remove Duplicates Option on the Home Tab
On the Power Query Editor, click the Home tab, open the Remove Rows drop-down list on the Reduce Rows group, and select the Remove Duplicates option.
Option #2: Use the Remove Duplicates Option on the Shortcut Menu
Right-click the dataset’s header and select the Remove Duplicates option on the shortcut menu.
The above step removes duplicates from the dataset, reducing it from the original ten values (including duplicates) to six unique values.
You can open the View tab and select the ‘Column distribution’ option on the Data Preview group to see the count of unique values in the column.
Power Query visually displays the count of unique and distinct values in the column.
- On the Home tab, click the Close & Load button.
The above step closes the Power Query Editor and loads the dataset with unique values onto the destination worksheet.
Use Power Query to Remove Duplicates from a Multiple-Column Dataset
Consider the multi-column dataset below containing duplicate records for ‘John Smith,’ ‘Alice Johnson,’ and ‘Sarah Taylor.’
Note: In a multi-column dataset, a duplicate record is a row where all column values exactly match those in another row.
You want to use Power Query to remove duplicate records from the dataset, ensuring it has unique records.
Here’s how to do it:
- Follow steps 1 to 3 in the previous section titled, ‘Use Power Query to Remove Duplicates From Single-column Dataset.’
The steps above convert the data range into a table (if it wasn’t a table) and load it into the Power Query Editor with the first column preselected.
- Press and hold down the CTRL key and click the headers of the second and third columns.
- Do step 4 in the previous section to remove duplicates.
The above step removes duplicates from the dataset leaving unique records.
- On the Home tab, click the Close & Load button.
The above step closes the Power Query Editor and loads the dataset with unique records onto the destination worksheet.
Comparison of Excel’s and Power Query’s Remove Duplicates Options
Although you can use Excel’s built-in Remove Duplicates option to remove duplicates from a dataset, Power Query’s Remove Duplicates option is often a better choice because of the reasons in the comparison table below.
Excel’s Remove Duplicates Option | Power Query’s Remove Duplicates Option |
---|---|
Directly modifies the original data, risking data loss without a backup. | Preserves the original data; duplicates are removed only in Power Query. |
Requires manually reapplying the Remove Duplicates option after each dataset update. | Once you have set up the query, it only requires refreshing when the source data is updated. Ideal for dynamic datasets. |
Performance may be degraded when dealing with large datasets. | Efficiently processes large datasets without freezing or slowing down Excel. |
Lacks a history or log of operations, making changes harder to trace. | Provides a clear, step-by-step record of all transformations in the ‘Applied Steps’ pane, making it easy to review or modify. |
I have shown you how to remove duplicates from a single-column and multiple-column dataset using Power Query. I hope you found the tutorial helpful.
Other Excel articles you may also like: