But there are also a few caveats
Excel is a versatile application having grown far beyond its early versions as simply a spreadsheet solution. Employed as a record keeper, address book, forecasting tool, and much more, many people even use Excel in ways it was never intended.
If you use Excel a lot at home or in the office, you know that sometimes the Excel files can quickly become unwieldy due to the sheer number of records you are working with.
Luckily, Excel has built-in functions to help you find and remove duplicate records. Unfortunately, there are a few caveats to using these functions, so be careful or you may unknowingly delete records you didn’t intend to remove. Also, both the methods below instantly remove duplicates without letting you see what was removed.
I’ll also mention a way to highlight the rows that are duplicate first, so that you can see which ones will be removed by the functions before you run them. You have to use a custom Conditional Formatting rule in order to highlight a row that is entirely duplicate.
Remove Duplicates Function
Suppose that you use Excel to keep track of addresses and you suspect that you have duplicate records. Look at the example Excel worksheet below:
Notice that the “Jones” record appears twice. To remove such duplicate records, click on the Data tab on the Ribbon and locate the Remove Duplicates function under the Data Tools section. Click on Remove Duplicates and a new window opens.
Here you have to make a decision based on whether you use heading labels at the tops of your columns. If you do, select the option labeled My Data Has Headers. If you don’t use heading labels, you will use Excel’s standard column designations, such as Column A, Column B, etc.
For this example, we will choose column A only and click the OK button. The option window closes and Excel removes the second “Jones” record.
Of, course this was just a simple example. Any address records you keep using Excel are likely to be much more complicated. Suppose, for example, you have an address file that looks like this.
Notice that although there are three “Jones” records, only two are identical. If we used the procedures above to remove duplicate records, only one “Jones” entry would remain. In this case, we need to extend our decision criteria to include both first and last names found in Columns A and B respectively.
To do this, once again click on the Data tab on the Ribbon and then click on Remove Duplicates. This time, when the options window pops up, choose columns A and B. Click the OK button and notice that this time Excel removed only one of the “Mary Jones” records.
This is because we told Excel to remove duplicates by matching records based on Columns A and B rather than just Column A. The more columns you choose, the more criteria need to be met before Excel will consider a record to be a duplicate. Choose all the columns if you want to remove rows that are completely duplicate.
Excel will give you a message telling you how many duplicates were removed. It won’t, however, show you which rows were deleted! Scroll down to the last section to see how to highlight the duplicate rows first before running this function.
Advanced Filter Method
The second way to remove duplicates is to use the advanced filter option. First, select all the data in the sheet. Next, on the Data tab in the ribbon, click on Advanced in the Sort & Filter section.
In the dialog that pops up, make sure to check the Unique records only checkbox.
You can either filter the list in-place or you can copy the non-duplicate items to another part of the same spreadsheet. For some odd reason, you can’t copy the data to another sheet. If you want it on another sheet, first choose a location on the current sheet and then cut and paste that data into a new sheet.
With this method, you don’t even get a message stating how many rows were removed. The rows are removed and that’s it.
Highlight Duplicate Rows in Excel
If you want to see which records are duplicate before you remove them, you have to do a bit of manual work. Unfortunately, Excel doesn’t have a way to highlight rows that are entirely duplicate. It has a feature under conditional formatting that highlights duplicate cells, but this article is about duplicate rows.
The first thing you’ll need to do is add a formula in a column to the right of your set of data. The formula is simple: just concatenate all the columns for that row together.
= A1 & B1 & C1 & D1 & E1
In my example below, I have data in columns A thru F. However, the first column is an ID number, so I exclude that from my formula below. Make sure to include all the columns that have data you want to check for duplicates on.
I put that formula into column H and then dragged it down for all of my rows. This formula simply combines all the data in each column as one big piece of text. Now, skip over a couple of more columns and enter the following formula:
=COUNTIF($H$1:$H$34, $H1) > 1
Here we are using the COUNTIF function and the first parameter is the set of data we want to look at. For me, this was column H (which has the combine data formula) from row 1 to 34. It’s also a good idea to get rid of the header row before doing this.
You’ll also want to make sure you use the dollar sign ($) in front of the letter and number. If you have 1000 rows of data and your combined row formula is in column F, for example, your formula would look like this instead:
=COUNTIF($F$1:$F$1000, $F1) > 1
The second parameter only has the dollar sign in front of the column letter so that is locked, but we do not want to lock the row number. Again, you’ll drag this down for all your rows of data. It should look like this and the duplicate rows should have TRUE in them.
Now, let’s highlight the rows that have TRUE in them as those are the duplicate rows. First, select the entire worksheet of data by clicking on the little triangle at the top left intersection of rows and columns. Now go to the Home tab, then click on Conditional Formatting and click on New Rule.
In the dialog, click on Use a formula to determine which cells to format.
In the box under Format values where this formula is true:, enter the following formula, replacing P with your column that has the TRUE or FALSE values. Make sure to include the dollar sign in front of the column letter.
Once you have done that, click on Format and click on the Fill tab. Pick a color and that will be used to highlight the entire duplicate row. Click OK and you should now see the duplicate rows are highlighted.
If this didn’t work for you, start over and do it again slowly. It has to be done exactly right in order for all this to work. If you miss a single $ symbol along the way, it won’t work properly.
Caveats with Removing Duplicate Records
There are, of course, a few problems with letting Excel automatically remove duplicate records for you. First, you have to be careful of choosing too few or too many columns for Excel to use as the criteria for identifying duplicate records.
Too few and you may inadvertently delete records you need. Too many or including an identifier column by accident and no duplicates will be found.
Second, Excel always assumes that the first unique record it comes across is the master record. Any subsequent records are assumed to be duplicates. This is a problem if, for example, you failed to amend an address of one of the people in your file but instead created a new record.
If the new (correct) address record appears after the old (out-of-date) record, Excel will assume that the first (out-of-date) record to be the master and delete any subsequent records it finds. This is why you have to be careful how liberally or conservatively you let Excel decide what is or is not a duplicate record.
For those cases, you should use the highlight duplicate method I wrote about and manually delete the appropriate duplicate record.
Finally, Excel doesn’t ask you to verify whether you really want to delete a record. Using the parameters you choose (columns), the process is completely automated. This can be a dangerous thing when you have a huge number of records and you trust that the decisions you made were correct and allow Excel to automatically remove the duplicate records for you.
Also, be sure to check out our previous article on deleting blank lines in Excel. Enjoy!