How to **identify** duplicate rows in a sheet that have dissimilar values, such as multiple grades for the same student in the same subject from different teachers.

### Introduction

You might encounter a situation where you have duplicate data rows that you need to identify and deal with. This commonly occurs during **data collection cycles**, where multiple people might erroneously enter data about the same ‘thing’. Grades for students in a particular subject is a classic example.

Having realised that you have these **pseudo-duplicates**, where the majority of the row is the same but just one column/cell might be different, how do you use formulas to show you which duplicate rows need your attention automatically?

### Sorting to the rescue

It’s not as difficult as it might initially seem. Consider the rows below, which is a simplified version of what you might face.

Student | Name | Subject | Skill | Grade |
---|---|---|---|---|

1001 | Ann Q. Pace | English | Reading | B |

1001 | Ann Q. Pace | Maths | Numeracy | B |

1001 | Ann Q. Pace | English | Reading | A |

1002 | Bernard A. Hanson | English | Reading | C |

1003 | Adrienne Erickson | English | Reading | A+ |

1002 | Bernard A. Hanson | Maths | Numeracy | B |

1004 | Hakeem Faulkner | Maths | Numeracy | E |

1004 | Hakeem Faulkner | English | Reading | C |

1004 | Hakeem Faulkner | English | Reading | C- |

1003 | Adrienne Erickson | Maths | Numeracy | D |

Of course, with such a **small number of rows**, it’s easy to identify the two pairs of pseudo-duplicate entries (Rows 1 & 3, 8 & 9). But imagine if this dataset was for 200 students, with ten subjects and four skills in each subject. That’s **8,000 rows** already, which is **too many** to scan manually and accurately.

The first task is to ensure our data is sorted suitably. We need to order the data by **student**, then **subject** and finally **skill** (in that precedence). The data in the above it *almost* sorted, with just a few minor adjustments needed.

Student | Name | Subject | Skill | Grade |
---|---|---|---|---|

1001 | Ann Q. Pace | English | Reading | B |

1001 | Ann Q. Pace | English | Reading | A |

1001 | Ann Q. Pace | Maths | Numeracy | B |

1002 | Bernard A. Hanson | English | Reading | C |

1002 | Bernard A. Hanson | Maths | Numeracy | B |

1003 | Adrienne Erickson | English | Reading | A+ |

1003 | Adrienne Erickson | Maths | Numeracy | D |

1004 | Hakeem Faulkner | English | Reading | C |

1004 | Hakeem Faulkner | English | Reading | C- |

1004 | Hakeem Faulkner | Maths | Numeracy | E |

Once the data has sorted correctly, we can use two comparison formulas to identify which rows are about the same student/subject/skill, and which have dissimilar grades.

### Concatenating Keys

To compare our rows, we need to reduce them to two keys. The first needs to concatenate the student, subject and skill, and the second adds the grade to this mix.

To work out whether the current row is similar (same student, subject and skill) to the previous one, we need to compare the first of these keys. To work out if it has a different grade, we need to compare the second key to the previous row.

A duplicate row would have a **true** value for the first comparison (it is similar to the row before) and a **false** value for the second (the grade is not the same).

### Single Formula Solution

We can build a conditional statement to test this, which will show **true** if the row is a pseudo-duplicate (different grade) of the previous one. The formula can be simplified by just comparing the grade as the second part of the conditional test.

This outputs the following, correctly identifying the rows which are similar to the previous ones (apart from the grade). The sorts we implemented ensures that all the duplicates will be found.

Student | Name | Subject | Skill | Grade | Duplicate |
---|---|---|---|---|---|

1001 | Ann Q. Pace | English | Reading | B | |

1001 | Ann Q. Pace | English | Reading | A | TRUE |

1001 | Ann Q. Pace | Maths | Numeracy | B | |

1002 | Bernard A. Hanson | English | Reading | C | |

1002 | Bernard A. Hanson | Maths | Numeracy | B | |

1003 | Adrienne Erickson | English | Reading | A+ | |

1003 | Adrienne Erickson | Maths | Numeracy | D | |

1004 | Hakeem Faulkner | English | Reading | C | |

1004 | Hakeem Faulkner | English | Reading | C- | TRUE |

1004 | Hakeem Faulkner | Maths | Numeracy | E |

**Tagged **► Google, Sheet, Markbook

**Related **► Google Calendars on the Web (01 Sep 2018), Protecting Data with Google Drive (14 Aug 2018), Dealing with confidential information in schools (24 Jul 2018)