Hi, I have a data set to deduplicate. I am having trouble summarizing a complete set of deduplication rules, but here are a few records from it, along with how each should be deduplicated.
If any value of status1_date, status1, status2, status3, or status4 differs between records under the same ID, then all of those records should be kept.
For ID 1, rec_no 2 should be kept, because if all of those variable values are the same under the same ID, we don’t keep the record whose Result is “NA”.
For ID 2, rec_no 4 should be kept, because Result_date2 should be later than status1_date.
For ID 3, rec_no 5 and 6 should both be kept, because their status2 values are different.
For ID 4, rec_no 7 and 8 should both be kept, because their status2 values are different.
For ID 5, no need to dedup.
For ID 6, rec_no 10 and 11 should both be kept, because their status2 values are different.
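| rec_no | ID | status1_date | status1 | status2 | status3 | status4 | Result_date1 | Result_date2 | Result | To keep |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 10/22/2013 | 1 | 7 | No | 7 | 25-Oct-13 | 25-Oct-13 | NA | |
| 2 | 1 | 10/22/2013 | 1 | 7 | No | 7 | 25-Oct-13 | 25-Oct-13 | A1 | x |
| 3 | 2 | 2/25/2014 | 1 | 7 | Unknown | 0 | 27-Feb-14 | 15-Jan-14 | NA | |
| 4 | 2 | 2/25/2014 | 1 | 7 | Unknown | 0 | 27-Feb-14 | 27-Feb-14 | A1 | x |
| 5 | 3 | 2/25/2014 | 1 | 0 | Unknown | 0 | 25-Feb-14 | 15-Jan-14 | NA | x |
| 6 | 3 | 2/25/2014 | 1 | 7 | Unknown | 0 | 25-Feb-14 | 27-Feb-14 | A1 | x |
| 7 | 4 | 5/14/2014 | 5 | 0 | No | 0 | | | NA | x |
| 8 | 4 | 5/14/2014 | 5 | 7 | No | 0 | | | NA | x |
| 9 | 5 | 11/20/2013 | 5 | 7 | No | 1 | | | NA | x |
| 10 | 6 | 5/14/2014 | 5 | 0 | No | 0 | | | A1 | x |
| 11 | 6 | 5/14/2014 | 5 | 7 | No | 0 | | | A1 | x |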
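To make the rules concrete, here is a minimal Python sketch of one way they might be combined. It is only an illustration, not a complete rule set. It assumes that records are first grouped by ID plus the five status columns, that within a group of otherwise identical records the date rule is applied before the "NA" rule, that a blank Result_date2 does not disqualify a record, and that a rule is skipped when no record in the group passes it. The names `dedup`, `date_ok`, and `parse` are hypothetical helpers made up for this example.

```python
from collections import defaultdict
from datetime import datetime

STATUS_KEYS = ("status1_date", "status1", "status2", "status3", "status4")
COLUMNS = ("rec_no", "ID") + STATUS_KEYS + ("Result_date1", "Result_date2", "Result")

# Sample records from the post; blank date cells are empty strings.
ROWS = [
    (1, 1, "10/22/2013", 1, 7, "No", 7, "25-Oct-13", "25-Oct-13", "NA"),
    (2, 1, "10/22/2013", 1, 7, "No", 7, "25-Oct-13", "25-Oct-13", "A1"),
    (3, 2, "2/25/2014", 1, 7, "Unknown", 0, "27-Feb-14", "15-Jan-14", "NA"),
    (4, 2, "2/25/2014", 1, 7, "Unknown", 0, "27-Feb-14", "27-Feb-14", "A1"),
    (5, 3, "2/25/2014", 1, 0, "Unknown", 0, "25-Feb-14", "15-Jan-14", "NA"),
    (6, 3, "2/25/2014", 1, 7, "Unknown", 0, "25-Feb-14", "27-Feb-14", "A1"),
    (7, 4, "5/14/2014", 5, 0, "No", 0, "", "", "NA"),
    (8, 4, "5/14/2014", 5, 7, "No", 0, "", "", "NA"),
    (9, 5, "11/20/2013", 5, 7, "No", 1, "", "", "NA"),
    (10, 6, "5/14/2014", 5, 0, "No", 0, "", "", "A1"),
    (11, 6, "5/14/2014", 5, 7, "No", 0, "", "", "A1"),
]
records = [dict(zip(COLUMNS, row)) for row in ROWS]


def parse(text, fmt):
    """Return a date, or None for a blank cell."""
    return datetime.strptime(text, fmt).date() if text else None


def date_ok(rec):
    """True if Result_date2 is later than status1_date.

    Assumption: a blank Result_date2 does not disqualify a record.
    """
    status = parse(rec["status1_date"], "%m/%d/%Y")
    result = parse(rec["Result_date2"], "%d-%b-%y")
    return result is None or result > status


def dedup(records):
    """Return the rec_no values to keep, in ascending order."""
    # Records that differ in any status column land in different groups,
    # so every distinct status combination under an ID survives.
    groups = defaultdict(list)
    for rec in records:
        key = (rec["ID"],) + tuple(rec[k] for k in STATUS_KEYS)
        groups[key].append(rec)

    kept = []
    for group in groups.values():
        # Narrow each group rule by rule; skip a rule if nothing passes it.
        for rule in (date_ok, lambda r: r["Result"] != "NA"):
            preferred = [r for r in group if rule(r)]
            if preferred:
                group = preferred
        kept.extend(group)
    return sorted(r["rec_no"] for r in kept)


print(dedup(records))  # [2, 4, 5, 6, 7, 8, 9, 10, 11]
```

On the sample above this keeps exactly the rows marked "x", but other records in the full data set may need additional rules or a different rule order.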
Thank you.