
data set to deduplicate

Occasional Contributor
Posts: 8

Hi, I have a data set to deduplicate. I'm having trouble summarizing a complete set of deduplication rules, but here are a few records from it along with how they should be deduplicated.

If any value of status1_date, status1, status2, status3, or status4 differs under the same ID, then all records should be kept.

For ID 1, only rec_no 2 should be kept: when all other variable values under the same ID are the same, we don't keep the record with the "NA" Result.

For ID 2, rec_no 4 should be kept, because Result_date2 should be later than status1_date.

For ID 3, rec_no 5 and 6 should both be kept, because the status2 values are different.

For ID 4, rec_no 7 and 8 should both be kept, because the status2 values are different.

For ID 5, there is no need to dedup.

For ID 6, keep rec_no 10 and 11, because the status2 values are different.

rec_no  ID  status1_date  status1  status2  status3  status4  Result_date1  Result_date2  Result  To keep
1       1   10/22/2013    1        7        No       7        25-Oct-13     25-Oct-13     NA
2       1   10/22/2013    1        7        No       7        25-Oct-13     25-Oct-13     A1      x
3       2   2/25/2014     1        7        Unknown  0        27-Feb-14     15-Jan-14     NA
4       2   2/25/2014     1        7        Unknown  0        27-Feb-14     27-Feb-14     A1      x
5       3   2/25/2014     1        0        Unknown  0        25-Feb-14     15-Jan-14     NA      x
6       3   2/25/2014     1        7        Unknown  0        25-Feb-14     27-Feb-14     A1      x
7       4   5/14/2014     5        0        No       0                                    NA      x
8       4   5/14/2014     5        7        No       0                                    NA      x
9       5   11/20/2013    5        7        No       1                                    NA      x
10      6   5/14/2014     5        0        No       0                                    A1      x
11      6   5/14/2014     5        7        No       0                                    A1      x
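For anyone who wants to reproduce this, the table above can be read into a data set with a DATA step like the following (a sketch: the data set name `have` and the lowercase variable names are my own choices, taken from the column headers; the two date styles in the table are handled with separate informats):

```sas
/* Sample data from the table above; HAVE and the variable names are assumed */
data have;
  infile datalines dlm="," dsd;
  input rec_no id status1_date :mmddyy10. status1 status2
        status3 :$7. status4 result_date1 :date9. result_date2 :date9.
        result :$2.;
  format status1_date result_date1 result_date2 mmddyy10.;
datalines;
1,1,10/22/2013,1,7,No,7,25OCT2013,25OCT2013,NA
2,1,10/22/2013,1,7,No,7,25OCT2013,25OCT2013,A1
3,2,2/25/2014,1,7,Unknown,0,27FEB2014,15JAN2014,NA
4,2,2/25/2014,1,7,Unknown,0,27FEB2014,27FEB2014,A1
5,3,2/25/2014,1,0,Unknown,0,25FEB2014,15JAN2014,NA
6,3,2/25/2014,1,7,Unknown,0,25FEB2014,27FEB2014,A1
7,4,5/14/2014,5,0,No,0,.,.,NA
8,4,5/14/2014,5,7,No,0,.,.,NA
9,5,11/20/2013,5,7,No,1,.,.,NA
10,6,5/14/2014,5,0,No,0,.,.,A1
11,6,5/14/2014,5,7,No,0,.,.,A1
;
run;
```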

 

Thank you.




All Replies
Super User
Posts: 5,257

Re: data set to deduplicate

What other values can "Result" take?

First sort your data by id and Result descending (if NA is the "highest" value).

Then just run PROC SORT NODUPKEY with your change-tracking variables in the BY statement.
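A minimal sketch of that two-step approach (assuming the input data set is named `have`; NODUPKEY keeps the first record it encounters within each BY group, so the preliminary sort is what controls which duplicate survives):

```sas
/* Step 1: order records within each id so the record to keep comes first. */
/* Ascending Result puts "A1" before "NA"; use DESCENDING RESULT instead   */
/* if your priority runs the other way.                                    */
proc sort data=have;
  by id result;
run;

/* Step 2: NODUPKEY keeps only the first record for each combination of   */
/* the change-tracking BY variables; EQUALS (the default) preserves the   */
/* incoming order among ties, so the prioritized record survives.         */
proc sort data=have out=want nodupkey equals;
  by id status1_date status1 status2 status3 status4;
run;
```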

Data never sleeps
Occasional Contributor
Posts: 8

Re: data set to deduplicate

Thank you for the reply.

 

The values for "Result" are going to be formatted as "Normal", "Abnormal", and "No result" after the deduplication. I've tried the steps you suggested; they worked for part of the records, but not for cases like ID #2.

 

I'm thinking I probably have to split the data set into several sub data sets to dedup, then merge them back together after deduplicating.

Respected Advisor
Posts: 3,892

Re: data set to deduplicate

I believe you're mixing validation rules with de-duping. 

 

Maybe as a first step remove all records which don't comply with your validation rule, and only then de-dupe.

If it's only the date rule, you could add a WHERE clause to your PROC SORT and remove those records there.
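For example, assuming the dates are stored as SAS date values in a data set named `have` (variable names here follow the sample table), the date validation for cases like ID 2 could be folded into the sort like this:

```sas
/* Drop records that fail the date rule (Result_date2 earlier than        */
/* status1_date) before de-duping; the WHERE clause filters them out as   */
/* PROC SORT reads the data. The MISSING guard keeps rows with no result  */
/* dates, such as rec_no 7-11 in the sample.                              */
proc sort data=have out=want nodupkey;
  where missing(result_date2) or result_date2 >= status1_date;
  by id status1_date status1 status2 status3 status4;
run;
```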

Solution
‎10-29-2015 01:35 PM
Occasional Contributor
Posts: 8

Re: data set to deduplicate

Hi guys,

 

Just an FYI on how I ended up deduplicating.

 

I tried macros, arrays, DO loops, and DO WHILE/UNTIL with LEAVE/CONTINUE statements, but I could not make them work. So I first transposed the duplicated variables to compute a single value for the "once true, always true" ones. For the remaining variables, I ran PROC SORT by prioritized variables and selected the first record for each unique ID.
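The "sort by prioritized variables, then keep the first record per ID" step might look something like this (a sketch only, assuming a data set named `have` and that ascending Result gives the wanted priority; the transpose step for the "once true, always true" variables is not shown):

```sas
/* Order each id group so the highest-priority record comes first         */
/* ("A1" sorts before "NA" in ascending order).                           */
proc sort data=have;
  by id result;
run;

/* Keep only the first (highest-priority) record in each id group */
data want;
  set have;
  by id;
  if first.id;
run;
```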

 

Thank you.

 

YingKer from Los Angeles

☑ This topic is SOLVED.

