BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
CynthiaFan
Obsidian | Level 7

Hi, I have a data set to deduplicate, I have problem summarizing a complete deduplicating rules, but here’s a few records from it with how to deduplicate.

If any value of status1_date, status1, status2, status3, or status4 are different under the same ID, then all records should be kept.

For ID 1, rec_no 2 should be kept, because if all variable values under the same ID are the same, then we don’t keep the “NA” result.

For ID 2, rec_no 4 should be kept, because Result_date2 should be later then status_date.

For ID 3, rec_no 5 and 6 should be kept because status 2 are different.

For ID 4, rec_no 7 and 8 should be kept, because status 2 are different.

For ID 5, no need to dedup.

For ID 6, keep rec_no 10 and 11, because status 2 are different.

rec_no

ID

status1_date

status1

status2

status3

status4

Result_date1

Result_date2

Result

To keep

1

1

10/22/2013

1

7

No

7

25-Oct-13

25-Oct-13

NA

 

2

1

10/22/2013

1

7

No

7

25-Oct-13

25-Oct-13

A1

x

3

2

2/25/2014

1

7

Unknown

0

27-Feb-14

15-Jan-14

NA

 

4

2

2/25/2014

1

7

Unknown

0

27-Feb-14

27-Feb-14

A1

x

5

3

2/25/2014

1

0

Unknown

0

25-Feb-14

15-Jan-14

NA

x

6

3

2/25/2014

1

7

Unknown

0

25-Feb-14

27-Feb-14

A1

x

7

4

5/14/2014

5

0

No

0

 

 

NA

x

8

4

5/14/2014

5

7

No

0

 

 

NA

x

9

5

11/20/2013

5

7

No

1

 

 

NA

x

10

6

5/14/2014

5

0

No

0

 

 

A1

x

11

6

5/14/2014

5

7

No

0

 

 

A1

x

 

Thank you.

1 ACCEPTED SOLUTION

Accepted Solutions
CynthiaFan
Obsidian | Level 7

Hi guys,

 

Just an FYI of how I ended up dedupping.

 

I've been trying macro, array, do loop, statements of DO+WHILE/UNTIL/LEAVE/CONTINUE, but I could not make it. So I then first transposed duplicated variables to calculate an only value for those "once true, always true" values. For the rest of the other variables, I proc sort_ed by prioritized variables and select the first record for each unique ID.

 

Thank you.

 

YingKer from Los Angeles

View solution in original post

4 REPLIES 4
LinusH
Tourmaline | Level 20

What other values for "Result" can there be?

First sort your data according to id and Result DESC (if NA is the "highest" value).

Then just proc sort nodupkey using your change tracking variables in the BY statement.

Data never sleeps
CynthiaFan
Obsidian | Level 7

Thank you for the reply.

 

The values for the "Result" are going to be formatted to "Normal", "Abnormal", and "No result" after the duplicating. I've tried the measures you suggested, it worked for part of the records, but not for those like ID #2.

 

I'm thinking probably I have to separate the data set to several sub data sets to dedup. Then merge them back together after the dedupping.

Patrick
Opal | Level 21

I believe you're mixing validation rules with de-duping. 

 

May be in a first step remove all records which don't comply with your validation rule and only then de-dupe.

 

If it's only the date thing then you could have a where clause in your Proc Sort and remove the records there.

CynthiaFan
Obsidian | Level 7

Hi guys,

 

Just an FYI of how I ended up dedupping.

 

I've been trying macro, array, do loop, statements of DO+WHILE/UNTIL/LEAVE/CONTINUE, but I could not make it. So I then first transposed duplicated variables to calculate an only value for those "once true, always true" values. For the rest of the other variables, I proc sort_ed by prioritized variables and select the first record for each unique ID.

 

Thank you.

 

YingKer from Los Angeles

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 4 replies
  • 1227 views
  • 2 likes
  • 3 in conversation