BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
yfan777
Obsidian | Level 7

Hi, I have a data set to deduplicate, I have problem summarizing a complete deduplicating rules, but here’s a few records from it with how to deduplicate.

If any value of status1_date, status1, status2, status3, or status4 are different under the same ID, then all records should be kept.

For ID 1, rec_no 2 should be kept, because if all variable values under the same ID are the same, then we don’t keep the “NA” result.

For ID 2, rec_no 4 should be kept, because Result_date2 should be later then status_date.

For ID 3, rec_no 5 and 6 should be kept because status 2 are different.

For ID 4, rec_no 7 and 8 should be kept, because status 2 are different.

For ID 5, no need to dedup.

For ID 6, keep rec_no 10 and 11, because status 2 are different.

rec_no

ID

status1_date

status1

status2

status3

status4

Result_date1

Result_date2

Result

To keep

1

1

10/22/2013

1

7

No

7

25-Oct-13

25-Oct-13

NA

 

2

1

10/22/2013

1

7

No

7

25-Oct-13

25-Oct-13

A1

x

3

2

2/25/2014

1

7

Unknown

0

27-Feb-14

15-Jan-14

NA

 

4

2

2/25/2014

1

7

Unknown

0

27-Feb-14

27-Feb-14

A1

x

5

3

2/25/2014

1

0

Unknown

0

25-Feb-14

15-Jan-14

NA

x

6

3

2/25/2014

1

7

Unknown

0

25-Feb-14

27-Feb-14

A1

x

7

4

5/14/2014

5

0

No

0

 

 

NA

x

8

4

5/14/2014

5

7

No

0

 

 

NA

x

9

5

11/20/2013

5

7

No

1

 

 

NA

x

10

6

5/14/2014

5

0

No

0

 

 

A1

x

11

6

5/14/2014

5

7

No

0

 

 

A1

x

 

Thank you.

1 ACCEPTED SOLUTION

Accepted Solutions
yfan777
Obsidian | Level 7

Hi guys,

 

Just an FYI of how I ended up dedupping.

 

I've been trying macro, array, do loop, statements of DO+WHILE/UNTIL/LEAVE/CONTINUE, but I could not make it. So I then first transposed duplicated variables to calculate an only value for those "once true, always true" values. For the rest of the other variables, I proc sort_ed by prioritized variables and select the first record for each unique ID.

 

Thank you.

 

YingKer from Los Angeles

View solution in original post

4 REPLIES 4
LinusH
Tourmaline | Level 20

What other values for "Result" can there be?

First sort your data according to id and Result DESC (if NA is the "highest" value).

Then just proc sort nodupkey using your change tracking variables in the BY statement.

Data never sleeps
yfan777
Obsidian | Level 7

Thank you for the reply.

 

The values for the "Result" are going to be formatted to "Normal", "Abnormal", and "No result" after the duplicating. I've tried the measures you suggested, it worked for part of the records, but not for those like ID #2.

 

I'm thinking probably I have to separate the data set to several sub data sets to dedup. Then merge them back together after the dedupping.

Patrick
Opal | Level 21

I believe you're mixing validation rules with de-duping. 

 

May be in a first step remove all records which don't comply with your validation rule and only then de-dupe.

 

If it's only the date thing then you could have a where clause in your Proc Sort and remove the records there.

yfan777
Obsidian | Level 7

Hi guys,

 

Just an FYI of how I ended up dedupping.

 

I've been trying macro, array, do loop, statements of DO+WHILE/UNTIL/LEAVE/CONTINUE, but I could not make it. So I then first transposed duplicated variables to calculate an only value for those "once true, always true" values. For the rest of the other variables, I proc sort_ed by prioritized variables and select the first record for each unique ID.

 

Thank you.

 

YingKer from Los Angeles

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 4 replies
  • 1143 views
  • 2 likes
  • 3 in conversation