BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
avak
Calcite | Level 5

I’m working with a national databank with tens of thousands of observations, and I'm having issues with duplicate unique identifiers and the variables I want to keep. I came across the example below just by looking at a proc print of limited observations, but I have tens of thousands of observations since this is a national databank, so it’s not really feasible to look through all of them. I’d like to combine TBI = 1 and abuse = 1 in the same observation. How do I look for these discrepancies without combing through them by hand? Or at least a way to narrow down the observations I have to look at so I can then edit them by hand? I'm using SAS 9.4.

 

Screen Shot 2019-07-11 at 6.15.26 PM.png

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

You can use PROC SUMMARY (also known as MEANS) to find the MAX of a number of variables.

proc summary nway data=have ;
  by id ;
  var var1 var2 var3 ;
  output out=want max=;
run;

View solution in original post

9 REPLIES 9
heffo
Pyrite | Level 9

Totally missed your problem. One more go. So, you want to find duplicate rows that also have TBI =1 and Abuse = 1? 

 

data have;
	input INC_KEY TBI Abuse;
datalines;
12345 0 1 
12346 0 1
12346 1 1
12347 0 1
12347 1 0
;;; run; proc sql; create table want as
select *, count(INC_KEY) as duplicates from have group by INC_KEY having duplicates > 1 and TBI = 1 and abuse = 1; quit;

NB! This will only give you the rows where all of these things are true. So, this will only give you one row.

INC_KEY TBI Abuse duplicates

12346   1   1     2

 

You can also do a subquery:

proc sql;
	create table want as 
	select *, count(INC_KEY) as duplicates 
	from have 
	where INC_KEY in (select distinct INC_KEY from have where TBI = 1 and abuse = 1)
	group by INC_KEY
	having duplicates > 1;	
quit;

INC_KEY TBI Abuse duplicates

12346   0   1     2

12346   1   1     2

 

Tom
Super User Tom
Super User

So your variables are 0/1?

Just take the MAX value.

avak
Calcite | Level 5

Yes, they are yes/no variables. 1 = yes, 0 = no. How do I select the max values? 

Tom
Super User Tom
Super User

You can use PROC SUMMARY (also known as MEANS) to find the MAX of a number of variables.

proc summary nway data=have ;
  by id ;
  var var1 var2 var3 ;
  output out=want max=;
run;
avak
Calcite | Level 5

My variables are only 0/1 (yes/no). If I have duplicate unique identifiers (inc_key) but differing variables, is there a way to delete the duplicates and set the variables to the max value for any of the given observations of that individual inc_key? 

Tom
Super User Tom
Super User

Isn't that what the code I just posted is doing? 

 

You will need to adjust the code to match your dataset and variable names.  Since you didn't post any sample data there was no way for me to post exact code.

 

avak
Calcite | Level 5

Yes! Sorry, I didn't run the code correctly. Thank you so much! 

Reeza
Super User

You have a few ways to handle this type of issue, first is to understand why it's happening - usually reasons are things recorded over time so sometimes it's best to just take the latest version of the records. 

Here's one method I think works as well

 

data want;
update have(obs=0) have;
by INC_KEY;
run;

PS. In the future please paste your data as text, when you paste it as a picture we have to type it out to test any solutions. So this solution is untested. 

avak
Calcite | Level 5

It's happening because I'm merging several databases from the National Trauma Data Bank, and I'm creating variables off of ICD-9 codes. For every ICD-9 code, they list it under duplicate unique identifiers (the inc_key variable). For example, one observation will include a certain ICD-9 code, but the next one might not, even though both observations have the same inc_key. So I'm struggling to get rid of the duplicates without accidentally deleting important data. 

Sorry about not including text, will do that next time! Thanks!

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 9 replies
  • 1098 views
  • 2 likes
  • 4 in conversation