Hi all, I'm working on a de-duplication of a dataset and have run into a quandary. Here's how the data looks, enough to get the idea anyway:
Comp # ID# name ID#2 name2
1 24 Dave 26 Dave
2 26 Dave 24 Dave
etc.
The point of this dataset is to investigate whether 2 distinct IDs are actually the same person.
There are thousands of rows. Essentially, what I'm looking for is a way to remove that 2nd line (comp# 2). It should be removed because the comparison between ID 24 and 26 has already been made -- in comp# 1. The tricky part (for me anyway!) is that in that first case, the id# 24 was in the "id#" field, and 26 was in the id#2 field. So somehow I need to make a comparison across id1 and id2, so that who ever is checking out these potential duplicates does not investigate the same pair twice, even though the ID#s are in different places (it's the same comparison).
I can think of how to do this in Excel where I can check values against a range, but in SAS, no clue.. any help would be greatly appreciated!
Thank you.
Use operator.
Nevermind, just figured it out. I'll simply create a concat variable assembled according to the min and max of the two ID values. That will put two pairs in the same order each time, then I'll sort and keep the first only....
Yes, that sounds like it will work! If it does, mark your question as "answered" so that others don't spend time trying to come up with the same solution. Better yet, post your code and simply mark your own response as being the correct answer.
Actually, I spoke too soon. Many of my ID variables have letters in them, so SAS will not process the MIN or MAX parts.
Is there a function to alpha sort two strings (or #s), then select the first or last of that sorting?
Thanks.
It will make it easier if you give more complete example of your data. Include have and want.
Hope this makes it clearer:
Have:
comp# ID1 ID2
1 a1234 b1235
2 b1235 a1234
Want:
comp# ID1 ID2
1 a1234 b1235
The 2nd record has been removed, because the pair (a1234 and b1234) already exist, despite those values existing in different variables (id1 and id2).
if the ID1 and ID2 were always numeric (e.g. 1234 and 1235), I would make a concatenated variable of
concvar = compress(min(id1,id2) ||"|"|| max(id1,id2));
Which would result in both of the original rows having the concvar of "1234|1235". I would then sort by this, and keep only the first concvar.
But since my ID1 and ID2 vars have alphabetical characters in them, I cannot use MIN and MAX. Ideally I'd like a function that would compute "min" as the first element in an alphabetically sorted set.
Use operator.
Thank you, data_null_;!! It works perfectly.
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.