I have a data set with a number of observations that have duplicate unique values. Yes, I know- duplicate uniques. Lol. The entire data set is pairs of records with duplicate s_ids.
Anyway, I need to sort the out the duplicates in such a way that I can match them to another data set. Nice little separate bins would help my brain. The relevant fields are an s_id, a c_id and name as shown below:
data have; input s_id c_id $ mylastname $ want $ record_num; datalines; 111 Q11 JONES Y 1 111 Q11 JONES N 2 222 Q21 DOE Y 3 222 Q21 SMITH N 4 333 Q31 FRANKLIN Y 5 333 Q32 MADISON Y 6 444 Q41 POTTER Y 7 444 Q42 POTTER Y 8 555 . CLYDE Y 9 555 . CLYDE N 10 666 . PITT Y 11 666 . JOLIE Y 12 777 Q71 LIBERTY Y 13 777 . LIBERTY N 14 888 . TIMON Y 15 888 Q81 PUUMBA Y 16 ; run;
For this purpose, i have added the fields WANT and RECORD_NUM.
Walking through the pairs of records:
I want record 1 but not record 2. Those two records have the same s_id, the same c_id, and the same lastname. I just need one of them. I'll match this bin on s_id.
I want record 3 but not record 4. Those two records have the same s_id and the same c_id. c_id trumps lastname. I'll match on this bin c_id.
I want 5, 6, 7 and 8. c_id's are all different and trump whatever is going on in last name. I'll also match this bin on c_id.
I want 9 but not 10. No c_id to work with but the same lastname. We'll skip 10. I'll match this on s_id.
I'd like both 11 and 12. Again no c_id to work with but different last names. I might match on an s_id-name key. Don't know yet.
I'd like 13 and not 14. They are likely the same person but 13 has a c_id and is a more complete record.
I'd like 15. Match on s_id.
I'd like 16. Match on c_id.
I decided I could sort the records and maybe use the properties of first. and last. to bin them out. But I fell down doing it.
proc sort data = have; by s_id c_id mylastname; run; data values; set have; by s_id c_id mylastname; uidf = first.s_id; uidl = last.s_id; cidf = first.c_id; cidl = last.c_id; namef = first.mylastname; namel = last.mylastname; run;
Dataset VALUES (made above, shown below) shows what I am trying to say:
s_id c_id mylastname want record_num uidf uidl cidf cidl namef namel 111 Q11 JONES Y 1 1 0 1 0 1 0 111 Q11 JONES N 2 0 1 0 1 0 1 222 Q21 DOE Y 3 1 0 1 0 1 1 222 Q21 SMITH N 4 0 1 0 1 1 1 333 Q31 FRANKLIN Y 5 1 0 1 1 1 1 333 Q32 MADISON Y 6 0 1 1 1 1 1 444 Q41 POTTER Y 7 1 0 1 1 1 1 444 Q42 POTTER Y 8 0 1 1 1 1 1 555 CLYDE Y 9 1 0 1 0 1 0 555 CLYDE N 10 0 1 0 1 0 1 666 JOLIE Y 12 1 0 1 0 1 1 666 PITT Y 11 0 1 0 1 1 1 777 LIBERTY N 14 1 0 1 1 1 1 777 Q71 LIBERTY Y 13 0 1 1 1 1 1 888 TIMON Y 15 1 0 1 1 1 1 888 Q81 PUUMBA Y 16 0 1 1 1 1 1
I then thought, okay i can pick up the 1,0 1,0 records and the 1,0 1,1 records and skip the 1,0 0,1 records etc . So I started down that road:
data jones_doe potter_franklin_madison clyde;
set have;
by s_id c_id mylastname;
*grabbing the first record of a pair where s_id and c_id are the same;
if
not missing(c_id)
and first.s_id = 1
and last.s_id = 0
and first.c_id = 1
and last.c_id = 0
then output jones_doe;
* gets 1,3 leaves 2,4 match on s_id;
* grabbing both records of a pair if s_id is the same and c_id is different;
if
(not missing(c_id)
and first.s_id = 1
and last.s_id = 0
and first.c_id = 1
and last.c_id = 1)
or
(not missing(c_id)
and first.s_id = 0
and last.s_id = 1
and first.c_id = 1
and last.c_id = 1)
then output potter_franklin_madison;
* gets 5,6,7,8 also picks up 13,16 but might be better to do them another way match on c_id;
*grabbing the first record of a pair where s_id and lname are the same and but c_id is missing;
if
missing(c_id)
and first.s_id = 1
and last.s_id = 0
and first.mylastname = 1
and last.mylastname = 0
then output clyde;
* gets 9 leaves 10 match on s_id;
run;
This actually works and then it doesn't. It gets me a bin with 1 and 3 in it (jones_doe) skipping 2 and 4, a bin with 5, 6, 7, and 8 (potter_franklin_madison), and a bin with 9 (clyde) skipping 10.
But then I fall down. I can't seem to pick up 11, 12, 13, 15, and 16 in such a way that 14 doesn't come along for the ride.
So....
can you help?
Should I abandon this method and do something else/ What else?
Can I tweak this method and yet make it work? How to tweak?
Have i been completely obtuse in explaining what i am trying to do and you need more info?
Without testing it seems that you miss ELSE between IF statements:
data want;
set have;
by s_id c_id ...;
if <conditions-1> then ...; ELSE
if <conditions-2> then ...; ELSE
if <conditions-3> then ...;
else do; ...???... ; end;
run;
Not sure if you can do it in one step, but I tried this and it worked.
Only goes to one bin but it is a start
PROC SORT data=have;
BY s_id mylastname descending c_id;
RUN;
DATA have0;
SET have;
BY s_id mylastname descending c_id;
RETAIN c_id1;
IF first.mylastname THEN c_id1=" ";
IF c_id ne "" THEN c_id1=c_id;
RUN;
PROC SORT data=have0;
BY s_id c_id1 mylastname;
RUN;
DATA have1;
SET have0;
WHERE c_id1 ne "";
BY s_id c_id1;
IF first.c_id1;
RUN;
DATA have2;
SET have0;
WHERE c_id1="";
BY s_id c_id1 mylastname;
IF first.mylastname;
RUN;
DATA want;
SET have1 have2;
RUN;
@SwissC That gets close, I don't know why, but it does. Let me modify that, run it on the real data, and get back to you. I think I can match your have1 on c_id (I'd prefer to do the jones case on s_id but c_id may work) and I'll try your have2 on s_id-name key.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.