I have a data set with a number of observations that have duplicate unique values. Yes, I know- duplicate uniques. Lol. The entire data set is pairs of records with duplicate s_ids.
Anyway, I need to sort the out the duplicates in such a way that I can match them to another data set. Nice little separate bins would help my brain. The relevant fields are an s_id, a c_id and name as shown below:
data have; input s_id c_id $ mylastname $ want $ record_num; datalines; 111 Q11 JONES Y 1 111 Q11 JONES N 2 222 Q21 DOE Y 3 222 Q21 SMITH N 4 333 Q31 FRANKLIN Y 5 333 Q32 MADISON Y 6 444 Q41 POTTER Y 7 444 Q42 POTTER Y 8 555 . CLYDE Y 9 555 . CLYDE N 10 666 . PITT Y 11 666 . JOLIE Y 12 777 Q71 LIBERTY Y 13 777 . LIBERTY N 14 888 . TIMON Y 15 888 Q81 PUUMBA Y 16 ; run;
For this purpose, i have added the fields WANT and RECORD_NUM.
Walking through the pairs of records:
I want record 1 but not record 2. Those two records have the same s_id, the same c_id, and the same lastname. I just need one of them. I'll match this bin on s_id.
I want record 3 but not record 4. Those two records have the same s_id and the same c_id. c_id trumps lastname. I'll match on this bin c_id.
I want 5, 6, 7 and 8. c_id's are all different and trump whatever is going on in last name. I'll also match this bin on c_id.
I want 9 but not 10. No c_id to work with but the same lastname. We'll skip 10. I'll match this on s_id.
I'd like both 11 and 12. Again no c_id to work with but different last names. I might match on an s_id-name key. Don't know yet.
I'd like 13 and not 14. They are likely the same person but 13 has a c_id and is a more complete record.
I'd like 15. Match on s_id.
I'd like 16. Match on c_id.
I decided I could sort the records and maybe use the properties of first. and last. to bin them out. But I fell down doing it.
proc sort data = have; by s_id c_id mylastname; run; data values; set have; by s_id c_id mylastname; uidf = first.s_id; uidl = last.s_id; cidf = first.c_id; cidl = last.c_id; namef = first.mylastname; namel = last.mylastname; run;
Dataset VALUES (made above, shown below) shows what I am trying to say:
s_id c_id mylastname want record_num uidf uidl cidf cidl namef namel 111 Q11 JONES Y 1 1 0 1 0 1 0 111 Q11 JONES N 2 0 1 0 1 0 1 222 Q21 DOE Y 3 1 0 1 0 1 1 222 Q21 SMITH N 4 0 1 0 1 1 1 333 Q31 FRANKLIN Y 5 1 0 1 1 1 1 333 Q32 MADISON Y 6 0 1 1 1 1 1 444 Q41 POTTER Y 7 1 0 1 1 1 1 444 Q42 POTTER Y 8 0 1 1 1 1 1 555 CLYDE Y 9 1 0 1 0 1 0 555 CLYDE N 10 0 1 0 1 0 1 666 JOLIE Y 12 1 0 1 0 1 1 666 PITT Y 11 0 1 0 1 1 1 777 LIBERTY N 14 1 0 1 1 1 1 777 Q71 LIBERTY Y 13 0 1 1 1 1 1 888 TIMON Y 15 1 0 1 1 1 1 888 Q81 PUUMBA Y 16 0 1 1 1 1 1
I then thought, okay i can pick up the 1,0 1,0 records and the 1,0 1,1 records and skip the 1,0 0,1 records etc . So I started down that road:
data jones_doe potter_franklin_madison clyde;
set have;
by s_id c_id mylastname;
*grabbing the first record of a pair where s_id and c_id are the same;
if
not missing(c_id)
and first.s_id = 1
and last.s_id = 0
and first.c_id = 1
and last.c_id = 0
then output jones_doe;
* gets 1,3 leaves 2,4 match on s_id;
* grabbing both records of a pair if s_id is the same and c_id is different;
if
(not missing(c_id)
and first.s_id = 1
and last.s_id = 0
and first.c_id = 1
and last.c_id = 1)
or
(not missing(c_id)
and first.s_id = 0
and last.s_id = 1
and first.c_id = 1
and last.c_id = 1)
then output potter_franklin_madison;
* gets 5,6,7,8 also picks up 13,16 but might be better to do them another way match on c_id;
*grabbing the first record of a pair where s_id and lname are the same and but c_id is missing;
if
missing(c_id)
and first.s_id = 1
and last.s_id = 0
and first.mylastname = 1
and last.mylastname = 0
then output clyde;
* gets 9 leaves 10 match on s_id;
run;
This actually works and then it doesn't. It gets me a bin with 1 and 3 in it (jones_doe) skipping 2 and 4, a bin with 5, 6, 7, and 8 (potter_franklin_madison), and a bin with 9 (clyde) skipping 10.
But then I fall down. I can't seem to pick up 11, 12, 13, 15, and 16 in such a way that 14 doesn't come along for the ride.
So....
can you help?
Should I abandon this method and do something else/ What else?
Can I tweak this method and yet make it work? How to tweak?
Have i been completely obtuse in explaining what i am trying to do and you need more info?
Without testing it seems that you miss ELSE between IF statements:
data want;
set have;
by s_id c_id ...;
if <conditions-1> then ...; ELSE
if <conditions-2> then ...; ELSE
if <conditions-3> then ...;
else do; ...???... ; end;
run;
Not sure if you can do it in one step, but I tried this and it worked.
Only goes to one bin but it is a start
PROC SORT data=have;
BY s_id mylastname descending c_id;
RUN;
DATA have0;
SET have;
BY s_id mylastname descending c_id;
RETAIN c_id1;
IF first.mylastname THEN c_id1=" ";
IF c_id ne "" THEN c_id1=c_id;
RUN;
PROC SORT data=have0;
BY s_id c_id1 mylastname;
RUN;
DATA have1;
SET have0;
WHERE c_id1 ne "";
BY s_id c_id1;
IF first.c_id1;
RUN;
DATA have2;
SET have0;
WHERE c_id1="";
BY s_id c_id1 mylastname;
IF first.mylastname;
RUN;
DATA want;
SET have1 have2;
RUN;
@SwissC That gets close, I don't know why, but it does. Let me modify that, run it on the real data, and get back to you. I think I can match your have1 on c_id (I'd prefer to do the jones case on s_id but c_id may work) and I'll try your have2 on s_id-name key.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.