I have a data set with a number of observations that have duplicate unique values. Yes, I know- duplicate uniques. Lol. The entire data set is pairs of records with duplicate s_ids.
Anyway, I need to sort the out the duplicates in such a way that I can match them to another data set. Nice little separate bins would help my brain. The relevant fields are an s_id, a c_id and name as shown below:
data have;
input s_id c_id $ mylastname $ want $ record_num;
datalines;
111 Q11 JONES Y 1
111 Q11 JONES N 2
222 Q21 DOE Y 3
222 Q21 SMITH N 4
333 Q31 FRANKLIN Y 5
333 Q32 MADISON Y 6
444 Q41 POTTER Y 7
444 Q42 POTTER Y 8
555 . CLYDE Y 9
555 . CLYDE N 10
666 . PITT Y 11
666 . JOLIE Y 12
777 Q71 LIBERTY Y 13
777 . LIBERTY N 14
888 . TIMON Y 15
888 Q81 PUUMBA Y 16
;
run;
For this purpose, i have added the fields WANT and RECORD_NUM.
Walking through the pairs of records:
I want record 1 but not record 2. Those two records have the same s_id, the same c_id, and the same lastname. I just need one of them. I'll match this bin on s_id.
I want record 3 but not record 4. Those two records have the same s_id and the same c_id. c_id trumps lastname. I'll match on this bin c_id.
I want 5, 6, 7 and 8. c_id's are all different and trump whatever is going on in last name. I'll also match this bin on c_id.
I want 9 but not 10. No c_id to work with but the same lastname. We'll skip 10. I'll match this on s_id.
I'd like both 11 and 12. Again no c_id to work with but different last names. I might match on an s_id-name key. Don't know yet.
I'd like 13 and not 14. They are likely the same person but 13 has a c_id and is a more complete record.
I'd like 15. Match on s_id.
I'd like 16. Match on c_id.
I decided I could sort the records and maybe use the properties of first. and last. to bin them out. But I fell down doing it.
proc sort data = have;
by s_id c_id mylastname;
run;
data values;
set have;
by s_id c_id mylastname;
uidf = first.s_id;
uidl = last.s_id;
cidf = first.c_id;
cidl = last.c_id;
namef = first.mylastname;
namel = last.mylastname;
run;
Dataset VALUES (made above, shown below) shows what I am trying to say:
s_id c_id mylastname want record_num uidf uidl cidf cidl namef namel
111 Q11 JONES Y 1 1 0 1 0 1 0
111 Q11 JONES N 2 0 1 0 1 0 1
222 Q21 DOE Y 3 1 0 1 0 1 1
222 Q21 SMITH N 4 0 1 0 1 1 1
333 Q31 FRANKLIN Y 5 1 0 1 1 1 1
333 Q32 MADISON Y 6 0 1 1 1 1 1
444 Q41 POTTER Y 7 1 0 1 1 1 1
444 Q42 POTTER Y 8 0 1 1 1 1 1
555 CLYDE Y 9 1 0 1 0 1 0
555 CLYDE N 10 0 1 0 1 0 1
666 JOLIE Y 12 1 0 1 0 1 1
666 PITT Y 11 0 1 0 1 1 1
777 LIBERTY N 14 1 0 1 1 1 1
777 Q71 LIBERTY Y 13 0 1 1 1 1 1
888 TIMON Y 15 1 0 1 1 1 1
888 Q81 PUUMBA Y 16 0 1 1 1 1 1
I then thought, okay i can pick up the 1,0 1,0 records and the 1,0 1,1 records and skip the 1,0 0,1 records etc . So I started down that road:
data jones_doe potter_franklin_madison clyde;
set have;
by s_id c_id mylastname;
*grabbing the first record of a pair where s_id and c_id are the same;
if
not missing(c_id)
and first.s_id = 1
and last.s_id = 0
and first.c_id = 1
and last.c_id = 0
then output jones_doe;
* gets 1,3 leaves 2,4 match on s_id;
* grabbing both records of a pair if s_id is the same and c_id is different;
if
(not missing(c_id)
and first.s_id = 1
and last.s_id = 0
and first.c_id = 1
and last.c_id = 1)
or
(not missing(c_id)
and first.s_id = 0
and last.s_id = 1
and first.c_id = 1
and last.c_id = 1)
then output potter_franklin_madison;
* gets 5,6,7,8 also picks up 13,16 but might be better to do them another way match on c_id;
*grabbing the first record of a pair where s_id and lname are the same and but c_id is missing;
if
missing(c_id)
and first.s_id = 1
and last.s_id = 0
and first.mylastname = 1
and last.mylastname = 0
then output clyde;
* gets 9 leaves 10 match on s_id;
run;
This actually works and then it doesn't. It gets me a bin with 1 and 3 in it (jones_doe) skipping 2 and 4, a bin with 5, 6, 7, and 8 (potter_franklin_madison), and a bin with 9 (clyde) skipping 10.
But then I fall down. I can't seem to pick up 11, 12, 13, 15, and 16 in such a way that 14 doesn't come along for the ride.
So....
can you help?
Should I abandon this method and do something else/ What else?
Can I tweak this method and yet make it work? How to tweak?
Have i been completely obtuse in explaining what i am trying to do and you need more info?
... View more