Re: Sorting observations with duplicate values into various bins

HB · Posted 08-13-2019 12:36 PM

I have a data set with a number of observations that have duplicate unique values. Yes, I know- duplicate uniques. Lol. The entire data set is pairs of records with duplicate s_ids.

Anyway, I need to sort the out the duplicates in such a way that I can match them to another data set. Nice little separate bins would help my brain. The relevant fields are an s_id, a c_id and name as shown below:

data have;
input s_id c_id $ mylastname $ want $ record_num;
datalines;
111	Q11	JONES	Y	1
111	Q11	JONES	N	2
222	Q21	DOE	Y	3
222	Q21	SMITH	N	4
333	Q31	FRANKLIN	Y	5
333	Q32	MADISON	Y	6
444	Q41	POTTER	Y	7
444	Q42	POTTER	Y	8
555	.	CLYDE	Y	9
555	.	CLYDE	N	10
666	.	PITT	Y	11
666	.	JOLIE	Y	12
777	Q71	LIBERTY	Y	13
777	.	LIBERTY	N	14
888	.	TIMON	Y	15
888	Q81	PUUMBA	Y	16
;
run;

For this purpose, i have added the fields WANT and RECORD_NUM.

Walking through the pairs of records:

I want record 1 but not record 2. Those two records have the same s_id, the same c_id, and the same lastname. I just need one of them. I'll match this bin on s_id.

I want record 3 but not record 4. Those two records have the same s_id and the same c_id. c_id trumps lastname. I'll match on this bin c_id.

I want 5, 6, 7 and 8. c_id's are all different and trump whatever is going on in last name. I'll also match this bin on c_id.

I want 9 but not 10. No c_id to work with but the same lastname. We'll skip 10. I'll match this on s_id.

I'd like both 11 and 12. Again no c_id to work with but different last names. I might match on an s_id-name key. Don't know yet.

I'd like 13 and not 14. They are likely the same person but 13 has a c_id and is a more complete record.

I'd like 15. Match on s_id.

I'd like 16. Match on c_id.

I decided I could sort the records and maybe use the properties of first. and last. to bin them out. But I fell down doing it.

proc sort data = have;
	by s_id c_id mylastname;
run;


data values;
	set have;
	by s_id c_id mylastname;
	uidf = first.s_id;
	uidl = last.s_id;
	cidf = first.c_id; 
	cidl = last.c_id; 
	namef = first.mylastname;
	namel = last.mylastname;
run;

Dataset VALUES (made above, shown below) shows what I am trying to say:

s_id	c_id	mylastname	want	record_num	uidf	uidl	cidf	cidl	namef	namel
111	Q11	JONES		Y	1		1	0	1	0	1	0
111	Q11	JONES		N	2		0	1	0	1	0	1
222	Q21	DOE		Y	3		1	0	1	0	1	1
222	Q21	SMITH		N	4		0	1	0	1	1	1
333	Q31	FRANKLIN	Y	5		1	0	1	1	1	1
333	Q32	MADISON		Y	6		0	1	1	1	1	1
444	Q41	POTTER		Y	7		1	0	1	1	1	1
444	Q42	POTTER		Y	8		0	1	1	1	1	1
555		CLYDE		Y	9		1	0	1	0	1	0
555		CLYDE		N	10		0	1	0	1	0	1
666		JOLIE		Y	12		1	0	1	0	1	1
666		PITT		Y	11		0	1	0	1	1	1
777		LIBERTY		N	14		1	0	1	1	1	1
777	Q71	LIBERTY		Y	13		0	1	1	1	1	1
888		TIMON		Y	15		1	0	1	1	1	1
888	Q81	PUUMBA		Y	16		0	1	1	1	1	1

I then thought, okay i can pick up the 1,0 1,0 records and the 1,0 1,1 records and skip the 1,0 0,1 records etc . So I started down that road:

data jones_doe potter_franklin_madison clyde;
	set have;
		by s_id c_id mylastname;

		*grabbing the first record of a pair where s_id and c_id are the same;
		if 
			not missing(c_id)
			and first.s_id = 1 
			and last.s_id = 0 
			and first.c_id = 1 
			and last.c_id = 0 
				then output jones_doe;
				* gets 1,3  leaves 2,4   match on s_id;

		* grabbing both records of a pair if s_id is the same and c_id is different;
		if 
			(not missing(c_id)
			and	first.s_id = 1 
			and last.s_id = 0 
			and first.c_id = 1 
			and last.c_id = 1) 
			or
			(not missing(c_id)
			and first.s_id = 0 
			and last.s_id = 1 
			and first.c_id = 1 
			and last.c_id = 1) 
				then output potter_franklin_madison;
				* gets 5,6,7,8  also picks up 13,16 but might be better to do them another way    match on c_id;

		*grabbing the first record of a pair where s_id and lname are the same and but c_id is missing;
		if 
			missing(c_id)
			and first.s_id = 1 
			and last.s_id = 0 
			and first.mylastname = 1 
			and last.mylastname = 0 
				then output clyde;
				* gets 9  leaves 10   match on s_id;

run;

This actually works and then it doesn't. It gets me a bin with 1 and 3 in it (jones_doe) skipping 2 and 4, a bin with 5, 6, 7, and 8 (potter_franklin_madison), and a bin with 9 (clyde) skipping 10.

But then I fall down. I can't seem to pick up 11, 12, 13, 15, and 16 in such a way that 14 doesn't come along for the ride.

So....

can you help?

Should I abandon this method and do something else/ What else?

Can I tweak this method and yet make it work? How to tweak?

Have i been completely obtuse in explaining what i am trying to do and you need more info?

Shmuel · Posted 08-13-2019 01:04 PM

Without testing it seems that you miss ELSE between IF statements:

data want;
 set have;
      by s_id c_id ...;
      if <conditions-1> then ...; ELSE
      if <conditions-2> then ...; ELSE
      if <conditions-3> then ...;
         else do; ...???... ; end;
run;

SwissC · Posted 08-13-2019 02:01 PM

Not sure if you can do it in one step, but I tried this and it worked.

Only goes to one bin but it is a start

PROC SORT data=have;
  BY s_id mylastname descending c_id;
RUN;

DATA have0;  
  SET have;
  BY s_id mylastname descending c_id;
  RETAIN c_id1;
  IF first.mylastname THEN c_id1="   ";
  IF c_id ne "" THEN c_id1=c_id;
RUN;

PROC SORT data=have0;
  BY s_id c_id1 mylastname;
RUN;

DATA have1;
  SET have0;
  WHERE c_id1 ne "";
  BY s_id c_id1;
  IF first.c_id1;
RUN;

DATA have2;
  SET have0;
  WHERE c_id1="";
  BY s_id c_id1 mylastname;
  IF first.mylastname;
RUN;

DATA want;
  SET have1 have2;
RUN;

HB · Posted 08-13-2019 03:08 PM

@SwissC That gets close, I don't know why, but it does. Let me modify that, run it on the real data, and get back to you. I think I can match your have1 on c_id (I'd prefer to do the jones case on s_id but c_id may work) and I'll try your have2 on s_id-name key.

Sorting observations with duplicate values into various bins