BookmarkSubscribeRSS Feed
HB
Barite | Level 11 HB
Barite | Level 11

I have a data set with a number of observations that have duplicate unique values. Yes, I know- duplicate uniques.  Lol. The entire data set is pairs of records with duplicate s_ids.  

 

Anyway, I need to sort the out the duplicates in such a way that I can match them to another data set.  Nice little separate bins would help my brain. The relevant fields are an s_id, a c_id and name as shown below:

 

data have;
input s_id c_id $ mylastname $ want $ record_num;
datalines;
111	Q11	JONES	Y	1
111	Q11	JONES	N	2
222	Q21	DOE	Y	3
222	Q21	SMITH	N	4
333	Q31	FRANKLIN	Y	5
333	Q32	MADISON	Y	6
444	Q41	POTTER	Y	7
444	Q42	POTTER	Y	8
555	.	CLYDE	Y	9
555	.	CLYDE	N	10
666	.	PITT	Y	11
666	.	JOLIE	Y	12
777	Q71	LIBERTY	Y	13
777	.	LIBERTY	N	14
888	.	TIMON	Y	15
888	Q81	PUUMBA	Y	16
;
run;

For this purpose, i have added the fields WANT and RECORD_NUM. 

 

Walking through the pairs of records:

I want record 1 but not record 2.  Those two records have the same s_id, the same c_id, and the same lastname.  I just need one of them.  I'll match this bin on s_id.

I want record 3 but not record 4.  Those two records have the same s_id and the same c_id.  c_id trumps lastname. I'll match on this bin c_id.

I want 5, 6, 7 and 8.  c_id's are all different and trump whatever is going on in last name.  I'll also match this bin on c_id.

I want 9 but not 10.  No c_id to work with but the same lastname.  We'll skip 10. I'll match this on s_id.

I'd like both 11 and 12.  Again no c_id to work with but different last names.  I might match on an s_id-name key. Don't know yet.

I'd like 13 and not 14. They are likely the same person but 13 has a c_id and is a more complete record.

I'd like 15.  Match on s_id.

I'd like 16.   Match on c_id. 

 

 

I decided I could sort the records and maybe use the properties of first. and last. to bin them out.  But I fell down doing it. 

 

proc sort data = have;
	by s_id c_id mylastname;
run;


data values;
	set have;
	by s_id c_id mylastname;
	uidf = first.s_id;
	uidl = last.s_id;
	cidf = first.c_id; 
	cidl = last.c_id; 
	namef = first.mylastname;
	namel = last.mylastname;
run;

Dataset VALUES (made above, shown below) shows what I am trying to say:

 

s_id	c_id	mylastname	want	record_num	uidf	uidl	cidf	cidl	namef	namel
111	Q11	JONES		Y	1		1	0	1	0	1	0
111	Q11	JONES		N	2		0	1	0	1	0	1
222	Q21	DOE		Y	3		1	0	1	0	1	1
222	Q21	SMITH		N	4		0	1	0	1	1	1
333	Q31	FRANKLIN	Y	5		1	0	1	1	1	1
333	Q32	MADISON		Y	6		0	1	1	1	1	1
444	Q41	POTTER		Y	7		1	0	1	1	1	1
444	Q42	POTTER		Y	8		0	1	1	1	1	1
555		CLYDE		Y	9		1	0	1	0	1	0
555		CLYDE		N	10		0	1	0	1	0	1
666		JOLIE		Y	12		1	0	1	0	1	1
666		PITT		Y	11		0	1	0	1	1	1
777		LIBERTY		N	14		1	0	1	1	1	1
777	Q71	LIBERTY		Y	13		0	1	1	1	1	1
888		TIMON		Y	15		1	0	1	1	1	1
888	Q81	PUUMBA		Y	16		0	1	1	1	1	1

I then thought, okay i can pick up the 1,0 1,0 records and the 1,0 1,1 records and skip the 1,0 0,1 records etc .  So I started down that road:

 

data jones_doe potter_franklin_madison clyde;
	set have;
		by s_id c_id mylastname;

		*grabbing the first record of a pair where s_id and c_id are the same;
		if 
			not missing(c_id)
			and first.s_id = 1 
			and last.s_id = 0 
			and first.c_id = 1 
			and last.c_id = 0 
				then output jones_doe;
				* gets 1,3  leaves 2,4   match on s_id;

		* grabbing both records of a pair if s_id is the same and c_id is different;
		if 
			(not missing(c_id)
			and	first.s_id = 1 
			and last.s_id = 0 
			and first.c_id = 1 
			and last.c_id = 1) 
			or
			(not missing(c_id)
			and first.s_id = 0 
			and last.s_id = 1 
			and first.c_id = 1 
			and last.c_id = 1) 
				then output potter_franklin_madison;
				* gets 5,6,7,8  also picks up 13,16 but might be better to do them another way    match on c_id;

		*grabbing the first record of a pair where s_id and lname are the same and but c_id is missing;
		if 
			missing(c_id)
			and first.s_id = 1 
			and last.s_id = 0 
			and first.mylastname = 1 
			and last.mylastname = 0 
				then output clyde;
				* gets 9  leaves 10   match on s_id;

run;

 

This actually works and then it doesn't.  It gets me a bin with 1 and 3 in it (jones_doe) skipping 2 and 4, a bin with 5, 6, 7, and 8 (potter_franklin_madison), and a bin with 9 (clyde) skipping 10.

 

But then I fall down.  I can't seem to pick up 11, 12, 13, 15, and 16 in such a way that 14 doesn't come along for the ride.

 

So....

can you help?

 

Should I abandon this method and do something else/  What else? 

Can I tweak this method and yet make it work?  How to tweak?

Have i been completely obtuse in explaining what i am trying to do and you need more info?

 

 

 

3 REPLIES 3
Shmuel
Garnet | Level 18

Without testing it seems that you miss ELSE between IF statements:

data want;
 set have;
      by s_id c_id ...;
      if <conditions-1> then ...; ELSE
      if <conditions-2> then ...; ELSE
      if <conditions-3> then ...;
         else do; ...???... ; end;
run;
      
SwissC
Obsidian | Level 7

Not sure if you can do it in one step, but I tried this and it worked. 

 

Only goes to one bin but it is a start

PROC SORT data=have;
  BY s_id mylastname descending c_id;
RUN;

DATA have0;  
  SET have;
  BY s_id mylastname descending c_id;
  RETAIN c_id1;
  IF first.mylastname THEN c_id1="   ";
  IF c_id ne "" THEN c_id1=c_id;
RUN;

PROC SORT data=have0;
  BY s_id c_id1 mylastname;
RUN;

DATA have1;
  SET have0;
  WHERE c_id1 ne "";
  BY s_id c_id1;
  IF first.c_id1;
RUN;

DATA have2;
  SET have0;
  WHERE c_id1="";
  BY s_id c_id1 mylastname;
  IF first.mylastname;
RUN;

DATA want;
  SET have1 have2;
RUN;
HB
Barite | Level 11 HB
Barite | Level 11

@SwissC That gets close, I don't know why, but it does. Let me modify that, run it on the real data, and get back to you.  I think I can match your have1 on c_id (I'd prefer to do the jones case on s_id but c_id may work) and I'll try your have2 on s_id-name key. 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 421 views
  • 0 likes
  • 3 in conversation