BookmarkSubscribeRSS Feed
HB
Barite | Level 11 HB
Barite | Level 11

I have a data set with a number of observations that have duplicate unique values. Yes, I know- duplicate uniques.  Lol. The entire data set is pairs of records with duplicate s_ids.  

 

Anyway, I need to sort the out the duplicates in such a way that I can match them to another data set.  Nice little separate bins would help my brain. The relevant fields are an s_id, a c_id and name as shown below:

 

data have;
input s_id c_id $ mylastname $ want $ record_num;
datalines;
111	Q11	JONES	Y	1
111	Q11	JONES	N	2
222	Q21	DOE	Y	3
222	Q21	SMITH	N	4
333	Q31	FRANKLIN	Y	5
333	Q32	MADISON	Y	6
444	Q41	POTTER	Y	7
444	Q42	POTTER	Y	8
555	.	CLYDE	Y	9
555	.	CLYDE	N	10
666	.	PITT	Y	11
666	.	JOLIE	Y	12
777	Q71	LIBERTY	Y	13
777	.	LIBERTY	N	14
888	.	TIMON	Y	15
888	Q81	PUUMBA	Y	16
;
run;

For this purpose, i have added the fields WANT and RECORD_NUM. 

 

Walking through the pairs of records:

I want record 1 but not record 2.  Those two records have the same s_id, the same c_id, and the same lastname.  I just need one of them.  I'll match this bin on s_id.

I want record 3 but not record 4.  Those two records have the same s_id and the same c_id.  c_id trumps lastname. I'll match on this bin c_id.

I want 5, 6, 7 and 8.  c_id's are all different and trump whatever is going on in last name.  I'll also match this bin on c_id.

I want 9 but not 10.  No c_id to work with but the same lastname.  We'll skip 10. I'll match this on s_id.

I'd like both 11 and 12.  Again no c_id to work with but different last names.  I might match on an s_id-name key. Don't know yet.

I'd like 13 and not 14. They are likely the same person but 13 has a c_id and is a more complete record.

I'd like 15.  Match on s_id.

I'd like 16.   Match on c_id. 

 

 

I decided I could sort the records and maybe use the properties of first. and last. to bin them out.  But I fell down doing it. 

 

proc sort data = have;
	by s_id c_id mylastname;
run;


data values;
	set have;
	by s_id c_id mylastname;
	uidf = first.s_id;
	uidl = last.s_id;
	cidf = first.c_id; 
	cidl = last.c_id; 
	namef = first.mylastname;
	namel = last.mylastname;
run;

Dataset VALUES (made above, shown below) shows what I am trying to say:

 

s_id	c_id	mylastname	want	record_num	uidf	uidl	cidf	cidl	namef	namel
111	Q11	JONES		Y	1		1	0	1	0	1	0
111	Q11	JONES		N	2		0	1	0	1	0	1
222	Q21	DOE		Y	3		1	0	1	0	1	1
222	Q21	SMITH		N	4		0	1	0	1	1	1
333	Q31	FRANKLIN	Y	5		1	0	1	1	1	1
333	Q32	MADISON		Y	6		0	1	1	1	1	1
444	Q41	POTTER		Y	7		1	0	1	1	1	1
444	Q42	POTTER		Y	8		0	1	1	1	1	1
555		CLYDE		Y	9		1	0	1	0	1	0
555		CLYDE		N	10		0	1	0	1	0	1
666		JOLIE		Y	12		1	0	1	0	1	1
666		PITT		Y	11		0	1	0	1	1	1
777		LIBERTY		N	14		1	0	1	1	1	1
777	Q71	LIBERTY		Y	13		0	1	1	1	1	1
888		TIMON		Y	15		1	0	1	1	1	1
888	Q81	PUUMBA		Y	16		0	1	1	1	1	1

I then thought, okay i can pick up the 1,0 1,0 records and the 1,0 1,1 records and skip the 1,0 0,1 records etc .  So I started down that road:

 

data jones_doe potter_franklin_madison clyde;
	set have;
		by s_id c_id mylastname;

		*grabbing the first record of a pair where s_id and c_id are the same;
		if 
			not missing(c_id)
			and first.s_id = 1 
			and last.s_id = 0 
			and first.c_id = 1 
			and last.c_id = 0 
				then output jones_doe;
				* gets 1,3  leaves 2,4   match on s_id;

		* grabbing both records of a pair if s_id is the same and c_id is different;
		if 
			(not missing(c_id)
			and	first.s_id = 1 
			and last.s_id = 0 
			and first.c_id = 1 
			and last.c_id = 1) 
			or
			(not missing(c_id)
			and first.s_id = 0 
			and last.s_id = 1 
			and first.c_id = 1 
			and last.c_id = 1) 
				then output potter_franklin_madison;
				* gets 5,6,7,8  also picks up 13,16 but might be better to do them another way    match on c_id;

		*grabbing the first record of a pair where s_id and lname are the same and but c_id is missing;
		if 
			missing(c_id)
			and first.s_id = 1 
			and last.s_id = 0 
			and first.mylastname = 1 
			and last.mylastname = 0 
				then output clyde;
				* gets 9  leaves 10   match on s_id;

run;

 

This actually works and then it doesn't.  It gets me a bin with 1 and 3 in it (jones_doe) skipping 2 and 4, a bin with 5, 6, 7, and 8 (potter_franklin_madison), and a bin with 9 (clyde) skipping 10.

 

But then I fall down.  I can't seem to pick up 11, 12, 13, 15, and 16 in such a way that 14 doesn't come along for the ride.

 

So....

can you help?

 

Should I abandon this method and do something else/  What else? 

Can I tweak this method and yet make it work?  How to tweak?

Have i been completely obtuse in explaining what i am trying to do and you need more info?

 

 

 

3 REPLIES 3
Shmuel
Garnet | Level 18

Without testing it seems that you miss ELSE between IF statements:

data want;
 set have;
      by s_id c_id ...;
      if <conditions-1> then ...; ELSE
      if <conditions-2> then ...; ELSE
      if <conditions-3> then ...;
         else do; ...???... ; end;
run;
      
SwissC
Obsidian | Level 7

Not sure if you can do it in one step, but I tried this and it worked. 

 

Only goes to one bin but it is a start

PROC SORT data=have;
  BY s_id mylastname descending c_id;
RUN;

DATA have0;  
  SET have;
  BY s_id mylastname descending c_id;
  RETAIN c_id1;
  IF first.mylastname THEN c_id1="   ";
  IF c_id ne "" THEN c_id1=c_id;
RUN;

PROC SORT data=have0;
  BY s_id c_id1 mylastname;
RUN;

DATA have1;
  SET have0;
  WHERE c_id1 ne "";
  BY s_id c_id1;
  IF first.c_id1;
RUN;

DATA have2;
  SET have0;
  WHERE c_id1="";
  BY s_id c_id1 mylastname;
  IF first.mylastname;
RUN;

DATA want;
  SET have1 have2;
RUN;
HB
Barite | Level 11 HB
Barite | Level 11

@SwissC That gets close, I don't know why, but it does. Let me modify that, run it on the real data, and get back to you.  I think I can match your have1 on c_id (I'd prefer to do the jones case on s_id but c_id may work) and I'll try your have2 on s_id-name key. 

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 986 views
  • 0 likes
  • 3 in conversation