DATA Step, Macro, Functions and more

Combining datasets on mismatching ID variables

Reply
Contributor
Posts: 23

Combining datasets on mismatching ID variables

Hi all, I'm trying to figure out a way to combine a dataset back onto itself, while merging on rows only when the ID variable doesn't match. For instance, let's say the dataset HAVE has 30 rows, with ID = 1 to 30, and a bunch of other variables. I want to merge HAVE back onto itself, so that the row with ID = 1 now becomes 29 rows, with each of the 29 taking on the values from ID = 2 to 30 for the other variables. Then, the row that originally had ID = 2 would take on the values from ID = 1 and then ID = 3 to 30. And so on.

 

Initially, I was able to do this with hashing:

data WANT; 
if _N_=1 then do;
	declare hash h(dataset: 'HAVE(rename=(ID=CompID A=CompA B=CompB))') ;
	declare hiter iter('h') ;
	h.defineKey('CompID') ;
	h.defineData('CompID','CompA','CompB') ;
	h.defineDone() ;
	call missing(CompID, CompA, CompB);
	end;
set HAVE ;
by ID ;
HashCheck = iter.first();
do while (HashCheck = 0);
	if ID^=CompID then do;
		output;
		end;
	HashCheck = iter.next();
	end;
run; 

That worked when ID and A and B were all numeric. However, then I needed to also bring on C and D (as CompC and CompD), which were character variables, and it no longer worked, with the error message:

ERROR: Type mismatch for data variable CompC at line 1237 column 5.
ERROR: Hash data set load failed at line 1237 column 5.
ERROR: DATA STEP Component Object failure.  Aborted during the EXECUTION phase.

I thought I might be able to do it with something kind of like a DoW-loop, or something similar where there was a second SET statement within a DO loop, but that wasn't working either. It would output all 29 observations for the first value of ID from the base dataset, but then nothing else.

 

I might just end up doing a CROSS JOIN or something like that in PROC SQL. However, it's been bugging me that I can't figure out a way to do it in a DATA step. So, I thought I would ask if anyone had a suggestion for how it could be done in a DATA step.

 

Thanks!

Super User
Posts: 5,257

Re: Combining datasets on mismatching ID variables

This kind of processing is much easier to acomplish in SQL. So why try to do it in a data step, is it just the challenge?
Data never sleeps
Contributor
Posts: 23

Re: Combining datasets on mismatching ID variables

Yeah, I guess. I feel like it's something that should be able to be done in a data step, right?

Respected Advisor
Posts: 3,124

Re: Combining datasets on mismatching ID variables

IMHO, you have already cheated by using Hash Smiley Wink. The old school data step will be something like:

data have;
	do id=1 to 30;
		output;
	end;
run;

data want;
	set have;
	nobs=nobs;

	do i=1 to nobs;
		set have (rename=id=id_b) nobs=nobs point=i;

		if id ne id_b then
			output;
	end;
run;
Regular Contributor
Posts: 234

Re: Combining datasets on mismatching ID variables

Haikuo, I did not quite understand the second  data step.  What is the logic of putting two set statements with same input data set (have)?  Can you explain little bit?  Thanks !

Respected Advisor
Posts: 3,124

Re: Combining datasets on mismatching ID variables

Before the Hash era, this would be one of ways to do Cartesian join using data step (if you have to). So for every obs from the first 'have', SAS goes through every single obs in the second 'have'. The first 'have' is the default sequential data step from top down, the second 'have' is direct access by using 'point='. Of course there are some other details involved, and frankly, without knowing your current knowledge structure, I have no idea how to start, it could involve lots of reading on your end.

 

Regular Contributor
Posts: 234

Re: Combining datasets on mismatching ID variables

Thanks, Haikuo. This is basically many-to-many merge. I also wonder how does the nobs after first "have" (nobs=nobs) gets its values after first set statement executes, since there is not nobs option in the first set statement. I know SAS statements are executed sequentially (executable statements).
Respected Advisor
Posts: 3,124

Re: Combining datasets on mismatching ID variables

it gets the value from second 'set' statement, which is done during the compiling stage. 'Set' statement is a kind of 'hybrid', half-executable and half-declarative, so long before the second 'set' is executed, 'nobs' has been set up with a value. There are many old docs/SUGI papers on this topic, I believe they are ready to be found for curious mind like you. Good Luck!

 

Haikuo

Contributor
Posts: 23

Re: Combining datasets on mismatching ID variables

Thanks Haikuo! I didn't know about the NOBS= or POINT= arguments, but after reading about them they look like they will be very useful to me.

 

Thanks again!

Respected Advisor
Posts: 3,124

Re: Combining datasets on mismatching ID variables

[ Edited ]

Yes, there are always more than one ways to skin a cat, especially true in SAS, however, in most cases, one or two solutions stand out as the most suitable. To be honest with you, in your case, Proc SQL is among the most suitable approaches, as Cartesian product is what SQL does the best.

 

data have;
	do id=1 to 30;
		output;
	end;
run;

proc sql;
	create table want as
		select * from have a, have(rename=id=id_b) b
			where a.id ne b.id_b;
quit;

 As for your Hash question, it turns out to be simple one. You will need to first define character variables such as CompC or CompB by the following:

	length CompC CompB $ 20;

of course you need to tweak the actual length to match your data. Call missing alone only produces numeric variables by default unless you define it as Char in advance.

 

Respected Advisor
Posts: 4,651

Re: Combining datasets on mismatching ID variables

Very insightful @Haikuo, I would only add that RENAME is not needed since SQL has its own renaming syntax

 

proc sql;
	create table want as
		select 
			a.*, 
			b.id as id_b 
		from 
			have as a cross join 
			have as b
			where a.id ne b.id;
quit;

   

PG
Super Contributor
Posts: 254

Re: Combining datasets on mismatching ID variables

[ Edited ]

It is still doable. Your description of what you want is not clear, at least for me. If small sample data set with desired output, if shown, will enthuse people to try out and get a suitable data step solution.

Ask a Question
Discussion stats
  • 11 replies
  • 374 views
  • 3 likes
  • 6 in conversation