Re: Combining datasets on mismatching ID variables

gfarkas · Posted 12-24-2015 03:22 PM

Hi all, I'm trying to figure out a way to combine a dataset back onto itself, while merging on rows only when the ID variable doesn't match. For instance, let's say the dataset HAVE has 30 rows, with ID = 1 to 30, and a bunch of other variables. I want to merge HAVE back onto itself, so that the row with ID = 1 now becomes 29 rows, with each of the 29 taking on the values from ID = 2 to 30 for the other variables. Then, the row that originally had ID = 2 would take on the values from ID = 1 and then ID = 3 to 30. And so on.

Initially, I was able to do this with hashing:

data WANT; 
if _N_=1 then do;
	declare hash h(dataset: 'HAVE(rename=(ID=CompID A=CompA B=CompB))') ;
	declare hiter iter('h') ;
	h.defineKey('CompID') ;
	h.defineData('CompID','CompA','CompB') ;
	h.defineDone() ;
	call missing(CompID, CompA, CompB);
	end;
set HAVE ;
by ID ;
HashCheck = iter.first();
do while (HashCheck = 0);
	if ID^=CompID then do;
		output;
		end;
	HashCheck = iter.next();
	end;
run;

That worked when ID and A and B were all numeric. However, then I needed to also bring on C and D (as CompC and CompD), which were character variables, and it no longer worked, with the error message:

ERROR: Type mismatch for data variable CompC at line 1237 column 5.
ERROR: Hash data set load failed at line 1237 column 5.
ERROR: DATA STEP Component Object failure.  Aborted during the EXECUTION phase.

I thought I might be able to do it with something kind of like a DoW-loop, or something similar where there was a second SET statement within a DO loop, but that wasn't working either. It would output all 29 observations for the first value of ID from the base dataset, but then nothing else.

I might just end up doing a CROSS JOIN or something like that in PROC SQL. However, it's been bugging me that I can't figure out a way to do it in a DATA step. So, I thought I would ask if anyone had a suggestion for how it could be done in a DATA step.

Thanks!

LinusH · Posted 12-24-2015 03:55 PM

This kind of processing is much easier to acomplish in SQL. So why try to do it in a data step, is it just the challenge?

Data never sleeps

gfarkas · Posted 12-24-2015 03:56 PM

Yeah, I guess. I feel like it's something that should be able to be done in a data step, right?

Haikuo · Posted 12-24-2015 04:13 PM

IMHO, you have already cheated by using Hash . The old school data step will be something like:

data have;
	do id=1 to 30;
		output;
	end;
run;

data want;
	set have;
	nobs=nobs;

	do i=1 to nobs;
		set have (rename=id=id_b) nobs=nobs point=i;

		if id ne id_b then
			output;
	end;
run;

SAS_inquisitive · Posted 12-24-2015 07:46 PM

Haikuo, I did not quite understand the second data step. What is the logic of putting two set statements with same input data set (have)? Can you explain little bit? Thanks !

Haikuo · Posted 12-25-2015 12:22 PM

Before the Hash era, this would be one of ways to do Cartesian join using data step (if you have to). So for every obs from the first 'have', SAS goes through every single obs in the second 'have'. The first 'have' is the default sequential data step from top down, the second 'have' is direct access by using 'point='. Of course there are some other details involved, and frankly, without knowing your current knowledge structure, I have no idea how to start, it could involve lots of reading on your end.

SAS_inquisitive · Posted 12-25-2015 01:06 PM

Thanks, Haikuo. This is basically many-to-many merge. I also wonder how does the nobs after first "have" (nobs=nobs) gets its values after first set statement executes, since there is not nobs option in the first set statement. I know SAS statements are executed sequentially (executable statements).

Haikuo · Posted 12-25-2015 03:02 PM

it gets the value from second 'set' statement, which is done during the compiling stage. 'Set' statement is a kind of 'hybrid', half-executable and half-declarative, so long before the second 'set' is executed, 'nobs' has been set up with a value. There are many old docs/SUGI papers on this topic, I believe they are ready to be found for curious mind like you. Good Luck!

Haikuo

gfarkas · Posted 12-26-2015 10:52 AM

Thanks Haikuo! I didn't know about the NOBS= or POINT= arguments, but after reading about them they look like they will be very useful to me.

Thanks again!

Haikuo · Posted 12-24-2015 04:05 PM

Yes, there are always more than one ways to skin a cat, especially true in SAS, however, in most cases, one or two solutions stand out as the most suitable. To be honest with you, in your case, Proc SQL is among the most suitable approaches, as Cartesian product is what SQL does the best.

data have;
	do id=1 to 30;
		output;
	end;
run;

proc sql;
	create table want as
		select * from have a, have(rename=id=id_b) b
			where a.id ne b.id_b;
quit;

As for your Hash question, it turns out to be simple one. You will need to first define character variables such as CompC or CompB by the following:

	length CompC CompB $ 20;

of course you need to tweak the actual length to match your data. Call missing alone only produces numeric variables by default unless you define it as Char in advance.

PGStats · Posted 12-24-2015 05:17 PM

Very insightful @Haikuo, I would only add that RENAME is not needed since SQL has its own renaming syntax

proc sql;
	create table want as
		select 
			a.*, 
			b.id as id_b 
		from 
			have as a cross join 
			have as b
			where a.id ne b.id;
quit;

PG

KachiM · Posted 12-26-2015 08:01 AM

It is still doable. Your description of what you want is not clear, at least for me. If small sample data set with desired output, if shown, will enthuse people to try out and get a suitable data step solution.

Registration is open

SAS Training: Just a Click Away