BookmarkSubscribeRSS Feed
gfarkas
Calcite | Level 5

Hi all, I'm trying to figure out a way to combine a dataset back onto itself, while merging on rows only when the ID variable doesn't match. For instance, let's say the dataset HAVE has 30 rows, with ID = 1 to 30, and a bunch of other variables. I want to merge HAVE back onto itself, so that the row with ID = 1 now becomes 29 rows, with each of the 29 taking on the values from ID = 2 to 30 for the other variables. Then, the row that originally had ID = 2 would take on the values from ID = 1 and then ID = 3 to 30. And so on.

 

Initially, I was able to do this with hashing:

data WANT; 
if _N_=1 then do;
	declare hash h(dataset: 'HAVE(rename=(ID=CompID A=CompA B=CompB))') ;
	declare hiter iter('h') ;
	h.defineKey('CompID') ;
	h.defineData('CompID','CompA','CompB') ;
	h.defineDone() ;
	call missing(CompID, CompA, CompB);
	end;
set HAVE ;
by ID ;
HashCheck = iter.first();
do while (HashCheck = 0);
	if ID^=CompID then do;
		output;
		end;
	HashCheck = iter.next();
	end;
run; 

That worked when ID and A and B were all numeric. However, then I needed to also bring on C and D (as CompC and CompD), which were character variables, and it no longer worked, with the error message:

ERROR: Type mismatch for data variable CompC at line 1237 column 5.
ERROR: Hash data set load failed at line 1237 column 5.
ERROR: DATA STEP Component Object failure.  Aborted during the EXECUTION phase.

I thought I might be able to do it with something kind of like a DoW-loop, or something similar where there was a second SET statement within a DO loop, but that wasn't working either. It would output all 29 observations for the first value of ID from the base dataset, but then nothing else.

 

I might just end up doing a CROSS JOIN or something like that in PROC SQL. However, it's been bugging me that I can't figure out a way to do it in a DATA step. So, I thought I would ask if anyone had a suggestion for how it could be done in a DATA step.

 

Thanks!

11 REPLIES 11
LinusH
Tourmaline | Level 20
This kind of processing is much easier to acomplish in SQL. So why try to do it in a data step, is it just the challenge?
Data never sleeps
gfarkas
Calcite | Level 5

Yeah, I guess. I feel like it's something that should be able to be done in a data step, right?

Haikuo
Onyx | Level 15

IMHO, you have already cheated by using Hash Smiley Wink. The old school data step will be something like:

data have;
	do id=1 to 30;
		output;
	end;
run;

data want;
	set have;
	nobs=nobs;

	do i=1 to nobs;
		set have (rename=id=id_b) nobs=nobs point=i;

		if id ne id_b then
			output;
	end;
run;
SAS_inquisitive
Lapis Lazuli | Level 10

Haikuo, I did not quite understand the second  data step.  What is the logic of putting two set statements with same input data set (have)?  Can you explain little bit?  Thanks !

Haikuo
Onyx | Level 15

Before the Hash era, this would be one of ways to do Cartesian join using data step (if you have to). So for every obs from the first 'have', SAS goes through every single obs in the second 'have'. The first 'have' is the default sequential data step from top down, the second 'have' is direct access by using 'point='. Of course there are some other details involved, and frankly, without knowing your current knowledge structure, I have no idea how to start, it could involve lots of reading on your end.

 

SAS_inquisitive
Lapis Lazuli | Level 10
Thanks, Haikuo. This is basically many-to-many merge. I also wonder how does the nobs after first "have" (nobs=nobs) gets its values after first set statement executes, since there is not nobs option in the first set statement. I know SAS statements are executed sequentially (executable statements).
Haikuo
Onyx | Level 15

it gets the value from second 'set' statement, which is done during the compiling stage. 'Set' statement is a kind of 'hybrid', half-executable and half-declarative, so long before the second 'set' is executed, 'nobs' has been set up with a value. There are many old docs/SUGI papers on this topic, I believe they are ready to be found for curious mind like you. Good Luck!

 

Haikuo

gfarkas
Calcite | Level 5

Thanks Haikuo! I didn't know about the NOBS= or POINT= arguments, but after reading about them they look like they will be very useful to me.

 

Thanks again!

Haikuo
Onyx | Level 15

Yes, there are always more than one ways to skin a cat, especially true in SAS, however, in most cases, one or two solutions stand out as the most suitable. To be honest with you, in your case, Proc SQL is among the most suitable approaches, as Cartesian product is what SQL does the best.

 

data have;
	do id=1 to 30;
		output;
	end;
run;

proc sql;
	create table want as
		select * from have a, have(rename=id=id_b) b
			where a.id ne b.id_b;
quit;

 As for your Hash question, it turns out to be simple one. You will need to first define character variables such as CompC or CompB by the following:

	length CompC CompB $ 20;

of course you need to tweak the actual length to match your data. Call missing alone only produces numeric variables by default unless you define it as Char in advance.

 

PGStats
Opal | Level 21

Very insightful @Haikuo, I would only add that RENAME is not needed since SQL has its own renaming syntax

 

proc sql;
	create table want as
		select 
			a.*, 
			b.id as id_b 
		from 
			have as a cross join 
			have as b
			where a.id ne b.id;
quit;

   

PG
KachiM
Rhodochrosite | Level 12

It is still doable. Your description of what you want is not clear, at least for me. If small sample data set with desired output, if shown, will enthuse people to try out and get a suitable data step solution.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 2314 views
  • 3 likes
  • 6 in conversation