12-24-2015 03:22 PM
Hi all, I'm trying to figure out a way to combine a dataset back onto itself, while merging on rows only when the ID variable doesn't match. For instance, let's say the dataset HAVE has 30 rows, with ID = 1 to 30, and a bunch of other variables. I want to merge HAVE back onto itself, so that the row with ID = 1 now becomes 29 rows, with each of the 29 taking on the values from ID = 2 to 30 for the other variables. Then, the row that originally had ID = 2 would take on the values from ID = 1 and then ID = 3 to 30. And so on.
Initially, I was able to do this with hashing:
data WANT; if _N_=1 then do; declare hash h(dataset: 'HAVE(rename=(ID=CompID A=CompA B=CompB))') ; declare hiter iter('h') ; h.defineKey('CompID') ; h.defineData('CompID','CompA','CompB') ; h.defineDone() ; call missing(CompID, CompA, CompB); end; set HAVE ; by ID ; HashCheck = iter.first(); do while (HashCheck = 0); if ID^=CompID then do; output; end; HashCheck = iter.next(); end; run;
That worked when ID and A and B were all numeric. However, then I needed to also bring on C and D (as CompC and CompD), which were character variables, and it no longer worked, with the error message:
ERROR: Type mismatch for data variable CompC at line 1237 column 5. ERROR: Hash data set load failed at line 1237 column 5. ERROR: DATA STEP Component Object failure. Aborted during the EXECUTION phase.
I thought I might be able to do it with something kind of like a DoW-loop, or something similar where there was a second SET statement within a DO loop, but that wasn't working either. It would output all 29 observations for the first value of ID from the base dataset, but then nothing else.
I might just end up doing a CROSS JOIN or something like that in PROC SQL. However, it's been bugging me that I can't figure out a way to do it in a DATA step. So, I thought I would ask if anyone had a suggestion for how it could be done in a DATA step.
12-24-2015 03:55 PM
12-24-2015 04:13 PM
IMHO, you have already cheated by using Hash . The old school data step will be something like:
data have; do id=1 to 30; output; end; run; data want; set have; nobs=nobs; do i=1 to nobs; set have (rename=id=id_b) nobs=nobs point=i; if id ne id_b then output; end; run;
12-24-2015 07:46 PM
Haikuo, I did not quite understand the second data step. What is the logic of putting two set statements with same input data set (have)? Can you explain little bit? Thanks !
12-25-2015 12:22 PM
Before the Hash era, this would be one of ways to do Cartesian join using data step (if you have to). So for every obs from the first 'have', SAS goes through every single obs in the second 'have'. The first 'have' is the default sequential data step from top down, the second 'have' is direct access by using 'point='. Of course there are some other details involved, and frankly, without knowing your current knowledge structure, I have no idea how to start, it could involve lots of reading on your end.
12-25-2015 01:06 PM
12-25-2015 03:02 PM
it gets the value from second 'set' statement, which is done during the compiling stage. 'Set' statement is a kind of 'hybrid', half-executable and half-declarative, so long before the second 'set' is executed, 'nobs' has been set up with a value. There are many old docs/SUGI papers on this topic, I believe they are ready to be found for curious mind like you. Good Luck!
12-26-2015 10:52 AM
Thanks Haikuo! I didn't know about the NOBS= or POINT= arguments, but after reading about them they look like they will be very useful to me.
12-24-2015 04:05 PM - edited 12-24-2015 04:07 PM
Yes, there are always more than one ways to skin a cat, especially true in SAS, however, in most cases, one or two solutions stand out as the most suitable. To be honest with you, in your case, Proc SQL is among the most suitable approaches, as Cartesian product is what SQL does the best.
data have; do id=1 to 30; output; end; run; proc sql; create table want as select * from have a, have(rename=id=id_b) b where a.id ne b.id_b; quit;
As for your Hash question, it turns out to be simple one. You will need to first define character variables such as CompC or CompB by the following:
length CompC CompB $ 20;
of course you need to tweak the actual length to match your data. Call missing alone only produces numeric variables by default unless you define it as Char in advance.
12-24-2015 05:17 PM
Very insightful @Haikuo, I would only add that RENAME is not needed since SQL has its own renaming syntax
proc sql; create table want as select a.*, b.id as id_b from have as a cross join have as b where a.id ne b.id; quit;
12-26-2015 08:01 AM - edited 12-26-2015 08:04 AM
It is still doable. Your description of what you want is not clear, at least for me. If small sample data set with desired output, if shown, will enthuse people to try out and get a suitable data step solution.