Solved: Re: Output Multiple Datasets in Datastep with indirect name references

jlevine · Posted 11-16-2009 09:27 AM

I want to output multiple datasets from a single data step, but I want the "if" statement that determines which data set to which I output a given record to be determined from the value of a variable.

For example, I want to process data and loop through a set of MARKET_IDs, calculating various information for that market, and then output the data to a data set for each market.

I know how to do this using if-then-else constructs, but that takes a lot of processing time given that I have 9 million+ customer records that I want to output to 1500+ market data sets. I currently have this written using macro code, but here is a simplified version of what I am doing:

data market1 market2 market3;
set customer_data;
do market_id=1 to 3;
*do a bunch of calculations here;
if market_id=1 then output market1;
else if market_id=2 then output market2;
else if market_id=3 then output market3;
end;
run;

The problem is that with 1500+ output datasets, a given record has to go through on average 750 if-then checks before it finds its home. I would like something like this (which I know does not work because call symput only populates the variable at the end of the data step, not in the middle).

data market1 market2 market3;
set customer_data;
do market_id=1 to 3;
*do a bunch of calculations here;
call symput ('mkt_id',market_id);
output market&mkt_id.;
end;
run;

Is there any way to do this? It would be sort of like the indirect() function in Excel - referring to a dataset based on an indirect reference. I have tried call execute, macros, etc., and no luck.

DanielSantos · Posted 11-16-2009 10:24 AM

Editor's Note: Thanks to Daniel for pointing us to using a hash object as the solution. The paper by Paul Dorfman that Daniel points to contains an example of how to efficiently output to multiple data sets. Look for the section entiled Splitting a SAS File Dynamically using the .OUTPUT() Method. Example 9 shows how to use the .OUTPUT() method to dynamically output to multiple data sets. If you are not familar with hash objects in SAS, example 8 in this paper shows a non-hash method that uses PROC SQL to dynamically output to multiple data sets.

Hi.

You cannot mix macro code generated at compile-time with code evaluated at run-time.

Their is actually a workaround that I'm seeing, and that I have used in the past with success. Because of memory limitations, the data should be pre-ordered by the splitter variable (market_id).
The idea is to load into a hash object each segment (market_id) while reading the dataset and whenever reaching the next segment, output the hash to a dataset (output method), empty it, and then do the same for the next segment, until the entire dataset is processed.
The output method of the hash object allows to specify an expression for the dataset naming, so you could easily suffix some number to the dataset name.
Be aware that a hash object is entirely stored in memory, so you should be reasonable about the amount of data you store in it. This explains why you should do it by segments, instead of dealing with the whole customer records.

Check the online doc about the hash object:
http://support.sas.com/documentation/cdl/en/lrdict/62618/HTML/default/a003143739.htm

The output method:
http://support.sas.com/documentation/cdl/en/lrdict/62618/HTML/default/a002588845.htm

And don't forget to read the excellent Paul Dorfman's paper, with particular attention for the 'SPLITTING A SAS FILE DYNAMICALLY USING THE .OUTPUT() METHOD' section:
http://www2.sas.com/proceedings/sugi30/236-30.pdf

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

View solution in original post

SAS_user · Posted 11-16-2009 10:05 AM

Hello,

its a bit complicated, to use such a lot if-then-else.

I would offer you to make such algorithm, where if-then-else is eliminated.
This algorithm is based on using some extra sorting and proc freq.

First of all output your values in a table _tmp1_.
Then sort it by market_id.

Use proc freq to callculate measures.
For example You get such results in a table market_id_freqs
Market_id _freq_ sumFreq_
1 100 100
2 101 201
3 102 303
...
1500 100 sumvalue

Now then they are sorted, you know exactly when every market_id starts and when ends in a table _tmp_.(for instance records with market_id 2 are from numbers 101 and 201 )
After this create a macroLoop which reads table market_id_freqs with and does such a result.

data market&i;
set _tmp1_ (firstobs = &sum_freq - &freq+1 obs = &sum_freq);
run;
where
&freq is a _freq_ column
&sum_freq is a sumFreq_ colum

First three steps could be like this:
data market1;
set _tmp1_ (firstobs = 1 obs=100); /* 100 - 100 + 1 */
run;

data market2;
set _tmp1_ (firstobs = 101 obs = 201); /* 201 - 101 + 1 */
run;

data market3;
set _tmp1_(firstobs = 202 obs = 303); /* 303 -102 + 1 */
run;

Good luck.

deleted_user · Posted 11-16-2009 10:21 AM

Hi,

I think you should look at this from another angle. Consider this program

DATA test;
var1 = "A";
var2 = 1;
OUTPUT;
var1 = "B";
var2 = 2;
OUTPUT;
RUN;

DATA test1 (WHERE=(var2=1))
test2 (WHERE=(var2=2));
SET test;
RUN;

You can replace of course the data part of the datastep with a macro...
This way, the datastep writes to all datasets, but only the records taht correspond to the criterium per dataset are kept...

DanielSantos · Posted 11-16-2009 10:24 AM

Editor's Note: Thanks to Daniel for pointing us to using a hash object as the solution. The paper by Paul Dorfman that Daniel points to contains an example of how to efficiently output to multiple data sets. Look for the section entiled Splitting a SAS File Dynamically using the .OUTPUT() Method. Example 9 shows how to use the .OUTPUT() method to dynamically output to multiple data sets. If you are not familar with hash objects in SAS, example 8 in this paper shows a non-hash method that uses PROC SQL to dynamically output to multiple data sets.

Hi.

You cannot mix macro code generated at compile-time with code evaluated at run-time.

Their is actually a workaround that I'm seeing, and that I have used in the past with success. Because of memory limitations, the data should be pre-ordered by the splitter variable (market_id).
The idea is to load into a hash object each segment (market_id) while reading the dataset and whenever reaching the next segment, output the hash to a dataset (output method), empty it, and then do the same for the next segment, until the entire dataset is processed.
The output method of the hash object allows to specify an expression for the dataset naming, so you could easily suffix some number to the dataset name.
Be aware that a hash object is entirely stored in memory, so you should be reasonable about the amount of data you store in it. This explains why you should do it by segments, instead of dealing with the whole customer records.

Check the online doc about the hash object:
http://support.sas.com/documentation/cdl/en/lrdict/62618/HTML/default/a003143739.htm

The output method:
http://support.sas.com/documentation/cdl/en/lrdict/62618/HTML/default/a002588845.htm

And don't forget to read the excellent Paul Dorfman's paper, with particular attention for the 'SPLITTING A SAS FILE DYNAMICALLY USING THE .OUTPUT() METHOD' section:
http://www2.sas.com/proceedings/sugi30/236-30.pdf

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

SASPhile · Posted 11-16-2009 10:45 AM

I had to work on the similar situation.There is what I did:

%macro test(bu);
data _null_;
set prcflw.mdmc_&bu._mkt_prd end = eof;
if eof then call symput('endmkt',_n_);
xvar = compress('SASMKT' || _n_);
call symput (xvar,sas_mkt_cd);
run;
%do
i= 1 %to &endmkt;
data &&sasmkt&i;
set &&sasmkt&i;
run;
%end;
%mend;
%test(bu);

Bal23 · Posted 07-05-2016 09:53 AM

Thank you very much. This is very helpful. But I do not quite understand. Would you please explain it to me for each part of your code? Thank you again

jlevine · Posted 11-16-2009 10:49 AM

Thanks for the quick response. The hash thing might do it for me. The other methods don't work because I am working with so much data and trying to avoid writing everything out to one big dataset.

I am reading in the customers one record at a time, and then looping through all of the markets. If a customer's stats for the given market meet a threshold, I output that record. Otherwise I don't want to output anything. I can't just build a hash for the whole market and output it at the end, because it would be way too big and blow out my memory. Can I output the customer record at each iteration and then remove it from the hash, solving the memory problems? Once I've output the record, I don't need it anymore.

Of course, if this isn't faster than my current method, it's moot, but it is worth a try. Thanks!

sbb · Posted 11-16-2009 11:38 AM

Consider using a SAS INDEX on the master file and generate (with macro code) iterative DATA step executions to filter/split your file, creating smaller individual unilt files.

Scott Barry
SBBWorks, Inc.

Peter_C · Posted 11-16-2009 12:07 PM

can we use ODS OUTPUT to create datasets at by-group level?

abdullala · Posted 11-17-2009 12:20 PM

for a 9 mil record non-sorted data I believe 'where' is faster than 'if-then', because it reads in only a subset of the 9 mil, the memory should not be a problem either.

will be best if can establish an index on market_id before processing.

%macro test;
%do i=1 %to 1500;
data market&i;
set customer_data(where=(market_id=&i));
...;
run;
%end;
%mend;

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away