DATA Step, Macro, Functions and more

Create datasets based on different dates on date field.

Reply
Occasional Contributor KYW
Occasional Contributor
Posts: 12

Create datasets based on different dates on date field.

Hi all.
I'd like to create datasets based on the different dates in a date column from a base dataset before exporting them to csv files. Eg.

Dates Item
01APR2010 A
01APR2010 B
01APR2010 C
05JUL2010 A
20JUL2010 E

I can only create a dataset (output_item_01APR2010) with only data with 01APR2010 by using the following code.

%let transdt = 01APR2010;

data output_item_&transdt;
set base_table;
where transaction_dt="&transdt"d;
run;

How do I create multiple datasets based on different dates on the date field without having to specify each date using the %let statement?

Thanks!

KYW
Valued Guide
Posts: 634

Re: Create datasets based on different dates on date field.

This is a two pass process. First you need a list of distinct dates and then use that list with a macro %DO loop. This article on sasCommunity does something similar. Ron Fehd has a number of papers on macro list processing.
http://www.sascommunity.org/wiki/Automatically_Separating_Data_into_Excel_Sheets
Let us know if you need more detail.
Occasional Contributor KYW
Occasional Contributor
Posts: 12

Re: Create datasets based on different dates on date field.

Hi ArtC. Thanks for the excellent resources. I've applied the method in the article and it works great.
Regular Contributor
Posts: 241

Re: Create datasets based on different dates on date field.

ArtC wrote:

> This is a two pass process. First you need a list of

> distinct dates and then use that list with a macro

> %DO loop.



No. It doesn't have to be. With a moderate amount of data for each subset, it is easy to implement a one-pass solution using hash.



   /* test data */


   data one;


     input date :date9. item :$1.;


   cards;


   01APR2010 A


   01APR2010 B


   01APR2010 C


   05JUL2010 A


   20JUL2010 E


   ;


   run;


 


   /* separate out to daily datasets. one-pass solution, 


       assuming that the data are already sorted by date. */


   data _null_;


     if 0 then set one; /* to prep pdv */


     dcl hash h(ordered:'ascending');


     h.definekey('id');


     h.definedata('item');


     h.definedone();


 


     do until (last.date);


       set one;


       by date notsorted;


       id + 1;


       h.add();


     end;


     


     name = catt("item_", put(date,date9.));


     h.output(dataset: name);


   run;   


   /* on log


   NOTE: The data set WORK.ITEM_01APR2010 has 3 observations and 1 variables.


   NOTE: The data set WORK.ITEM_05JUL2010 has 1 observations and 1 variables.


   NOTE: The data set WORK.ITEM_20JUL2010 has 1 observations and 1 variables.


   */

Super User
Posts: 10,023

Re: Create datasets based on different dates on date field.

[pre]
data temp;
input Dates : date9. Item $;
format dates date9.;
cards;
01APR2010 A
01APR2010 B
01APR2010 C
05JUL2010 A
20JUL2010 E
;
proc sql feedback;
select distinct dates from temp;
select distinct dates
into : date1 - : date&sqlobs.
from temp;
quit;
options mprint mlogic;
%macro create_tables;
data %do j=1 %to &sqlobs.;
output_item_&&date&j
%end;
;
set temp;
select ( dates );
%do i=1 %to &sqlobs.;
when ("&&date&i"d) output output_item_&&date&i;
%end;
otherwise;
end;
run;
%mend;

%create_tables
[/pre]



Ksharp Message was edited by: Ksharp
Occasional Contributor KYW
Occasional Contributor
Posts: 12

Re: Create datasets based on different dates on date field.

Hi KSharp. Nice code there. Will try this out to optimize my current coding.
Valued Guide
Posts: 2,177

Re: Create datasets based on different dates on date field.

KYW
as always SAS offers more than one way to do this kind of thing. So here is a hybrid of sql and data steps.
I like using SQL to help solve this problem because working on windows so much I want to avoid I/O as much as possible so read as little data as possible.
For each date value, your original model solution reads the whole table (assuming no index on data would exist and be used in this case).
Quite efficiently SQL will create a list of just the date values into a new table. From the new table SQL is good/convenient at filling a macro variable with the syntax inolving the dates.

*1 first collect a list of the dates;[pre]proc sql noprint ;
create table dates as select distinct date from your.data ;[/pre]* now create the pieces of syntax that will refer to dates;[pre]%let output_table_prefix = your.output_item_ ;
select "&output_table_prefix" !! put( date, date9. ) length= 41
into : tables separated by ' ' from dates ;[/pre]* the other date piece is the decisions to write to each dated-table ;[pre] select ' date=' !! put( date, z6.) !!
" then output &output_table_prefix" !! put( date, date9. )
into Smiley Surprisedutputs separated by '; else if ' from dates ;
quit ;[/pre]* And finally assemble these bits into a (not long) data step (because SQL seems able to write only one table at a time, and data steps can write all the dated-tables you need in a single pass ;[pre]option symbolgen ;
data &tables ;
set your.data ;
if &outputs ;
else do ;
put 'unexpected ' date= date9. ' on row ' _n_ ;
stop ; * very unexpected ;
end ;
run ;
[/pre]In this approach your data are read twice: first (a light touch) to create the distinct dates list; and second, to read and write the data to the date-named tables.
Should be quick.

Peter
Occasional Contributor KYW
Occasional Contributor
Posts: 12

Re: Create datasets based on different dates on date field.

Hi Peter.

Thank you for your detailed explanation.

Regarding your codes above, specifically in the final part (data &table), what if I have a master dataset of around 100GB? Would the set statement slow down the process if the master dataset is not splitted into smaller chunks based on a specific period?
Valued Guide
Posts: 2,177

Re: Create datasets based on different dates on date field.

> Regarding your codes above, specifically in the final
> part (data &table), what if I have a master dataset
> of around 100GB? Would the set statement slow down

no
the solution is optimal for data not sorted or indexed on that date and was designed for application when it is difficult, not only to re-write the data, but particularly difficult to sort 100GB.

The only issue for performance I see, might occur where you have very many distinct dates. But that is also a problem for your application = "what use is it to have 1000 tables from 3 years of data?"

peter
Super User
Posts: 10,023

Re: Create datasets based on different dates on date field.

Yes.Peter.
You have mentioned a problem in your code.
Macro variable has limited length (32767 or something else),That is to say when &outputs can not hold all the value from PROC SQL, will trigger an error.


Ksharp.
Valued Guide
Posts: 2,177

Re: Create datasets based on different dates on date field.

Ksharp
recently a usage note circulated reminding me that there is a 64K limit on macro variable length.
However, I think there might be serious environment limitations if a tenth of that limit were used to hold the &outputs macro variable.
Have you ever created so many tables in one data step?
100 tables would need perhaps up to 40 characters of the macro variable each just 4K.
I think the problem is not so much the limitation of macro variable widths, as :
1 is there enough memory for a data step to support the metadata required for all 100 tables?
2 what does an application want to do with the 100 tables ?

I think the environment will fail before the capacity of the macro variable is found to be a limit.
(maybe not as I have no trouble running something as simple as
%macro gen(n, pattern=###, from=1) ;
%do i= &from %to &n ;
%sysfunc( tranwrd( %superq(pattern), ###, &i ))
%end ;
%mend ;
%put demo %gen( 7, from=4,pattern= abc###ab### ) ;

option fullstimer ;
data %gen( 1111, pattern= test### );
set sashelp.class ;
run;
which does successfully create 1111 copies of sashelp.class with these fullstimer notes[pre]NOTE: The data set WORK.TEST1110 has 19 observations and 5 variables.
NOTE: The data set WORK.TEST1111 has 19 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 12.98 seconds
user cpu time 0.73 seconds
system cpu time 7.37 seconds
Memory 36156k
OS Memory 43544k
Timestamp 22/12/2010 11:28:52[/pre] That OS Memory usage scales up with the number of datasets created - and I imagine also with the number of variables.

I'm still betting that the 64K capacity of a macro variable won't be the limiting factor for the application.

Do you have the optimal alternative that minimises the number of passes through the 100GB while not suffering that limited capacity in &outputs?

hth

peterC
Super User
Posts: 10,023

Re: Create datasets based on different dates on date field.

Hi.Peter.
>Have you ever created so many tables in one data step?

I have not created so many datasets before.But I used the method you mentions to label the variables in lots of datasets before,finally the macro variable &outputs overflows the max length of itself under UNIX environment.So I have to sparse the &outputs as I mentioned.

And your gived example is very interesting and useful for me.
Admittedly this way is very to waste OS source,Also hope someone can give more optimal
alternative. About @data _null_;?? It is long time no see him/her and Mr.Patrick.




Ksharp
Valued Guide
Posts: 2,177

Re: Create datasets based on different dates on date field.

> Hi.Peter.
>
> alternative. About @data _null_;?? It is long time no
> see him/her and Mr.Patrick.
> Ksharp

Ksharp
( > you ask where are two most valuable posters?)

maybe they are missing because of local culture.
In this country little children are warned to be good or Father Christmas might not come.
Next Saturday, Santa Clause alias Father Christmas, will have delivered presents (overnight) to all the good children.
Now since this culture developed BI (before the interrnet), delivering to all the good children in one night takes a lot of planning and preparation and help, so perpaps the missing posters are among Santa's Elves (those who help Santa)
;-)
I understand some of this culture has spread to other parts of the world so now Santa's task is even bigger ;-))

merry christmas
Valued Guide
Posts: 2,177

Re: Create datasets based on different dates on date field.

Ksharp
pardon my winter joke ( :-) about the elves.

As to oveflowing macro variable capacity, there is always a choice of solutions, but since so much depends on the application, I would treat each as a special case. Once when I bumped into the 32K limit (similar to your case asigning labels in a macro variable) I mistakenly thought it was a statement limit, but it is just the limited capacity of the macro var. My quick work around was a generate macro and array of macro variables. one for each label. PROC SQL still filled the values, just into :lab1 - :lab9999 instead of into :labels separated by ' '
and collect the number of macro variables created from &sqlobs
and use something like that %gen() macro to generated the list %gen( &n_labs, pattern=%str(&)lab###)
Here is a demo to clarify
proc sql noprint ;
select cats( name, '=', quote(trim(coalesce(label,name))) )
into :lab1 - :lab9999
from dictionary.columns
where libname='SASHELP' and memname='CLASS'
;
%let n_labs=&sqlobs ;
quit ;
%put label %gen( &n_labs, pattern=%str(&)lab### );

demonstrates building a label statement
label Name="Name" Sex="Sex" Age="Age" Height="Height" Weight="Weight"

Each application is different when environment limits are approached, so I choose my preference when I know more.

Peter
Super User
Posts: 10,023

Re: Create datasets based on different dates on date field.

Dear Peter.
Good Holidays!

>I understand some of this culture has spread to other parts of the world so now Santa's task is even bigger ;-))


Yes.You are right . In China (Bei Jing) where I lived in, there are lots of Christmas Trees and Christmas Songs around me,and It is hard to see foreign friends now, maybe they have gone to their homeland to holiday.Unfortunately I have to work even though it is not too heavy,We will get Holidays until February 2,2011 that is Chinese Tradition New Year.

After look at your code, I found some problem in it.
First,Macro variable &sqlobs is only generated after sql statement executed,that is to mean you need execute the same sql statement twice.
Second.You label variable by using data step,which will scan the whole data set and will very slow when this dataset is very big (about 200G),So can use 'proc datasets +modify+label',that will be more efficient than yours ,since 'proc datasets' will not read the dataset in PDV and modify label directly.
BTW, the approach I used is what you mentioned.

Merry Christmas!
Cheers.
Ksharp
Ask a Question
Discussion stats
  • 16 replies
  • 1552 views
  • 0 likes
  • 5 in conversation