Hello,
I am trying to separate the dataset based on the year of each observation (from 1985 to 2020). Is there any method to simplify the following code? I do not want to use %MACRO because it is a large dataset.
data want_1985 want_1986 want_1987 ... want_2019 want_2020 ; set have; if year(appln_filing_date)=1985 then output want_1985; if year(appln_filing_date)=1986 then output want_1986; ... if year(appln_filing_date)=2019 then output want_2019; if year(appln_filing_date)=2020 then output want_2020; run;
Many thanks in advance.
After some tests i have to adapt expectations to reality:
I created a dataset with 635000 obs and 21 variables. Splitting with call execute takes nearly five seconds, while using a macro the split was done less than half a second.
The macro:
%macro split;
%local minYear maxYear;
proc sql noprint;
select min(year(appln_filing_date)), max(year(appln_filing_date))
into :minYear trimmed, :maxYear trimmed
from work.have;
quit;
%local year;
data
%do year = &minYear. %to &maxYear.;
work.want_&year.
%end;
;
set work.have;
select(year(appln_filing_date));
%do year = &minYear. %to &maxYear.;
when (&year) output work.want_&year.;
%end;
end;
run;
%mend;
%split;
That's about as simple as you can make it unless you are willing to use a Macro.
But what's this about a large dataset and a Macro? Macros just generate SAS code. Macros don't cause problems with large datasets -- as long as you generate good code with your Macro. How the code performs with a large dataset is dependent on your programming skill, not on the Macro facility.
OK, now, look. First you need two little macros, like this:
%MACRO Gen_Dataset_Names(First, Last);
%DO i = &First %TO &Last;
Want_&i
%END;
%MEND Gen_Dataset_Names;
%MACRO Gen_If_Statements(First, Last);
%DO i = &First %TO &Last;
%IF &i > &First %THEN
%DO;
else
if year(appln_filing_date) = &i then output want_&i;
%END;
%ELSE
%DO;
if year(appln_filing_date) = &i then output want_&i;
%END;
%END;
%MEND Gen_If_Statements;
The first generates the output dataset names for the Data statement. The second generates the If statements. By the way, notice the "else" I've placed in with the If statements. This else will make your code execute faster than the way that you had it.
We can test this code with the following Put statements, but we have to remove the semi-colons (temporarily) from the If statements or there will be an error.
%PUT %Gen_Dataset_Names(1986, 2020);
%PUT %Gen_If_Statements(1986, 2020);
Here's the result from the first. It's just text. Macros are not innately slow.
Want_1986 Want_1987 Want_1988 Want_1989 Want_1990 Want_1991 Want_1992 Want_1993 Want_1994 Want_1995 Want_1996 Want_1997 Want_1998 Want_1999 Want_2000 Want_2001 Want_2002 Want_2003 Want_2004 Want_2005 Want_2006 Want_2007 Want_2008 Want_2009 Want_2010 Want_2011 Want_2012 Want_2013 Want_2014 Want_2015 Want_2016 Want_2017 Want_2018 Want_2019 Want_2020
Here's the result from the second.
if year(appln_filing_date) = 1986 then output want_1986 else if year(appln_filing_date) = 1987 then output want_1987
else if year(appln_filing_date) = 1988 then output want_1988 else if year(appln_filing_date) = 1989 then
output want_1989 else if year(appln_filing_date) = 1990 then output want_1990 else if
year(appln_filing_date) = 1991 then output want_1991 else if year(appln_filing_date) = 1992 then output want_1992
else if year(appln_filing_date) = 1993 then output want_1993 else if year(appln_filing_date) = 1994 then output
want_1994 else if year(appln_filing_date) = 1995 then output want_1995 else if year(appln_filing_date) = 1996 then output want_1996 else if year(appln_filing_date) = 1997 then output want_1997 else if
year(appln_filing_date) = 1998 then output want_1998 else if year(appln_filing_date) = 1999 then output want_1999
else if year(appln_filing_date) = 2000 then output want_2000 else if year(appln_filing_date) = 2001 then output
want_2001 else if year(appln_filing_date) = 2002 then output want_2002 else if year(appln_filing_date) =
2003 then output want_2003 else if year(appln_filing_date) = 2004 then output want_2004 else if
year(appln_filing_date) = 2005 then output want_2005 else if year(appln_filing_date) = 2006 then output want_2006
else if year(appln_filing_date) = 2007 then output want_2007 else if year(appln_filing_date) = 2008 then output
want_2008 else if year(appln_filing_date) = 2009 then output want_2009 else if year(appln_filing_date) =
2010 then output want_2010 else if year(appln_filing_date) = 2011 then output want_2011 else if
year(appln_filing_date) = 2012 then output want_2012 else if year(appln_filing_date) = 2013 then output want_2013
else if year(appln_filing_date) = 2014 then output want_2014 else if year(appln_filing_date) = 2015 then output
want_2015 else if year(appln_filing_date) = 2016 then output want_2016 else if year(appln_filing_date) =
2017 then output want_2017 else if year(appln_filing_date) = 2018 then output want_2018 else if
year(appln_filing_date) = 2019 then output want_2019 else if year(appln_filing_date) = 2020 then output want_2020
It's not very tidy, but it will run (once we put the semi colons back in on the If statements).
The final program would look like the below. I don't have your data otherwise I would run it. I could have a typo or something, but this is in general proper SAS code.
DATA %Gen_Dataset_Names(1986, 2020);
SET Have;
%Gen_If_Statements(1986, 2020);
RUN;
Jim
Why do you want to separate the dataset at all?
You could simplify the code by using call execute:
proc sql;
create table work.years as
select distinct year(appln_filing_date) as year
from have;
quit;
data _null_;
set work.years;
call execute(cats('data work.want_', year, ';'));
call execute('set work.have;');
call execute(cats('where year(appln_filing_date)=', year, ';'));
call execute('run;');
run;
Very good, @andreas_lds! That's an excellent idea to use Call Execute in that fashion.
One caution perhaps: If it's a truly large dataset, then one could have a performance problem. The entire dataset must be read for every one of the Call Executes, which, for years 1986 through 2020, would be about 35 times. If, say, the dataset were 500,000,000 rows, then you could be in for a bit of a wait. 🙂
If one executes the code in a single Data step, as in my expanded reply, above, then the dataset would be read only once. I like the compactness of your solution, and if there were a performance problem, then one could switch to a single pass solution such as the one I added to my initial reply.
Jim
I expect that using where to filter the observations can compensate the overhead partially, but you are right, a macro will most likely perform better, if the data is really large. Maybe @Alexxxxxxx can share some facts about the dimension of the dataset: number of variables, number of observations.
After some tests i have to adapt expectations to reality:
I created a dataset with 635000 obs and 21 variables. Splitting with call execute takes nearly five seconds, while using a macro the split was done less than half a second.
The macro:
%macro split;
%local minYear maxYear;
proc sql noprint;
select min(year(appln_filing_date)), max(year(appln_filing_date))
into :minYear trimmed, :maxYear trimmed
from work.have;
quit;
%local year;
data
%do year = &minYear. %to &maxYear.;
work.want_&year.
%end;
;
set work.have;
select(year(appln_filing_date));
%do year = &minYear. %to &maxYear.;
when (&year) output work.want_&year.;
%end;
end;
run;
%mend;
%split;
One dynamically created data step will always be faster than a series of data steps, one for every selection. In the case of years, one can create a macro that does the loops internally:
%macro split(start,end);
%local i;
data
%do i = &start. %to %end.;
want&i.
%end;
;
set have;
%do i = &start. %to &end.;
if year = &i. then output want&i.;
%end;
run;
%mend;
%split
But still, splitting will move the contents of the dataset three times through I/O:
and all this before you have done ANY analysis.
In all my professional work with SAS, I have done this exactly once, and only because sorting a large dataset would have cracked my quota, preventing me from testing a production program in my personal environment (the batch user has no quotas). I split along the top-level sort variable, sorted/summarized the splits, and concatenated them back. It also has the side effect that the program can't eat too much work storage, possibly colliding with other batch jobs running concurrently.
@andreas_lds wrote:
select(year(appln_filing_date)); %do year = &minYear. %to &maxYear.; when (&year) output work.want_&year.; %end;
Nice use of the select statement. 👍
Regarding the use of:
proc sql noprint;
select min(year(appln_filing_date)), max(year(appln_filing_date))
into :minYear trimmed, :maxYear trimmed
from work.have;
quit;
that's going to depend a bit on the business need, at least that's how it works in my company. Our database has many years in it, but we typically only want the current year + the prior three years. So, is it better to use SQL to detect which years are present or is it better to build the macro with years as arguments like the below?
%Gen_Dataset_Names(1986, 2020);
Well, as say, "it depends." 🙂
Now, as for the CALL EXECUTEs that you coded, even if it's not the fastest here, it's still a valuable technique to be aware of -- launching multiple steps from within a single Data step. CALL EXECUTEs are run serially, i.e. one after the other. Another option, instead of CALL EXECUTE is RSUBMIT (or SYSTASK) which launches processes in parallel, i.e. concurrently. Parallel processing, depending on the requirements, can significantly speed overall processing.
Jim
Define "large" in terms of observation count and physical dataset size.
Which analysis do you intend to do? All analysis procedures allow the use of BY or CLASS, so splitting is not necessary.
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Select SAS Training centers are offering in-person courses. View upcoming courses for: