BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Alexxxxxxx
Pyrite | Level 9

Hello,

 

I am trying to separate the dataset based on the year of each observation (from 1985 to 2020). Is there any method to simplify the following code? I do not want to use %MACRO because it is a large dataset.

 

	data want_1985
		want_1986
		want_1987
		...
		want_2019
		want_2020
		;
		set have;
		if year(appln_filing_date)=1985 then output want_1985;
		if year(appln_filing_date)=1986 then output want_1986;
		...
		if year(appln_filing_date)=2019 then output want_2019;
		if year(appln_filing_date)=2020 then output want_2020;
	run;

Many thanks in advance.

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
andreas_lds
Jade | Level 19

After some tests i have to adapt expectations to reality:

I created a dataset with 635000 obs and 21 variables. Splitting with call execute takes nearly five seconds, while using a macro the split was done less than half a second.

The macro:

%macro split;
   %local minYear maxYear;
   proc sql noprint;
      select min(year(appln_filing_date)), max(year(appln_filing_date)) 
         into :minYear trimmed, :maxYear trimmed
         from work.have;
   quit;
   
   %local year;
   
   data 
      %do year = &minYear. %to &maxYear.;
         work.want_&year.
      %end;
      ;
      
      set work.have;
      
      select(year(appln_filing_date));
         %do year = &minYear. %to &maxYear.;
            when (&year) output work.want_&year.;
         %end;
      end;      
   run;
%mend;

%split;

View solution in original post

8 REPLIES 8
jimbarbour
Meteorite | Level 14

That's about as simple as you can make it unless you are willing to use a Macro.

 

But what's this about a large dataset and a Macro?  Macros just generate SAS code.  Macros don't cause problems with large datasets -- as long as you generate good code with your Macro.  How the code performs with a large dataset is dependent on your programming skill, not on the Macro facility.

 

OK, now, look.  First you need two little macros, like this:

%MACRO	Gen_Dataset_Names(First, Last);
	%DO	i	=	&First	%TO	&Last;
		Want_&i
	%END;
%MEND	Gen_Dataset_Names;

%MACRO	Gen_If_Statements(First, Last);
	%DO	i	=	&First	%TO	&Last;
		%IF	&i	>	&First	%THEN
			%DO;
				else
				if year(appln_filing_date) = &i then output want_&i;
			%END;
		%ELSE	
			%DO;
				if year(appln_filing_date) = &i then output want_&i;
			%END;
	%END;
%MEND	Gen_If_Statements;

The first generates the output dataset names for the Data statement.  The second generates the If statements.  By the way, notice the "else" I've placed in with the If statements.  This else will make your code execute faster than the way that you had it.

 

We can test this code with the following Put statements, but we have to remove the semi-colons (temporarily) from the If statements or there will be an error.

%PUT %Gen_Dataset_Names(1986, 2020);
%PUT %Gen_If_Statements(1986, 2020);

Here's the result from the first.  It's just text.  Macros are not innately slow.

Want_1986   Want_1987   Want_1988   Want_1989   Want_1990   Want_1991   Want_1992   Want_1993   Want_1994   Want_1995   
Want_1996   Want_1997   Want_1998   Want_1999   Want_2000   Want_2001   Want_2002   Want_2003   Want_2004   Want_2005   
Want_2006   Want_2007   Want_2008   Want_2009   Want_2010   Want_2011   Want_2012   Want_2013   Want_2014   Want_2015   
Want_2016   Want_2017   Want_2018   Want_2019   Want_2020

Here's the result from the second.

if year(appln_filing_date) = 1986 then output want_1986     else     if year(appln_filing_date) = 1987 then output want_1987 
    else     if year(appln_filing_date) = 1988 then output want_1988     else     if year(appln_filing_date) = 1989 then 
output want_1989     else     if year(appln_filing_date) = 1990 then output want_1990     else     if 
year(appln_filing_date) = 1991 then output want_1991     else     if year(appln_filing_date) = 1992 then output want_1992    
 else     if year(appln_filing_date) = 1993 then output want_1993     else     if year(appln_filing_date) = 1994 then output 
want_1994     else     if year(appln_filing_date) = 1995 then output want_1995     else     if year(appln_filing_date) = 1996 then output want_1996     else     if year(appln_filing_date) = 1997 then output want_1997     else     if 
year(appln_filing_date) = 1998 then output want_1998     else     if year(appln_filing_date) = 1999 then output want_1999    
 else     if year(appln_filing_date) = 2000 then output want_2000     else     if year(appln_filing_date) = 2001 then output 
want_2001     else     if year(appln_filing_date) = 2002 then output want_2002     else     if year(appln_filing_date) = 
2003 then output want_2003     else     if year(appln_filing_date) = 2004 then output want_2004     else     if 
year(appln_filing_date) = 2005 then output want_2005     else     if year(appln_filing_date) = 2006 then output want_2006    
 else     if year(appln_filing_date) = 2007 then output want_2007     else     if year(appln_filing_date) = 2008 then output 
want_2008     else     if year(appln_filing_date) = 2009 then output want_2009     else     if year(appln_filing_date) = 
2010 then output want_2010     else     if year(appln_filing_date) = 2011 then output want_2011     else     if 
year(appln_filing_date) = 2012 then output want_2012     else     if year(appln_filing_date) = 2013 then output want_2013    
 else     if year(appln_filing_date) = 2014 then output want_2014     else     if year(appln_filing_date) = 2015 then output 
want_2015     else     if year(appln_filing_date) = 2016 then output want_2016     else     if year(appln_filing_date) = 
2017 then output want_2017     else     if year(appln_filing_date) = 2018 then output want_2018     else     if 
year(appln_filing_date) = 2019 then output want_2019     else     if year(appln_filing_date) = 2020 then output want_2020

It's not very tidy, but it will run (once we put the semi colons back in on the If statements).

 

The final program would look like the below.  I don't have your data otherwise I would run it.  I could have a typo or something, but this is  in general proper SAS code.

DATA	%Gen_Dataset_Names(1986, 2020);
	SET	Have;
	%Gen_If_Statements(1986, 2020);
RUN;

Jim

andreas_lds
Jade | Level 19

Why do you want to separate the dataset at all?

You could simplify the code by using call execute:

proc sql;
   create table work.years as
      select distinct year(appln_filing_date) as year
         from have;
quit;

data _null_;
   set work.years;
   
   call execute(cats('data work.want_', year, ';'));
   call execute('set work.have;');
   call execute(cats('where year(appln_filing_date)=', year, ';'));
   call execute('run;');   
run;
jimbarbour
Meteorite | Level 14

Very good, @andreas_lds!  That's an excellent idea to use Call Execute in that fashion.

 

One caution perhaps:  If it's a truly large dataset, then one could have a performance problem.  The entire dataset must be read for every one of the Call Executes, which, for years 1986 through 2020, would be about 35 times.  If, say, the dataset were 500,000,000 rows, then you could be in for a bit of a wait.  🙂

 

If one executes the code in a single Data step, as in my expanded reply, above, then the dataset would be read only once.  I like the compactness of your solution, and if there were a performance problem, then one could switch to a single pass solution such as the one I added to my initial reply.

 

Jim

andreas_lds
Jade | Level 19

I expect that using where to filter the observations can compensate the overhead partially, but you are right, a macro will most likely perform better, if the data is really large. Maybe @Alexxxxxxx can share some facts about the dimension of the dataset: number of variables, number of observations.

andreas_lds
Jade | Level 19

After some tests i have to adapt expectations to reality:

I created a dataset with 635000 obs and 21 variables. Splitting with call execute takes nearly five seconds, while using a macro the split was done less than half a second.

The macro:

%macro split;
   %local minYear maxYear;
   proc sql noprint;
      select min(year(appln_filing_date)), max(year(appln_filing_date)) 
         into :minYear trimmed, :maxYear trimmed
         from work.have;
   quit;
   
   %local year;
   
   data 
      %do year = &minYear. %to &maxYear.;
         work.want_&year.
      %end;
      ;
      
      set work.have;
      
      select(year(appln_filing_date));
         %do year = &minYear. %to &maxYear.;
            when (&year) output work.want_&year.;
         %end;
      end;      
   run;
%mend;

%split;
Kurt_Bremser
Super User

One dynamically created data step will always be faster than a series of data steps, one for every selection. In the case of years, one can create a macro that does the loops internally:

%macro split(start,end);
%local i;
data
%do i = &start. %to %end.;
  want&i.
%end;
;
set have;
%do i = &start. %to &end.;
if year = &i. then output want&i.;
%end;
run;
%mend;
%split

But still, splitting will move the contents of the dataset three times through I/O:

  1. to determine the split values (but only necessary if the edge values are not known beforehand)
  2. to read the dataset
  3. to write the split datasets

and all this before you have done ANY analysis.

 

In all my professional work with SAS, I have done this exactly once, and only because sorting a large dataset would have cracked my quota, preventing me from testing a production program in my personal environment (the batch user has no quotas). I split along the top-level sort variable, sorted/summarized the splits, and concatenated them back. It also has the side effect that the program can't eat too much work storage, possibly colliding with other batch jobs running concurrently.

jimbarbour
Meteorite | Level 14

@andreas_lds wrote:
      select(year(appln_filing_date));
         %do year = &minYear. %to &maxYear.;
            when (&year) output work.want_&year.;
         %end;

Nice use of the select statement.  👍

 

Regarding the use of:

   proc sql noprint;
      select min(year(appln_filing_date)), max(year(appln_filing_date)) 
         into :minYear trimmed, :maxYear trimmed
         from work.have;
   quit;

that's going to depend a bit on the business need, at least that's how it works in my company.  Our database has many years in it, but we typically only want the current year + the prior three years.  So, is it better to use SQL to detect which years are present or is it better to build the macro with years as arguments like the below?

%Gen_Dataset_Names(1986, 2020);

Well, as say, "it depends."  🙂

 

Now, as for the CALL EXECUTEs that you coded, even if it's not the fastest here, it's still a valuable technique to be aware of -- launching multiple steps from within a single Data step.  CALL EXECUTEs are run serially, i.e. one after the other.  Another option, instead of CALL EXECUTE is RSUBMIT (or SYSTASK) which launches processes in parallel, i.e. concurrently.  Parallel processing, depending on the requirements, can significantly speed overall processing.

 

Jim

Kurt_Bremser
Super User

Define "large" in terms of observation count and physical dataset size.

Which analysis do you intend to do? All analysis procedures allow the use of BY or CLASS, so splitting is not necessary.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1485 views
  • 13 likes
  • 4 in conversation