Hello
I want to split data set into multiple data sets based on values in one column.
I do it in the following way.
I would like to see other ways to do it please .
Data have;
format P_date ddmmyy10.;
input P_date : date9. Y;
Year_Month=CATX('_',Year(P_date),month(P_date));
cards;
01JAN2023 10
03JAN2023 20
03JAN2023 21
03JAN2023 22
31JAN2023 30
01FEB2023 15
28FEB2023 20
28FEB2023 30
28FEB2023 34
15MAR2023 40
31MAR2023 45
17MAR2023 70
;
Run;
proc sql noprint ;
select distinct Year_Month ,CATX('_',"T",Year_Month) as tbl
into : V_Year_Month SEPARATED by '+' ,:V_tbl separated by '+'
from have
;
quit ;
%put &V_Year_Month;
%put &V_tbl;
proc sql noprint;
select count(distinct Year_Month) as n into : n
from have
;
quit;
%put &n;
%macro Split;
%do j=1 %to &n.;
%let Year_Month=%scan(&V_Year_Month.,&j.,+);
%let tbl=%scan(&V_tbl.,&j.,+);
Data &tbl.;
SET have(Where=(Year_Month="&Year_Month."));
Run;
%end;
%mend;
%split
Why do you think you need to split the data for further processing? Please show what you are doing after the split.
Since statistical procedures respect formatted values, it may well be that you don't even need the additional variable.
Let's say that this is the request of users to work with multiple data sets(Each data set per month) instead of one dat set.
It is a good question-
Let's say that you want to create dats of loans.
What would you preffer?
Create one data set that every month adding rows or every month create a new data set ?
I would go with the macroArray package, like this:
%array(ds=have, vars=Year_Month|YM, macarray=Y)
options mprint;
data
%do_over(YM, phrase=%nrstr(V_%YM(&_i_.)))
;
set have;
select(Year_Month);
%do_over(YM, phrase=%nrstr(
when ("%YM(&_i_.)") output V_%YM(&_i_.);
))
otherwise put "unknown Year_Month value!" _N_=;
end;
run;
log:
1 %array(ds=have, vars=Year_Month|YM, macarray=Y) NOTE:[ARRAY] 3 macrovariables created 2 3 options mprint; 4 data 5 %do_over(YM, phrase=%nrstr(V_%YM(&_i_.))) MPRINT(DO_OVER): V_2023_1 MPRINT(DO_OVER): V_2023_2 6 ; 7 8 set have; 9 10 select(Year_Month); 11 %do_over(YM, phrase=%nrstr( 12 when ("%YM(&_i_.)") output V_%YM(&_i_.); 13 )) MPRINT(DO_OVER): when ("2023_1") output V_2023_1; MPRINT(DO_OVER): when ("2023_2") output V_2023_2; MPRINT(DO_OVER): when ("2023_3") output V_2023_3; 14 otherwise put "unknown Year_Month value!" _N_=; 15 end; 16 run; NOTE: There were 12 observations read from the data set WORK.HAVE. NOTE: The data set WORK.V_2023_1 has 5 observations and 3 variables. NOTE: The data set WORK.V_2023_2 has 4 observations and 3 variables. NOTE: The data set WORK.V_2023_3 has 3 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds
BTW. with the data step's "select" statement and multiple "output" statements you go through input data only two times (first to get unique values, second to split data).
all the best
Bart
P.S. To install and use macroArray package do:
filename SPFinit url "https://raw.githubusercontent.com/yabwon/SAS_PACKAGES/main/SPF/SPFinit.sas";
%include SPFinit; /* enable the framework */
filename packages "</your/directory/for/packages/>";
%installPackage(SPFinit macroArray)
filename packages "</your/directory/for/packages/>";
%include packages(SPFinit.sas);
%loadPackage(macroArray)
Thanks,
What is YM??
Can you explain about the 3 arguments?
ds I see is the raw data set that I want to split
Year_Month is the coolumn by which I want to split the data set.
Where do you tell what will be the split data sets names?
ds=have, vars=Year_Month|YM, macarray=Y
The construction:
Year_Month|YM
means: take variable "Year_Month", select only unique unique (the "|") values from it, and inset those values into a macro array: YM1, YM2, YM3, ..., YMn (where n is the number of unique values)
the "ds=" parameter of the %array() macro indicates from which dataset you want to take data to create the macro array YM,
the "macarray=Y" means except the YM1,..., YMn macro array, create also a %YM() macro which is then used in the %do_over() macro.
The data sets names are created in the %do_over()'s "phrase=" parameter, pay attention what is printed in the log in the MPRINT blocks. You will notice that the code used in the "phrase=" is:
when ("%YM(&_i_.)") output V_%YM(&_i_.);
so it calls the %YM() macro, and the result in the log shows:
MPRINT(DO_OVER): when ("2023_1") output V_2023_1; MPRINT(DO_OVER): when ("2023_2") output V_2023_2; MPRINT(DO_OVER): when ("2023_3") output V_2023_3;
Detailed documentation on the %array() macro is here
Detailed documentation on the %do_over() macro is here
Bart
@Ronein wrote:
I want to split data set into multiple data sets based on values in one column.
In my opinion, this is an incredibly poor idea. You can leave everything in one data set, and then extract the part(s) you need when you don't need all of it (for example, by using WHERE), or better yet use the BY statement to perform analyses for each level of Year_Month.
Just because you can do something, does not mean you SHOULD do that thing.
What is the correct way to work?
Let's say that in the bank there are loans data and the bank produce monthly reports based on this data.
What is better ? Create one data set and every month add new rows to it? or every month create a new data set for example Loans_DEC2023?
The problem of creating one data set is by my opion that it can contain many rows and be heavy
That particular "year_month" variable is even clumsier than typical. That will not sort in date order because you have have values like 2023_1 and 2023_10 which means that your "data sets" won't appear in date order either.
If you must use such character variables (no real proof shown that is needed at all) then at least control the conversion from numeric to character so things are consistent. This will create month portions of 01 02 ... 09 10 11 and 12 so at least these values will sort and appear in a reasonable order.
Year_Month=CATX('_',Year(P_date), put(month(P_date),z2.) );
Personally if I needed such values I would create a custom format to display them that way from the date. Or use one of the options of the YYMMXw. format such as YYMMD7.
I agree
Year_Month=CATX('_',Year(P_date),put(month(P_date),Z2.));
is better than
Year_Month=CATX('_',Year(P_date),month(P_date));
May I ask-
If you add format to P_Date (For example YYMMD7. as you offer) than how would you split the data based on the criteria of year+month?
Keep it in one data set, make sure you have indexes based on the year month.
Create a view for current_year month or standard periods of analysis. For example, most analysis at a company I worked at used 2 years historic plus current year so we had one slice with just that data and additionally a current_month slice via views. Both allowed us to hardcode reference dates in our reporting so that the data was automatically updated without needed to redefine data in reports. I do work at a company where we store monthly files as well, but that's because each months raw data is 50 million rows of data....
I wouldn't bother to but data into macro variables. Just use the data directly to generate the code.
Do you need to make empty datasets for YEAR_MONTH values that do not exist?
If not then just drive it from the existing values.
proc sort data=have(keep=Year_Month) nodupkey out=months;
by Year_Month;
run;
Now use the list of values to generate the code. And use %INCLUDE to run it.
filename code temp;
data _null_;
file code ;
put 'data' ;
do p=1 to nobs;
set months point=p nobs=nobs;
put 'T_' year_month ;
end;
put ';'/ 'set have;' ;
do p=1 to nobs;
set months point=p nobs=nobs;
put 'if ' year_month=:$quote. 'then output t_' year_month ';' ;
end;
put 'run;'
stop;
run;
%include code / source2;
So you get a program like:
data
t_2023_01
t_2023_02
;
set have;
if year_month="2023_01" then output t_2023_01;
if year_month="2023_02" then output t_2023_02;
stop;
run;
If you are going to use macro code and generate a lot of macro variables and then use them to generate the code the it might look like this:
%macro split(dsname,varname,prefix);
%local i n name;
proc sql noprint ;
select distinct &varname into :list1- from &dsname;
%let n=&sqlobs;
quit ;
data
%do i=1 %to &n;
%let name=&prefix._&&list&i;
&name.
%end;
;
set &dsname;
%do i=1 %to &n;
%let name=&prefix._&&list&i;
if &varname = "&&list&i" then output &name. ;
%end;
stop;
run;
%mend;
%split(dsname=have,varname=year_month,prefix=T)
Agree with others that data splitting is rarely a good idea.
That said, if each of your groups can fit into a hash-table, then hash-table splitting, as illustrated by Paul Dorfman in e.g. http://www.lexjansen.com/nesug/nesug09/hw/HW04.pdf , allows this to be done in one pass of the data:
data have;
format P_date ddmmyy10.;
input P_date : date9. Y;
Year_Month=put(P_date,yymmn6.) ;
cards;
01JAN2023 10
03JAN2023 20
03JAN2023 21
03JAN2023 22
31JAN2023 30
01FEB2023 15
28FEB2023 20
28FEB2023 30
28FEB2023 34
15MAR2023 40
31MAR2023 45
17MAR2023 70
;
run;
data _null_ ;
if _n_=1 then do;
declare hash h() ;
h.definekey ("_n_") ;
h.definedata ("p_date","Y","Year_Month") ;
h.definedone () ;
end;
do _n_ = 1 by 1 until ( last.year_month) ;
set have ;
by year_month ;
h.add() ;
end ;
h.output (dataset: cats("Want_",year_month) ) ;
h.clear() ;
run ;
Splits into:
NOTE: The data set WORK.WANT_202301 has 5 observations and 3 variables. NOTE: The data set WORK.WANT_202302 has 4 observations and 3 variables. NOTE: The data set WORK.WANT_202303 has 3 observations and 3 variables. NOTE: There were 12 observations read from the data set WORK.HAVE.
It's a lovely illustration of the power of the hash object. Never before was it possible to use a single DATA step to read data and dynamically determine the names of output datasets from that step.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.