Re: Split data set into multiple data sets based on one column

Ronein · Posted 12-14-2023 01:57 AM

Hello

I want to split data set into multiple data sets based on values in one column.

I do it in the following way.

I would like to see other ways to do it please .

Data have;
format P_date ddmmyy10.;
input P_date : date9. Y;
Year_Month=CATX('_',Year(P_date),month(P_date));
cards;
01JAN2023 10
03JAN2023 20
03JAN2023 21
03JAN2023 22
31JAN2023 30
01FEB2023 15
28FEB2023 20
28FEB2023 30
28FEB2023 34
15MAR2023 40
31MAR2023 45
17MAR2023 70
;
Run;

proc sql noprint ;                                                                                                                      
select distinct  Year_Month  ,CATX('_',"T",Year_Month) as tbl
into : V_Year_Month   SEPARATED by '+' ,:V_tbl separated by '+'                                                                                                                                                                                                                     
from  have                                                                                                                             
;                                                                                                                                      
quit ;  
%put &V_Year_Month;
%put &V_tbl;

proc sql noprint;
select count(distinct Year_Month) as n into : n
from have
;
quit;
%put &n;


%macro Split; 
%do j=1 %to &n.;
%let Year_Month=%scan(&V_Year_Month.,&j.,+);
%let tbl=%scan(&V_tbl.,&j.,+);
Data &tbl.;
SET have(Where=(Year_Month="&Year_Month."));
Run;
%end;
%mend;
%split

Kurt_Bremser · Posted 12-14-2023 03:17 AM

Why do you think you need to split the data for further processing? Please show what you are doing after the split.

Since statistical procedures respect formatted values, it may well be that you don't even need the additional variable.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ronein · Posted 12-15-2023 12:30 PM

Let's say that this is the request of users to work with multiple data sets(Each data set per month) instead of one dat set.

It is a good question-

Let's say that you want to create dats of loans.

What would you preffer?

Create one data set that every month adding rows or every month create a new data set ?

yabwon · Posted 12-14-2023 03:39 AM

I would go with the macroArray package, like this:


%array(ds=have, vars=Year_Month|YM, macarray=Y)

options mprint;
data
  %do_over(YM, phrase=%nrstr(V_%YM(&_i_.)))
;

  set have;

  select(Year_Month);
    %do_over(YM, phrase=%nrstr(
     when ("%YM(&_i_.)") output V_%YM(&_i_.);
     ))
    otherwise put "unknown Year_Month value!" _N_=;
  end;
run;

log:

1    %array(ds=have, vars=Year_Month|YM, macarray=Y)
NOTE:[ARRAY] 3 macrovariables created
2
3    options mprint;
4    data
5      %do_over(YM, phrase=%nrstr(V_%YM(&_i_.)))
MPRINT(DO_OVER):   V_2023_1
MPRINT(DO_OVER):   V_2023_2
6    ;
7
8      set have;
9
10     select(Year_Month);
11       %do_over(YM, phrase=%nrstr(
12        when ("%YM(&_i_.)") output V_%YM(&_i_.);
13        ))
MPRINT(DO_OVER):   when ("2023_1") output V_2023_1;
MPRINT(DO_OVER):   when ("2023_2") output V_2023_2;
MPRINT(DO_OVER):   when ("2023_3") output V_2023_3;
14       otherwise put "unknown Year_Month value!" _N_=;
15     end;
16   run;

NOTE: There were 12 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.V_2023_1 has 5 observations and 3 variables.
NOTE: The data set WORK.V_2023_2 has 4 observations and 3 variables.
NOTE: The data set WORK.V_2023_3 has 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

BTW. with the data step's "select" statement and multiple "output" statements you go through input data only two times (first to get unique values, second to split data).

all the best

Bart

P.S. To install and use macroArray package do:

Enable the framework [first time only]:

filename SPFinit url "https://raw.githubusercontent.com/yabwon/SAS_PACKAGES/main/SPF/SPFinit.sas";
%include SPFinit; /* enable the framework */

Install the framework and the package on your machine in the folder you created:

filename packages "</your/directory/for/packages/>"; 
%installPackage(SPFinit macroArray)

From now on, in your SAS session just run it like this:

filename packages "</your/directory/for/packages/>";
%include packages(SPFinit.sas);

%loadPackage(macroArray)

Link to details.

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Ronein · Posted 12-15-2023 12:37 PM

Thanks,

What is YM??

Can you explain about the 3 arguments?

ds I see is the raw data set that I want to split

Year_Month is the coolumn by which I want to split the data set.

Where do you tell what will be the split data sets names?

ds=have, vars=Year_Month|YM, macarray=Y

yabwon · Posted 12-15-2023 03:54 PM

The construction:

Year_Month|YM

means: take variable "Year_Month", select only unique unique (the "|") values from it, and inset those values into a macro array: YM1, YM2, YM3, ..., YMn (where n is the number of unique values)

the "ds=" parameter of the %array() macro indicates from which dataset you want to take data to create the macro array YM,

the "macarray=Y" means except the YM1,..., YMn macro array, create also a %YM() macro which is then used in the %do_over() macro.

The data sets names are created in the %do_over()'s "phrase=" parameter, pay attention what is printed in the log in the MPRINT blocks. You will notice that the code used in the "phrase=" is:

when ("%YM(&_i_.)") output V_%YM(&_i_.);

so it calls the %YM() macro, and the result in the log shows:

MPRINT(DO_OVER):   when ("2023_1") output V_2023_1;
MPRINT(DO_OVER):   when ("2023_2") output V_2023_2;
MPRINT(DO_OVER):   when ("2023_3") output V_2023_3;

Detailed documentation on the %array() macro is here

Detailed documentation on the %do_over() macro is here

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

PaigeMiller · Posted 12-14-2023 05:57 AM

@Ronein wrote:

I want to split data set into multiple data sets based on values in one column.

In my opinion, this is an incredibly poor idea. You can leave everything in one data set, and then extract the part(s) you need when you don't need all of it (for example, by using WHERE), or better yet use the BY statement to perform analyses for each level of Year_Month.

Just because you can do something, does not mean you SHOULD do that thing.

--
Paige Miller

Ronein · Posted 12-15-2023 12:39 PM

What is the correct way to work?

Let's say that in the bank there are loans data and the bank produce monthly reports based on this data.

What is better ? Create one data set and every month add new rows to it? or every month create a new data set for example Loans_DEC2023?

The problem of creating one data set is by my opion that it can contain many rows and be heavy

ballardw · Posted 12-14-2023 09:39 AM

That particular "year_month" variable is even clumsier than typical. That will not sort in date order because you have have values like 2023_1 and 2023_10 which means that your "data sets" won't appear in date order either.

If you must use such character variables (no real proof shown that is needed at all) then at least control the conversion from numeric to character so things are consistent. This will create month portions of 01 02 ... 09 10 11 and 12 so at least these values will sort and appear in a reasonable order.

Year_Month=CATX('_',Year(P_date), put(month(P_date),z2.) );

Personally if I needed such values I would create a custom format to display them that way from the date. Or use one of the options of the YYMMXw. format such as YYMMD7.

Ronein · Posted 12-15-2023 12:44 PM

I agree

Year_Month=CATX('_',Year(P_date),put(month(P_date),Z2.));

is better than

Year_Month=CATX('_',Year(P_date),month(P_date));

May I ask-

If you add format to P_Date (For example YYMMD7. as you offer) than how would you split the data based on the criteria of year+month?

Reeza · Posted 12-15-2023 05:56 PM

Keep it in one data set, make sure you have indexes based on the year month.

Create a view for current_year month or standard periods of analysis. For example, most analysis at a company I worked at used 2 years historic plus current year so we had one slice with just that data and additionally a current_month slice via views. Both allowed us to hardcode reference dates in our reporting so that the data was automatically updated without needed to redefine data in reports. I do work at a company where we store monthly files as well, but that's because each months raw data is 50 million rows of data....

Tom · Posted 12-14-2023 12:11 PM

I wouldn't bother to but data into macro variables. Just use the data directly to generate the code.

Do you need to make empty datasets for YEAR_MONTH values that do not exist?

If not then just drive it from the existing values.

proc sort data=have(keep=Year_Month) nodupkey out=months;
  by Year_Month;
run;

Now use the list of values to generate the code. And use %INCLUDE to run it.

filename code temp;
data _null_;
  file code ;
  put 'data' ;
  do p=1 to nobs;
    set months point=p nobs=nobs;
    put 'T_' year_month ;
  end;
  put ';'/ 'set have;' ;
  do p=1 to nobs;
    set months point=p nobs=nobs;
    put 'if ' year_month=:$quote. 'then output t_' year_month ';' ;
  end;
  put 'run;'
  stop;
run;
%include code / source2;

So you get a program like:

data 
  t_2023_01
  t_2023_02
;
set have;
if year_month="2023_01" then output t_2023_01;
if year_month="2023_02" then output t_2023_02;
stop;
run;

If you are going to use macro code and generate a lot of macro variables and then use them to generate the code the it might look like this:

%macro split(dsname,varname,prefix);
%local i n name;
proc sql noprint ;
  select distinct &varname into :list1- from &dsname;
%let n=&sqlobs;
quit ;

data
%do i=1 %to &n;
  %let name=&prefix._&&list&i;
  &name.
%end;
;
  set &dsname;
%do i=1 %to &n;
  %let name=&prefix._&&list&i;
  if &varname = "&&list&i" then output &name. ;
%end;
  stop;
run;
%mend;

%split(dsname=have,varname=year_month,prefix=T)

Quentin · Posted 12-15-2023 01:13 PM

Agree with others that data splitting is rarely a good idea.

That said, if each of your groups can fit into a hash-table, then hash-table splitting, as illustrated by Paul Dorfman in e.g. http://www.lexjansen.com/nesug/nesug09/hw/HW04.pdf , allows this to be done in one pass of the data:

data have;
  format P_date ddmmyy10.;
  input P_date : date9. Y;
  Year_Month=put(P_date,yymmn6.) ;
cards;
01JAN2023 10
03JAN2023 20
03JAN2023 21
03JAN2023 22
31JAN2023 30
01FEB2023 15
28FEB2023 20
28FEB2023 30
28FEB2023 34
15MAR2023 40
31MAR2023 45
17MAR2023 70
;
run;

data _null_ ;
  if _n_=1 then do;
    declare hash h() ;
    h.definekey ("_n_") ;
    h.definedata ("p_date","Y","Year_Month") ;
    h.definedone () ;
  end;
  do _n_ = 1 by 1 until ( last.year_month) ;
    set have ;
    by year_month ;
    h.add() ;
  end ;
  h.output (dataset: cats("Want_",year_month) ) ;
  h.clear() ;
run ;

Splits into:

NOTE: The data set WORK.WANT_202301 has 5 observations and 3 variables.
NOTE: The data set WORK.WANT_202302 has 4 observations and 3 variables.
NOTE: The data set WORK.WANT_202303 has 3 observations and 3 variables.
NOTE: There were 12 observations read from the data set WORK.HAVE.

It's a lovely illustration of the power of the hash object. Never before was it possible to use a single DATA step to read data and dynamically determine the names of output datasets from that step.

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

Classroom Training Available!