Solved: How to split one dataset too many based on the values (categorical) on...

georgel · Posted 12-27-2020 06:55 AM

Hi

I am trying to split the data set into many here is the data have :

data have;
input cusip_id Name $10. date_issue :ddmmyy10. type $10.;
format date_issue ddmmyy10.;
datalines;
10343452 Zingas    01/04/2016 bench
10343489 Stephanie 01/04/2014 main
10343445 Olivia    01/04/2012 main
10343456 Amstel    01/04/2011 bench
10343457 KEO       01/04/2017 main
;
run;

My solution:

data data_main;
       set have;
       If  type="main";
run;

data data_bech;
       set have;
       If  type="bench";
run;

This is in fact an awkard/cumbersome solution. I would like to utilize a more efficient algorithm for advanced SAS programming to avoid repetition of code. e.g. using proc sql distinct to set the values in a macro variable

Please I am looking forward from your prompt reply

singhsahab · Posted 12-27-2020 11:22 AM

proc sql noprint;
select count(distinct(type)) into :c from have;
select distinct type into :typ1 - :typ%trim(%left(&c)) from have;
quit;

%macro create_table;
%do i=1 %to &c;

 data &&typ&i;
 set have ;
 if upcase(type) eq %upcase("&&typ&i");
 run;

%end;
%mend;

%create_table;

View solution in original post

RichardDeVen · Posted 12-27-2020 07:24 AM

Generally you won't want to split a data set in such a manner. Learning to judiciously code BY, CLASS, and WHERE statements will give you great control over slicing and dicing in the reporting and analytic procedures you use.

Depending on the size of the original file, you can use a hash of hash approach to dynamically split a table into value named sub tables.

From samples page

/* Richard A. DeVenezia
 * www.devenezia.com
 * 25May2004
 *
 * Similar to material originally posted to SAS-L on May 24, 2004
 * "Hash of hashes and dynamic output"
 */

* sample data with several groups;

data have0;
  do group = 'a','c','d','e';
    do _n_ = 1 to 100;
      date = today() - rand('integer', 365);
      satellite_1 + 1;
      satellite_2 = 0;
      satellite_A = 'foo bar';
      output;
    end;
  end;
  format date date9.;
run;

* disorder for good measure;

proc sql;
  create table have as select * from have0 order by ranuni(1);
quit;

* split one data set into several data sets named according to variable group;

data _null_;
  declare hash h0 ();
  h0.defineKey ('group');
  h0.defineData ('group', 'h');  * hash of hashes, data element h is a hash object;
  h0.defineDone ();

  declare hash h;

  do _n_ = 1 by 1 until (endOfDataset);
    set have end = endOfDataset;

    if group ne lag(group) then 
      if 0 ne h0.find() then do;
        * create a hash for the group that will be a sub data set;
        h = _new_ hash (ordered:'a');
        h.defineKey ('_n_');
        h.defineData ('group', 'date', 'satellite_1', 'satellite_2', 'satellite_A');
        h.defineDone ();
        h0.add();
      end;

    * add data to the groups hash;
    h.replace ();
  end;

  declare hiter hi ('h0');

  * create one table for each group;
  do while (hi.next() = 0);
    h.output (dataset:'group_'||group);
  end;
run;

Other approaches would use a pre-step to generate source code that is a wallpaper of if/then or select/when statements that output to sub tables named according to group value. The pre step would either put the source code in macro variable(s) or write the source code to a file that is %included.

PaigeMiller · Posted 12-27-2020 07:42 AM

Unless the size of the data set is humongous, keeping it as one is usually a better approach. Then you do the next steps (analysis, reporting, plotting) using a BY statement, or excluding specific categories from the analysis.

--
Paige Miller

Kurt_Bremser · Posted 12-27-2020 08:08 AM

The static solution is this:

data
  data_main
  data_bench
;
set have;
select (type);
  when ("main") output data_main;
  when ("bench") output data_bench;
  otherwise;
end;
run;

This can be made dynamic:

proc sort
  data=have (keep=type)
  out=control
  nodupkey
;
by type;
run;

filename incfile temp;

data _null_;
file incfile;
length line $100;
put "data";
do until eof1;
  set control end=eof1;
  line = " data_" !! type;
  put line;
end:
put ";";
put "select (type);";
do until eof2;
  set control end=eof2;
  line = "when ('" !! strip(type) !! "') output data_" !! strip(type) !! ";";
  put line;
end;
put "end;run;";
run;

%inc incfile;

Untested, posted from my tablet.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

singhsahab · Posted 12-27-2020 11:22 AM

proc sql noprint;
select count(distinct(type)) into :c from have;
select distinct type into :typ1 - :typ%trim(%left(&c)) from have;
quit;

%macro create_table;
%do i=1 %to &c;

 data &&typ&i;
 set have ;
 if upcase(type) eq %upcase("&&typ&i");
 run;

%end;
%mend;

%create_table;

Tom · Posted 12-27-2020 11:54 AM

There is no need to run the query twice just to count how many values it returns. PROC SQL will count them for you.

proc sql noprint;
  select distinct type into :typ1 -  from have;
%let c=&sqlobs;
quit;

georgel · Posted 12-27-2020 02:42 PM

Tom thank you so much for the simplification

georgel · Posted 12-27-2020 02:40 PM

ABSOLUTELY PERFECT! THANKS A MILLION singhsahab

Kurt_Bremser · Posted 12-27-2020 05:24 PM

Please note that @singhsahab 's solution will read the.whole "have" dataset for every distinct value of type. With a large dataset and a high cardinality, this can become VERY time consuming.

My solution reads "have" only twice (sort and the dynamically created data step), regardless of the cardinality of type.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

yabwon · Posted 12-27-2020 03:24 PM

If I may, I would do it with one data read with help of the SELECT statement:

data have;
input cusip_id Name $10. date_issue :ddmmyy10. type $10.;
format date_issue ddmmyy10.;
datalines;
10343452 Zingas    01/04/2016 bench
10343489 Stephanie 01/04/2014 main
10343445 Olivia    01/04/2012 main
10343456 Amstel    01/04/2011 bench
10343457 KEO       01/04/2017 main
;
run;

proc sql noprint;
select distinct type 
into :typ1 - 
from have;
%let c = &sqlobs.;
quit;

%macro create_table_with_1_read();

 data 
  %do i=1 %to &c;
    &&typ&i
  %end;
  ;
 set have ;

  select;
    %do i=1 %to &c;
      when (upcase(type) eq %upcase("&&typ&i")) output &&typ&i;
    %end;
    otherwise put "ERROR: unknown type";
  end;
 run;

%mend create_table_with_1_read;

%create_table_with_1_read()

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

georgel · Posted 12-27-2020 09:49 PM

Bart Many thanks for the solution. I have tested it. It works perfectly.

RichardDeVen · Posted 12-28-2020 03:21 PM

Thinking in the large, if you order the types by frequency count descending you will have the most common true whens first, reducing the number of tests that are done in your submitted codegen.

proc sql noprint;
select type
into :typ1-
from
( select type, count(*) as freq
  from have
  group by type
)
order by freq desc
;
%let c = &sqlobs.;
quit;

%put &=typ1;  *main (freq=3);
%put &=typ2;  *bench (freq=2);

Yanni_C · Posted 11-11-2022 02:24 AM

From the question above, if I want to get the datasets with name 'dataset1, dataset2, dataset3,……,dataset100', rather than naming the datasets by their observation's names. Could you please tell me how should I modify the code?

Kurt_Bremser · Posted 11-11-2022 02:45 AM

First question, as always in such cases: why do you think it necessary to split your dataset like this?

Second, what would be the rule for putting particular source observations into a particular target dataset?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Yanni_C · Posted 11-12-2022 01:04 AM

1. Since when I try to do rolling regression, there will be a 'sort failed' error, maybe because the amount of data is too large,

2. I want to split the dataset based on the different observations in 'characteristics', but I don't want to name the dataset using the name of the observations, I want to use dataset1, dataset2, dataset3,...

How to split one dataset too many based on the values (categorical) on one column?

Re: How to split one dataset too many based on the values (categorical) on one column?

Re: How to split one dataset too many based on the values (categorical) on one column?

Re: How to split one dataset too many based on the values (categorical) on one column?

Re: How to split one dataset too many based on the values (categorical) on one column?

Re: How to split one dataset too many based on the values (categorical) on one column?

Re: How to split one dataset too many based on the values (categorical) on one column?

Re: How to split one dataset too many based on the values (categorical) on one column?

Re: How to split one dataset too many based on the values (categorical) on one column?

Re: How to split one dataset too many based on the values (categorical) on one column?

Re: How to split one dataset too many based on the values (categorical) on one column?

Re: How to split one dataset too many based on the values (categorical) on one column?

Re: How to split one dataset too many based on the values (categorical) on one column?

Re: How to split one dataset too many based on the values (categorical) on one column?

Re: How to split one dataset too many based on the values (categorical) on one column?

Re: How to split one dataset too many based on the values (categorical) on one column?

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away