BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
georgel
Quartz | Level 8

Hi 

 

I am trying to split the data set into many here is the data have :

data have;
input cusip_id Name $10. date_issue :ddmmyy10. type $10.;
format date_issue ddmmyy10.;
datalines;
10343452 Zingas    01/04/2016 bench
10343489 Stephanie 01/04/2014 main
10343445 Olivia    01/04/2012 main
10343456 Amstel    01/04/2011 bench
10343457 KEO       01/04/2017 main
;
run;

My solution: 

data data_main;
       set have;
       If  type="main";
run;

data data_bech;
       set have;
       If  type="bench";
run;

This is in fact an awkard/cumbersome solution. I would like to utilize a more efficient algorithm for advanced SAS programming to avoid repetition of code. e.g. using proc sql distinct to set the values in a macro variable

Please I am looking forward from your prompt reply

1 ACCEPTED SOLUTION

Accepted Solutions
singhsahab
Lapis Lazuli | Level 10
proc sql noprint;
select count(distinct(type)) into :c from have;
select distinct type into :typ1 - :typ%trim(%left(&c)) from have;
quit;

%macro create_table;
%do i=1 %to &c;

 data &&typ&i;
 set have ;
 if upcase(type) eq %upcase("&&typ&i");
 run;

%end;
%mend;

%create_table;

View solution in original post

18 REPLIES 18
RichardDeVen
Barite | Level 11

Generally you won't want to split a data set in such a manner.  Learning to judiciously code BY, CLASS, and WHERE statements will give you great control over slicing and dicing in the reporting and analytic procedures you use.

 

Depending on the size of the original file, you can use a hash of hash approach to dynamically split a table into value named sub tables.

 

From samples page

/* Richard A. DeVenezia
 * www.devenezia.com
 * 25May2004
 *
 * Similar to material originally posted to SAS-L on May 24, 2004
 * "Hash of hashes and dynamic output"
 */

* sample data with several groups;

data have0;
  do group = 'a','c','d','e';
    do _n_ = 1 to 100;
      date = today() - rand('integer', 365);
      satellite_1 + 1;
      satellite_2 = 0;
      satellite_A = 'foo bar';
      output;
    end;
  end;
  format date date9.;
run;

* disorder for good measure;

proc sql;
  create table have as select * from have0 order by ranuni(1);
quit;

* split one data set into several data sets named according to variable group;

data _null_;
  declare hash h0 ();
  h0.defineKey ('group');
  h0.defineData ('group', 'h');  * hash of hashes, data element h is a hash object;
  h0.defineDone ();

  declare hash h;

  do _n_ = 1 by 1 until (endOfDataset);
    set have end = endOfDataset;

    if group ne lag(group) then 
      if 0 ne h0.find() then do;
        * create a hash for the group that will be a sub data set;
        h = _new_ hash (ordered:'a');
        h.defineKey ('_n_');
        h.defineData ('group', 'date', 'satellite_1', 'satellite_2', 'satellite_A');
        h.defineDone ();
        h0.add();
      end;

    * add data to the groups hash;
    h.replace ();
  end;

  declare hiter hi ('h0');

  * create one table for each group;
  do while (hi.next() = 0);
    h.output (dataset:'group_'||group);
  end;
run;

Other approaches would use a pre-step to generate source code that is a wallpaper of if/then or select/when statements that output to sub tables named according to group value. The pre step would either put the source code in macro variable(s) or write the source code to a file that is %included.

PaigeMiller
Diamond | Level 26

Unless the size of the data set is humongous, keeping it as one is usually a better approach. Then you do the next steps (analysis, reporting, plotting) using a BY statement, or excluding specific categories from the analysis.

--
Paige Miller
Kurt_Bremser
Super User

The static solution is this:

data
  data_main
  data_bench
;
set have;
select (type);
  when ("main") output data_main;
  when ("bench") output data_bench;
  otherwise;
end;
run;

This can be made dynamic:

proc sort
  data=have (keep=type)
  out=control
  nodupkey
;
by type;
run;

filename incfile temp;

data _null_;
file incfile;
length line $100;
put "data";
do until eof1;
  set control end=eof1;
  line = " data_" !! type;
  put line;
end:
put ";";
put "select (type);";
do until eof2;
  set control end=eof2;
  line = "when ('" !! strip(type) !! "') output data_" !! strip(type) !! ";";
  put line;
end;
put "end;run;";
run;

%inc incfile;

Untested, posted from my tablet.

singhsahab
Lapis Lazuli | Level 10
proc sql noprint;
select count(distinct(type)) into :c from have;
select distinct type into :typ1 - :typ%trim(%left(&c)) from have;
quit;

%macro create_table;
%do i=1 %to &c;

 data &&typ&i;
 set have ;
 if upcase(type) eq %upcase("&&typ&i");
 run;

%end;
%mend;

%create_table;
Tom
Super User Tom
Super User

There is no need to run the query twice just to count how many values it returns.  PROC SQL will count them for you.

proc sql noprint;
  select distinct type into :typ1 -  from have;
%let c=&sqlobs;
quit;
georgel
Quartz | Level 8
Tom thank you so much for the simplification
georgel
Quartz | Level 8
ABSOLUTELY PERFECT! THANKS A MILLION singhsahab
Kurt_Bremser
Super User

Please note that @singhsahab 's solution will read the.whole "have" dataset for every distinct value of type. With a large dataset and a high cardinality, this can become VERY time consuming.

My solution reads "have" only twice (sort and the dynamically created data step), regardless of the cardinality of type.

yabwon
Onyx | Level 15

If I may, I would do it with one data read with help of the SELECT statement:

data have;
input cusip_id Name $10. date_issue :ddmmyy10. type $10.;
format date_issue ddmmyy10.;
datalines;
10343452 Zingas    01/04/2016 bench
10343489 Stephanie 01/04/2014 main
10343445 Olivia    01/04/2012 main
10343456 Amstel    01/04/2011 bench
10343457 KEO       01/04/2017 main
;
run;

proc sql noprint;
select distinct type 
into :typ1 - 
from have;
%let c = &sqlobs.;
quit;

%macro create_table_with_1_read();

 data 
  %do i=1 %to &c;
    &&typ&i
  %end;
  ;
 set have ;

  select;
    %do i=1 %to &c;
      when (upcase(type) eq %upcase("&&typ&i")) output &&typ&i;
    %end;
    otherwise put "ERROR: unknown type";
  end;
 run;

%mend create_table_with_1_read;

%create_table_with_1_read()

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



georgel
Quartz | Level 8
Bart Many thanks for the solution. I have tested it. It works perfectly.
RichardDeVen
Barite | Level 11

Thinking in the large, if you order the types by frequency count descending you will have the most common true whens first, reducing the number of tests that are done in your submitted codegen.

 

proc sql noprint;
select type
into :typ1-
from
( select type, count(*) as freq
  from have
  group by type
)
order by freq desc
;
%let c = &sqlobs.;
quit;

%put &=typ1;  *main (freq=3);
%put &=typ2;  *bench (freq=2);

Yanni_C
Fluorite | Level 6

From the question above, if I want to get the datasets with name 'dataset1, dataset2, dataset3,……,dataset100', rather than naming the datasets by their observation's names. Could you please tell me how should I modify the code?

Kurt_Bremser
Super User

First question, as always in such cases: why do you think it necessary to split your dataset like this?

Second, what would be the rule for putting particular source observations into a particular target dataset?

Yanni_C
Fluorite | Level 6

1. Since when I try to do rolling regression, there will be a 'sort failed' error, maybe because the amount of data is too large, 

Yanni_C_0-1668232863863.png

2. I want to split the dataset based on the different observations in 'characteristics', but I don't want to name the dataset using the name of the observations, I want to use dataset1, dataset2, dataset3,...

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 18 replies
  • 2798 views
  • 16 likes
  • 9 in conversation