Suppose an original data has a class variable
data temp;
input CLASS VAR;
datalines;
1 1
1 2
1 .
2 3
2 6
;
run;
How can I create sub-datasets based on the CLASS variable in a rapid way? The above data should produce two sub-datasets that look like:
- Sub-dataset1:
CLASS VAR
1 1
1 2
1 .
- Sub-dataset2:
CLASS VAR
2 3
2 6
Okay, I see solutions that use MACROs, that use CALL EXECUTE, that use hash(), but I don't see a simple data step solution, such as:
data subdataset1 subdataset2;
input CLASS VAR;
if class=1 then output subdataset1;
else if class=2 then output subdataset2;
datalines;
1 1
1 2
1 .
2 3
2 6
;
run;
What am I missing? This works perfectly. Why resort to complicated SAS structures for what is obviously a beginner problem, when the most basic SAS structure (a data set) works fine?
One of a dynamic options could be Hash(), if your data is sorted by class or at least clustered by class:
data temp;
input CLASS VAR;
datalines;
1 1
1 2
1 .
2 3
2 6
;
run;
data _null_;
if _n_=1 then do;
declare hash h();
h.definekey('_n_');
h.definedata('class','var');
h.definedone();
end;
set temp;
by class notsorted;
rc=h.replace();
if last.class then do;
rc=h.output(dataset:cats('class_',class));
rc=h.clear();
end;
run;
Haikuo
I find this code very helpful. by the way, suppose the data set "temp" has one more variable named year, how to make the final datasets ordered by year? eg., the final datasets class_1 and class_2 are ordered by year. thanks!
You could also use:
DATA _NULL_;
SET HAVE END=EOF;
BY CLASS;
IF FIRST.CLASS;
CALL EXECUTE(COMPBL("DATA CLASS_" || COMPRESS(CLASS) || ";
SET HAVE (WHERE = (CLASS =" || CLASS || "));
RUN;"));
RUN;
@Scott, Did I miss something? Wa there a change in Call execute...
http://www.sascommunity.org/wiki/Call_Execute_in_the_Data_Step
@Jaap,
I am not sure what you are querying. If you run the previously attached code you will see the following in the log:
14 GOPTIONS ACCESSIBLE;
15 DATA WANT;
16 SET HAVE END=EOF;
17 BY CLASS;
18 IF FIRST.CLASS;
19 CALL EXECUTE(COMPBL("DATA CLASS_" || COMPRESS(CLASS) || ";
20 SET HAVE (WHERE = (CLASS =" || CLASS || "));
21 RUN;"));
22 RUN;
NOTE: Numeric values have been converted to character values at the places given by:
(Line):(Column).
19:48 20:55
NOTE: There were 50 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 10 observations and 501 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
NOTE: CALL EXECUTE generated line.
1 + DATA CLASS_1; SET HAVE (WHERE = (CLASS = 1)); RUN;
NOTE: There were 5 observations read from the data set WORK.HAVE.
WHERE class="1";
NOTE: The data set WORK.CLASS_1 has 5 observations and 501 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
2 + DATA CLASS_2; SET HAVE (WHERE = (CLASS = 2)); RUN;
NOTE: There were 5 observations read from the data set WORK.HAVE.
WHERE class="2";
NOTE: The data set WORK.CLASS_2 has 5 observations and 501 variables.
This suggests that each datastep created via CALL EXECUTE is executed following the completion of the _NULL_ datastep, which is what this article is pointing out.
@ Scott, Ok I see your approach.
It is generating possible a lot of new data-step-s each selecting one part and creating one datastep each.
I was thinking of a lot of SAS-datasets being created while reading, in one pass!, the input (performance reasons).
This feature of splitting data in sas-data step is rather unique.
That is requiring:
- a list of the to be created datasets (defnition part)
- a list of the output statements in a selection
Here is my try by using macro...
%macro test;
proc sql noprint;
select distinct class into :list separated by "?"
from have;
select count(distinct class) into :tot
from have;
quit;
%do i = 1 %to &tot.;
data data&i.;
set have;
if class = %scan(&list.,&i.,"?");
run;
%end;
%mend;
%test
-Urvish
Okay, I see solutions that use MACROs, that use CALL EXECUTE, that use hash(), but I don't see a simple data step solution, such as:
data subdataset1 subdataset2;
input CLASS VAR;
if class=1 then output subdataset1;
else if class=2 then output subdataset2;
datalines;
1 1
1 2
1 .
2 3
2 6
;
run;
What am I missing? This works perfectly. Why resort to complicated SAS structures for what is obviously a beginner problem, when the most basic SAS structure (a data set) works fine?
Paigemiller, You are missing nothing, perhaps some minor requirements.
- Are the subclasses known? If No than is needed some addtional logic.
- Is the question about quick and easy coding or about a well performing behavior on bigger datasets.
- ... (minor)
Your approach is the most easy to understand that will deliver reliable and the best performance.
Indeed why should it made more complicated as needed. KISS.
It's not very dynamic, is it...?
A version of Urvish macro solution which should be slightly more efficient (fewer passes through the data):
%macro test;
proc sql noprint;
select distinct catx('_','T',class),
class
into :table_list separated by ' ', :list separated by ' '
from temp;
select count(distinct class) into :tot
from temp;
quit;
data &table_list.;
set temp;
select(class);
%do i = 1 %to &tot.;
when(%scan(&list.,&i.)) output %scan(&table_list.,&i);
%end;
otherwise;
end;
run;
%mend;
%test
It's not very dynamic, is it...?
At no point did the original question ask for a dynamic solution.
A version of Urvish macro solution which should be slightly more efficient (fewer passes through the data):
I'm not grasping why you say this. My solution requires one pass through the data. Your solution requires n passes through the data, if there are n different values of CLASS.
I agree that all of the above solutions work. Since it appears to me that a beginner has asked a relative simple question, I cannot see the benefit of a complex answer when a simple answer suffices. I think that's called Occam's toothbrush.
We often urge to have questions simplified so that they could be easily understood. The real life problem often has much more complex data. Therefore I suggested a more generic solution.
I did mean that my macro was slightly more efficient than Urvish (maybe I should put that in a separate post, sorry if I seemed rude). My code has 3 passes of the data, since Urvish had 2 + no distinct values of CLASS.
Razor :smileyconfused:
Aggreed, its not without limitation. Let me relabel it Fairly generic. ..
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.