Solved
Contributor
Posts: 50

Outputting only certain observations

Overall, I'm trying to create multiple datasets using observations from 1 dataset. So let's say I have 20,000 observations in a dataset. What I'm looking to accomplish is to grab just the first 5000 observations and output that to a particular dataset. Then i want to grab the next 5000 from the original dataset and output that to another dataset. So dataset 'a' would have records 1 - 5000, and dataset 'b' would have 5001 - 10000. I'm sure there is some 1 step process to do this but I can't figure out a simple way. Any help is appreciated!

Feel free to use this (In this case, I would want dataset 'b' to obtain the first 5000, then dateset 'c' to have the next 5000)

data a;

do i = 1 to 20000;

output;

end;

run;

data b;

= first 5000 records;

data c;

= next 5000 records;

Accepted Solutions
Solution
‎02-25-2014 04:33 PM
Super User
Posts: 23,778

Re: Outputting only certain observations

One way:

I recommend a naming convention though, so you can use the colon operator later to to reference the full set, ie subset_a, subset_b, subset_c can be referenced as subset_:

data subset_a subset_b subset_c subset_d;

set have;

if _n_ < 5000 then output subset_a;

else if _n_<10000 then output subset_b;

...

run;

All Replies
Super User
Posts: 23,778

Re: Outputting only certain observations

Search on here and check the following link, there's several1-step solutions to this, with the BEST PRACTICE: DON'T DO IT.

http://www.sascommunity.org/wiki/Split_Data_into_Subsets

Super User
Posts: 13,583

Re: Outputting only certain observations

What will you do with the separate datasets? If I needed to chunk my data up in such a fashion I would be tempted to put in an order number variable and create an appropriate format to allow addressing / analyzing chunks.

Contributor
Posts: 50

Re: Outputting only certain observations

I need smaller datasets because further down the line I do iterations off of each record. So if I end up doing 1000 iterations per record, if I have datasets of 5000 records instead of 200,000, it will save me a lot of processing time and it wont create massive tables with 1 billion records.

Super User
Posts: 6,785

Re: Outputting only certain observations

Only you know the contents of your "later application", but it seems likely that it could work by doing this instead of using separate data sets:

set have (firstobs=5001 obs=10000);

Super User
Posts: 13,583

Re: Outputting only certain observations

You can conditionally use the _N_ variable instead of separate data sets.

data junk;

set have;

If 1 le _n_ le 5000 then do;

<code that does what ever you mean by iterations off of each record>

End;

run;

Solution
‎02-25-2014 04:33 PM
Super User
Posts: 23,778

Re: Outputting only certain observations

One way:

I recommend a naming convention though, so you can use the colon operator later to to reference the full set, ie subset_a, subset_b, subset_c can be referenced as subset_:

data subset_a subset_b subset_c subset_d;

set have;

if _n_ < 5000 then output subset_a;

else if _n_<10000 then output subset_b;

...

run;

Contributor
Posts: 50

Re: Outputting only certain observations

I may not have explained the question well...which I do sometimes. The iteration portion doesn't matter I already have the done and it is not part of what I'm looking to accomplish here.

An incorrect but good example of what I'm trying to do is as follows:

data a1 a2 a3...etc;

set a;

if _n_ <= 5000 then output a1;

if 5000 < _n_ <= 10000 then output a2;

if 10000 < _n_ <= 15000 then output a3;

etc...

run;

Contributor
Posts: 22

Re: Outputting only certain observations

I felt the problem interesting and tried it using a macro.

I took a dataset f1 with 10 observations and divided it into two parts each having 5 observation.

Note: This macro will work for any number of partitions, just specify the total no of data sets needed( in this case 2) and total no of observations in each data set( in this case 5).

If you want to create 50 datasets with 5000 observations replace my values and I believe in a single run you can generate 50 datasets.

options mprint;

data f1;
input id;
cards;
1
2
3
4
5
6
7
8
9
10
;
run;

%macro yesh(a);
%do i=1 %to &a;
%let k=%eval((&i-1)*5);  /* Lower Limit*/
%let g=%eval(&i*5);  /* Upper limit*/
data a_&i;
set f1;
if _n_<=&g and _n_ >= &k then output;
run;
%end;
%mend;

%yesh(2)  /* Total number of partitions*/

Hope this helps !!

Posts: 3,167

Re: Outputting only certain observations

Using Hash can give you a true dynamic way of splitting data:

data have;

do i=1 to 45;

output;

end;

run;

data _null_;

declare hash h(ordered:'a');

h.definekey('n');

h.definedata('i');

h.definedone();

do n=1 to 10 until (last);

set have end=last;

h.replace();

end;

h.output(dataset:cats('want',_n_));

run;

Haikuo

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
• 9 replies
• 298 views
• 6 likes
• 6 in conversation