Solved: Re: Outputting only certain observations

triley · Posted 02-25-2014 03:38 PM

Overall, I'm trying to create multiple datasets using observations from 1 dataset. So let's say I have 20,000 observations in a dataset. What I'm looking to accomplish is to grab just the first 5000 observations and output that to a particular dataset. Then i want to grab the next 5000 from the original dataset and output that to another dataset. So dataset 'a' would have records 1 - 5000, and dataset 'b' would have 5001 - 10000. I'm sure there is some 1 step process to do this but I can't figure out a simple way. Any help is appreciated!

Feel free to use this (In this case, I would want dataset 'b' to obtain the first 5000, then dateset 'c' to have the next 5000)

data a;

do i = 1 to 20000;

output;

end;

run;

data b;

= first 5000 records;

data c;

= next 5000 records;

Reeza · Posted 02-25-2014 04:33 PM

One way:

I recommend a naming convention though, so you can use the colon operator later to to reference the full set, ie subset_a, subset_b, subset_c can be referenced as subset_:

data subset_a subset_b subset_c subset_d;

set have;

if _n_ < 5000 then output subset_a;

else if _n_<10000 then output subset_b;

...

run;

View solution in original post

Reeza · Posted 02-25-2014 03:47 PM

Search on here and check the following link, there's several1-step solutions to this, with the BEST PRACTICE: DON'T DO IT.

http://www.sascommunity.org/wiki/Split_Data_into_Subsets

ballardw · Posted 02-25-2014 04:16 PM

What will you do with the separate datasets? If I needed to chunk my data up in such a fashion I would be tempted to put in an order number variable and create an appropriate format to allow addressing / analyzing chunks.

triley · Posted 02-25-2014 04:21 PM

I need smaller datasets because further down the line I do iterations off of each record. So if I end up doing 1000 iterations per record, if I have datasets of 5000 records instead of 200,000, it will save me a lot of processing time and it wont create massive tables with 1 billion records.

Astounding · Posted 02-25-2014 04:26 PM

Only you know the contents of your "later application", but it seems likely that it could work by doing this instead of using separate data sets:

set have (firstobs=5001 obs=10000);

ballardw · Posted 02-25-2014 04:29 PM

You can conditionally use the _N_ variable instead of separate data sets.

data junk;

set have;

If 1 le _n_ le 5000 then do;

End;

run;

Reeza · Posted 02-25-2014 04:33 PM

One way:

I recommend a naming convention though, so you can use the colon operator later to to reference the full set, ie subset_a, subset_b, subset_c can be referenced as subset_:

data subset_a subset_b subset_c subset_d;

set have;

if _n_ < 5000 then output subset_a;

else if _n_<10000 then output subset_b;

...

run;

triley · Posted 02-25-2014 04:34 PM

I may not have explained the question well...which I do sometimes. The iteration portion doesn't matter I already have the done and it is not part of what I'm looking to accomplish here.

An incorrect but good example of what I'm trying to do is as follows:

data a1 a2 a3...etc;

set a;

if _n_ <= 5000 then output a1;

if 5000 < _n_ <= 10000 then output a2;

if 10000 < _n_ <= 15000 then output a3;

etc...

run;

yeshwanth · Posted 02-25-2014 05:01 PM

I felt the problem interesting and tried it using a macro.

I took a dataset f1 with 10 observations and divided it into two parts each having 5 observation.

Note: This macro will work for any number of partitions, just specify the total no of data sets needed( in this case 2) and total no of observations in each data set( in this case 5).

If you want to create 50 datasets with 5000 observations replace my values and I believe in a single run you can generate 50 datasets.

options mprint;

data f1;
input id;
cards;
1
2
3
4
5
6
7
8
9
10
;
run;

%macro yesh(a);
%do i=1 %to &a;
%let k=%eval((&i-1)*5); /* Lower Limit*/
%let g=%eval(&i*5); /* Upper limit*/
data a_&i;
set f1;
if _n_<=&g and _n_ >= &k then output;
run;
%end;
%mend;

%yesh(2) /* Total number of partitions*/

Hope this helps !!

Haikuo · Posted 02-25-2014 05:28 PM

Using Hash can give you a true dynamic way of splitting data:

data have;

do i=1 to 45;

output;

end;

run;

data _null_;

declare hash h(ordered:'a');

h.definekey('n');

h.definedata('i');

h.definedone();

do n=1 to 10 until (last);

set have end=last;

h.replace();

end;

h.output(dataset:cats('want',_n_));

run;

Haikuo

Registration is open

SAS Training: Just a Click Away