BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
triley
Obsidian | Level 7

Overall, I'm trying to create multiple datasets using observations from 1 dataset. So let's say I have 20,000 observations in a dataset. What I'm looking to accomplish is to grab just the first 5000 observations and output that to a particular dataset. Then i want to grab the next 5000 from the original dataset and output that to another dataset. So dataset 'a' would have records 1 - 5000, and dataset 'b' would have 5001 - 10000. I'm sure there is some 1 step process to do this but I can't figure out a simple way. Any help is appreciated!

Feel free to use this (In this case, I would want dataset 'b' to obtain the first 5000, then dateset 'c' to have the next 5000)

data a;

do i = 1 to 20000;

output;

end;

run;

data b;

= first 5000 records;

data c;

= next 5000 records;

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

One way:

I recommend a naming convention though, so you can use the colon operator later to to reference the full set, ie subset_a, subset_b, subset_c can be referenced as subset_:

data subset_a subset_b subset_c subset_d;

set have;

if _n_ < 5000 then output subset_a;

else if _n_<10000 then output subset_b;

...

run;

View solution in original post

9 REPLIES 9
Reeza
Super User

Search on here and check the following link, there's several1-step solutions to this, with the BEST PRACTICE: DON'T DO IT.

http://www.sascommunity.org/wiki/Split_Data_into_Subsets

ballardw
Super User

What will you do with the separate datasets? If I needed to chunk my data up in such a fashion I would be tempted to put in an order number variable and create an appropriate format to allow addressing / analyzing chunks.

triley
Obsidian | Level 7

I need smaller datasets because further down the line I do iterations off of each record. So if I end up doing 1000 iterations per record, if I have datasets of 5000 records instead of 200,000, it will save me a lot of processing time and it wont create massive tables with 1 billion records.

Astounding
PROC Star

Only you know the contents of your "later application", but it seems likely that it could work by doing this instead of using separate data sets:

set have (firstobs=5001 obs=10000);

ballardw
Super User

You can conditionally use the _N_ variable instead of separate data sets.

data junk;

     set have;

If 1 le _n_ le 5000 then do;

     <code that does what ever you mean by iterations off of each record>

End;

run;

Reeza
Super User

One way:

I recommend a naming convention though, so you can use the colon operator later to to reference the full set, ie subset_a, subset_b, subset_c can be referenced as subset_:

data subset_a subset_b subset_c subset_d;

set have;

if _n_ < 5000 then output subset_a;

else if _n_<10000 then output subset_b;

...

run;

triley
Obsidian | Level 7

I may not have explained the question well...which I do sometimes. The iteration portion doesn't matter I already have the done and it is not part of what I'm looking to accomplish here.

An incorrect but good example of what I'm trying to do is as follows:

data a1 a2 a3...etc;

set a;

if _n_ <= 5000 then output a1;

if 5000 < _n_ <= 10000 then output a2;

if 10000 < _n_ <= 15000 then output a3;

etc...

run;

yeshwanth
Fluorite | Level 6

I felt the problem interesting and tried it using a macro.

I took a dataset f1 with 10 observations and divided it into two parts each having 5 observation.

Note: This macro will work for any number of partitions, just specify the total no of data sets needed( in this case 2) and total no of observations in each data set( in this case 5).

          If you want to create 50 datasets with 5000 observations replace my values and I believe in a single run you can generate 50 datasets.

options mprint;

data f1;
input id;
cards;
1
2
3
4
5
6
7
8
9
10
;
run;

%macro yesh(a);
%do i=1 %to &a;
%let k=%eval((&i-1)*5);  /* Lower Limit*/
%let g=%eval(&i*5);  /* Upper limit*/
data a_&i;
set f1;
if _n_<=&g and _n_ >= &k then output;
run;
%end;
%mend;

%yesh(2)  /* Total number of partitions*/

Hope this helps !!

Haikuo
Onyx | Level 15

Using Hash can give you a true dynamic way of splitting data:

data have;

  do i=1 to 45;

    output;

  end;

run;

data _null_;

  declare hash h(ordered:'a');

  h.definekey('n');

  h.definedata('i');

  h.definedone();

  do n=1 to 10 until (last);

     set have end=last;

       h.replace();

  end;

    h.output(dataset:cats('want',_n_));

run;

Haikuo

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1306 views
  • 6 likes
  • 6 in conversation