BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Phil_NZ
Barite | Level 11

Hi all SAS Users,

 

These two days I deal with a 128 GB file, and I fell super hopeless because just a very simple code, it takes around 30 minutes and still be there (around 12 million obs in this file). So now I decide to cut it into many small pieces and deal with each of them.

 

I have three questions here:

1. How to check with how many observations, my computer can perform reasonably, or just assign a number and test until we feel the computer being slow gradually?

 

2. How to subsample from observation N1 to N2.

For example, the code to retrieve the first 10000 obs is as below:

data want;
  set have (obs=10000);
run;

So, what if I want a new dataset from obs 10001 to 20000, what code I should perform?

3. A further question of my curiosity only, I have 11 million observations in the file have, is there any way to split this whole dataset to many subsamples that each subsample contains 1 million observation?

 

Warmest regards.

 

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.
1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

@Phil_NZ wrote:

Hi all SAS Users,

....

2. How to subsample from observation N1 to N2.

For example, the code to retrieve the first 10000 obs is as below:

data want;
  set have (obs=10000);
run;

So, what if I want a new dataset from obs 10001 to 20000, what code I should perform?

  

data want;
  set have (firstobs=10001 obs=20000);
run;

 

 

3. A further question of my curiosity only, I have 11 million observations in the file have, is there any way to split this whole dataset to many subsamples that each subsample contains 1 million observation?

You could define 11 data set VIEWS.  They take no space but would (presumably) save you lots of typing when analyzing 1,000,000 obs at a time.

 

data have1 /view=have1;
  set have (obs=1000000);
run;
data have2 /view=have2;
  set have (firstobs=1000001 obs=2000000);
run;
.....
data have 11/view=have11;
  set have (firstobs=10000001);
run;

Then you can selectively run you analyses on HAVE1, HAVE2, etc.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

2 REPLIES 2
mkeintz
PROC Star

@Phil_NZ wrote:

Hi all SAS Users,

....

2. How to subsample from observation N1 to N2.

For example, the code to retrieve the first 10000 obs is as below:

data want;
  set have (obs=10000);
run;

So, what if I want a new dataset from obs 10001 to 20000, what code I should perform?

  

data want;
  set have (firstobs=10001 obs=20000);
run;

 

 

3. A further question of my curiosity only, I have 11 million observations in the file have, is there any way to split this whole dataset to many subsamples that each subsample contains 1 million observation?

You could define 11 data set VIEWS.  They take no space but would (presumably) save you lots of typing when analyzing 1,000,000 obs at a time.

 

data have1 /view=have1;
  set have (obs=1000000);
run;
data have2 /view=have2;
  set have (firstobs=1000001 obs=2000000);
run;
.....
data have 11/view=have11;
  set have (firstobs=10000001);
run;

Then you can selectively run you analyses on HAVE1, HAVE2, etc.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ballardw
Super User

1) No idea, your hardware, your network limitations

 

2) See if this gives you an idea:

data junk1;
   set sashelp.class (firstobs=1 obs=5);
run;
data junk2;
   set sashelp.class (firstobs=6 obs=10);
run;
data junk3;
   set sashelp.class (firstobs=11 obs=15);
run;

/*or*/

data junk1
     junk2
     junk3
;
   if 1 le _n_ le 5 then output junk1;
   else if 6 le _n_ le 10 then output junk2;
   else if 11 le _n_ le 15 then output junk3;
run;

Not saying this is a good idea.

 

It would make much more sense to subset data by some characteristic.

 

3. this is the same question as 2 just phrased differently.

 

 

Do you need every single variable in the data? Reducing variables may make more sense as then all the values of the specified variables are available. Select the variables needed to completely identify the records and then the ones likely needed for specific analysis.

 

It may be easier when the inevitable question about "what if" comes up to add in a couple of additional variables for an analysis than having to find them from multiple data sets.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 1026 views
  • 2 likes
  • 3 in conversation