jojo
Obsidian | Level 7


I need to run a data step for a dataset that may contain over 1,000 million observations. It seems to be too large for SAS to handle: it gave a 'lack of resources' message and stopped processing. Is there any other way to handle this in SAS?

This is the data I need to create:

data A;
   set B (keep=n1 m1 x1 y1);   *** about 5000 obs in B;
   do n = n1+1 to &total-(m1-1);
      do m = m1+1 to &total-(n1+1) while (n+m <= &total);   *** &total could be 90-100;
         do y = y1 to n while (y1 <= y < n);
            do x = x1 to m while (x1 <= x < m);
            end;
         end;
      end;
   end;
run;

art297
Opal | Level 21

What are you trying to do?  Your data step appears to just go through a data-controlled number of iterations without actually accomplishing anything.

jojo
Obsidian | Level 7

I need to calculate some probabilities in the data step (see the following code); I removed the calculations to keep the code in the message above clear.

data A;
   set B (keep=n1 m1 x1 y1);   *** about 5000 obs in B;
   do n = n1+1 to &total-(m1-1);
      do m = m1+1 to &total-(n1+1) while (n+m <= &total);   *** &total could be 90-100;
         do y = y1 to n while (y1 <= y < n);
            do x = x1 to m while (x1 <= x < m);
               term2_p0 = pdf('BINOMIAL', x1, &p01, m1) * pdf('BINOMIAL', y1, &p02, n1)
                        * pdf('BINOMIAL', x-x1, &p01, m-m1) * pdf('BINOMIAL', y-y1, &p02, n-n1);
               term2_p1 = pdf('BINOMIAL', x1, &p11, m1) * pdf('BINOMIAL', y1, &p12, n1)
                        * pdf('BINOMIAL', x-x1, &p11, m-m1) * pdf('BINOMIAL', y-y1, &p12, n-n1);
            end;
         end;
      end;
   end;
run;

art297
Opal | Level 21

It would help if you could provide one record that causes the problem, along with the value of the macro variable you used.  My initial guess is that you have created an infinite loop which, by definition, will simply eat up all of your resources.

And, since you don't have an OUTPUT statement in the loop, if the loop isn't infinite it would simply run through all of the iterations and produce only one record per input record.
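
For illustration, a made-up example (nothing to do with your variables) of what an explicit OUTPUT statement changes inside a DO loop:

data one_per_step;
   do i = 1 to 10;
      sq = i*i;
   end;
   /* no OUTPUT statement: the implicit output at the bottom of the step
      writes a single record, with i=11 and sq=100 */
run;

data one_per_iteration;
   do i = 1 to 10;
      sq = i*i;
      output;   /* explicit OUTPUT: ten records, one per loop iteration */
   end;
run;

In the posted step there is no OUTPUT statement, so each of the roughly 5,000 input records from B contributes exactly one output record, holding whatever values the loop counters had on exit.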

jojo
Obsidian | Level 7

This is the code:

%let userN = 40;
%let p01 = 0.05;
%let p02 = 0.05;
%let p11 = 0.25;
%let p12 = 0.25;

data s1_A;
   do n1 = 2 to &userN-2;
      do m1 = 2 to &userN-2 while ((m1 + n1) < &userN-2);
         do y1 = 0 to n1-1;
            do x1 = 0 to m1-1;
               term1_p1   = pdf('BINOMIAL', x1, &p11, m1) * pdf('BINOMIAL', y1, &p12, n1);
               term1_p0   = pdf('BINOMIAL', x1, &p01, m1) * pdf('BINOMIAL', y1, &p02, n1);
               term1_p0p1 = pdf('BINOMIAL', x1, &p01, m1) * pdf('BINOMIAL', y1, &p12, n1);
               output;
            end;
         end;
      end;
   end;
run;

data s2_B;
   set s1_A;
   do n = n1+1 to &userN-(m1+1);
      do m = m1+1 to &userN-(n1+1) while (n+m <= &userN);
         do y = y1 to n while (y1 <= y < n);
            do x = x1 to m while (x1 <= x < m);
               term2_p0   = pdf('BINOMIAL', x1, &p01, m1) * pdf('BINOMIAL', y1, &p02, n1)
                          * pdf('BINOMIAL', x-x1, &p01, m-m1) * pdf('BINOMIAL', y-y1, &p02, n-n1);
               term2_p1   = pdf('BINOMIAL', x1, &p11, m1) * pdf('BINOMIAL', y1, &p12, n1)
                          * pdf('BINOMIAL', x-x1, &p11, m-m1) * pdf('BINOMIAL', y-y1, &p12, n-n1);
               term2_p0p1 = pdf('BINOMIAL', x1, &p01, m1) * pdf('BINOMIAL', y1, &p12, n1)
                          * pdf('BINOMIAL', x-x1, &p01, m-m1) * pdf('BINOMIAL', y-y1, &p12, n-n1);
               output;
            end;
         end;
      end;
   end;
run;

art297
Opal | Level 21

How big is your hard drive?  Your first data step creates a file with 107,415 records.  Just including the first 2 records from that data step in your second data step created a file that was over 22 MB in size.  As such, if all of the other records produce approximately the same number of iterations, your resulting file would be approximately 2,246,014 MB in size.

I can't even test that because I don't have that much free space available on my machine.

jojo
Obsidian | Level 7

I also need to do other calculations based on this data. I tried to split the whole data into several subsets, but that isn't working; SAS still can't handle it.

TomKari
Onyx | Level 15

1. I ran the earlier code you posted, with some made-up values for the input data, and didn't have any problem. I don't see any heavy use of resources that should cause trouble; it's just a very long-running process because it does a lot of work.

2. I agree completely with Art; this job is producing a huge amount of output. I find it hard to imagine that you'll be able to do anything useful with it.

3. Exactly what diagnostic are you receiving? Is it possible to post a piece of the log that contains the message? Do you have any indication of which step is failing, and how many records had been processed when it failed?

4. If not, try to find out where the problem is. As Art was able to run the first step, I assume it's the second step that's the problem. If you insert the following line immediately after your set statement, the log should contain a line for every thousand records read.

if mod(_n_, 1000) = 0 then put _n_ =;

Tom

art297
Opal | Level 21

Tom, I'm still going to bet it's the hard drive. At least 1.25 terabytes would be needed.

jojo
Obsidian | Level 7

 

My hard drive is about 200 GB. This is the error in the log; here userN = 50, and userN needs to go up to 80.

ERROR: Write to WORK.S2_A.DATA failed. File is full and may be damaged.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 18914 observations read from the data set WORK.S1_B.
WARNING: The data set WORK.S2_A may be incomplete. When this step was stopped there were
         656636456 observations and 28 variables.
WARNING: Data set WORK.S2_A was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
      real time           3:29:41.39
      cpu time            36:00.33

Tom
Super User

There is no way to use all of the numbers you are generating.  I assume that you will want to summarize them in some way.  You could include logic to do that in the data step so that only the summarized data is output.  You should be able to dramatically reduce the size of the dataset that you actually need to store.

If you want to summarize using a SAS proc like MEANS/SUMMARY then you could code your data step as a view. That should prevent SAS from creating the giant dataset.

data v1 / view=v1;
   .....
run;

proc summary data=v1 ... .;
run;
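
For example (a sketch only; the nested loops are copied from the second step above, and the CLASS, VAR, and statistic choices are placeholders for whatever summary is actually needed):

data v1 / view=v1;
   set s1_A;
   do n = n1+1 to &userN-(m1+1);
      do m = m1+1 to &userN-(n1+1) while (n+m <= &userN);
         do y = y1 to n while (y1 <= y < n);
            do x = x1 to m while (x1 <= x < m);
               term2_p0 = pdf('BINOMIAL', x1, &p01, m1) * pdf('BINOMIAL', y1, &p02, n1)
                        * pdf('BINOMIAL', x-x1, &p01, m-m1) * pdf('BINOMIAL', y-y1, &p02, n-n1);
               term2_p1 = pdf('BINOMIAL', x1, &p11, m1) * pdf('BINOMIAL', y1, &p12, n1)
                        * pdf('BINOMIAL', x-x1, &p11, m-m1) * pdf('BINOMIAL', y-y1, &p12, n-n1);
               output;
            end;
         end;
      end;
   end;
run;

proc summary data=v1 nway;
   class n1 m1;               /* placeholder grouping */
   var term2_p0 term2_p1;     /* placeholder analysis variables */
   output out=want sum=;
run;

Because the proc reads straight from the view, the detail records are generated one at a time and passed to PROC SUMMARY without the giant dataset ever being written to WORK.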

jojo
Obsidian | Level 7

Thank you, Tom. I won't do PROC SUMMARY, but I need to calculate a cumulative sum for each observation within each group (same n, m, n1, m1, x, y, x1, y1) and keep only those observations where the cumulative sum is less than a prespecified cutoff.
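
A minimal sketch of doing that inside the same step (the cutoff value, the choice to cumulate term2_p0, and the level at which the running sum restarts are all assumptions, so adjust them to the real grouping):

%let cutoff = 0.95;   /* hypothetical cutoff value */

data s2_small;
   set s1_A;
   do n = n1+1 to &userN-(m1+1);
      do m = m1+1 to &userN-(n1+1) while (n+m <= &userN);
         cum_p0 = 0;   /* restart the running sum for each n/m combination (assumed grouping) */
         do y = y1 to n while (y1 <= y < n);
            do x = x1 to m while (x1 <= x < m);
               term2_p0 = pdf('BINOMIAL', x1, &p01, m1) * pdf('BINOMIAL', y1, &p02, n1)
                        * pdf('BINOMIAL', x-x1, &p01, m-m1) * pdf('BINOMIAL', y-y1, &p02, n-n1);
               cum_p0 = cum_p0 + term2_p0;
               if cum_p0 < &cutoff then output;   /* keep only rows below the cutoff */
            end;
         end;
      end;
   end;
run;

Only the rows that pass the cutoff reach the output dataset, so the disk footprint depends on the cutoff rather than on the full loop space.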

