data step with a huge dataset - message 'lack of resources'
10-24-2012 10:58 AM

I need to run a data step on a dataset that may contain over 1,000 million observations. It seems to be too big for SAS to handle: it gave a 'lack of resources' message and stopped processing. Is there any other way to handle this in SAS?

This is the data I need to create:

data A;
  set B (keep=n1 m1 x1 y1);  *** about 5000 obs in B;
  do n = n1+1 to &total-(m1-1);
    do m = m1+1 to &total-(n1+1) while (n+m <= &total);  *** &total could be 90-100;
      do y = y1 to n while (y1 <= y < n);
        do x = x1 to m while (x1 <= x < m);
        end;
      end;
    end;
  end;
run;
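To see the scale of the problem: for a single input record the four nested loops multiply. A rough sketch of the loop structure in Python (not SAS), using illustrative values n1=m1=2, x1=y1=0 and &total=90, counts the innermost iterations for just one of the roughly 5000 records in B:

```python
# Rough count of innermost iterations for ONE input record of B
# (illustrative values only; the real step runs ~5000 such records).
total, n1, m1, x1, y1 = 90, 2, 2, 0, 0

count = 0
for n in range(n1 + 1, total - (m1 - 1) + 1):
    for m in range(m1 + 1, total - (n1 + 1) + 1):
        if n + m > total:          # the while(n+m <= &total) condition
            break
        # the y and x loops run (n - y1) and (m - x1) times respectively
        count += (n - y1) * (m - x1)

print(count)  # millions of iterations for a single input record
```

Multiplied by ~5000 input records, the step performs billions of innermost iterations, which is consistent with the resource failure.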


10-24-2012 11:12 AM

What are you trying to do? Your data step appears to just go through a data-controlled number of iterations without actually accomplishing anything.


10-24-2012 11:21 AM

I need to calculate some probabilities in the data step (see the following code); I removed the calculations to keep the code in the message above clear.

data A;
  set B (keep=n1 m1 x1 y1);  *** about 5000 obs in B;
  do n = n1+1 to &total-(m1-1);
    do m = m1+1 to &total-(n1+1) while (n+m <= &total);  *** &total could be 90-100;
      do y = y1 to n while (y1 <= y < n);
        do x = x1 to m while (x1 <= x < m);
          term2_p0 = pdf('BINOMIAL', x1, &p01, m1) * pdf('BINOMIAL', y1, &p02, n1)
                   * pdf('BINOMIAL', (x-x1), &p01, (m-m1)) * pdf('BINOMIAL', (y-y1), &p02, (n-n1));
          term2_p1 = pdf('BINOMIAL', x1, &p11, m1) * pdf('BINOMIAL', y1, &p12, n1)
                   * pdf('BINOMIAL', (x-x1), &p11, (m-m1)) * pdf('BINOMIAL', (y-y1), &p12, (n-n1));
        end;
      end;
    end;
  end;
run;
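For reference, SAS's pdf('BINOMIAL', x, p, n) returns the binomial probability mass C(n,x) p^x (1-p)^(n-x). A minimal Python check of one factor of these terms (illustrative values, not taken from the thread):

```python
from math import comb

def binomial_pdf(x, p, n):
    """Binomial probability mass, matching SAS pdf('BINOMIAL', x, p, n)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# One factor of term2_p0 with illustrative values x1=1, p01=0.05, m1=4:
print(binomial_pdf(1, 0.05, 4))
```

Each term2_* value is a product of four such factors; the cost per iteration is tiny, so the failure comes from the sheer number of iterations and output rows, not from the pdf calls themselves.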


10-24-2012 11:32 AM

It would help if you could provide one record that causes the problem, along with the value of the macro variable you used. My initial guess is that you have created an infinite loop which, by definition, will simply eat up all of your resources.

And, since you don't have an output statement in the loop, if the loop isn't infinite, it would simply go through all of the iterations and result in only one record per record processed.


10-24-2012 12:13 PM

This is the code:

%let userN = 40;
%let p01 = 0.05;
%let p02 = 0.05;
%let p11 = 0.25;
%let p12 = 0.25;

data s1_A;
  do n1 = 2 to &userN-2;
    do m1 = 2 to &userN-2 while ((m1 + n1) < &userN-2);
      do y1 = 0 to n1-1;
        do x1 = 0 to m1-1;
          term1_p1   = pdf('BINOMIAL', x1, &p11, m1) * pdf('BINOMIAL', y1, &p12, n1);
          term1_p0   = pdf('BINOMIAL', x1, &p01, m1) * pdf('BINOMIAL', y1, &p02, n1);
          term1_p0p1 = pdf('BINOMIAL', x1, &p01, m1) * pdf('BINOMIAL', y1, &p12, n1);
          output;
        end;
      end;
    end;
  end;
run;

data s2_B;
  set s1_A;
  do n = n1+1 to &userN-(m1+1);
    do m = m1+1 to &userN-(n1+1) while (n+m <= &userN);
      do y = y1 to n while (y1 <= y < n);
        do x = x1 to m while (x1 <= x < m);
          term2_p0   = pdf('BINOMIAL', x1, &p01, m1) * pdf('BINOMIAL', y1, &p02, n1)
                     * pdf('BINOMIAL', (x-x1), &p01, (m-m1)) * pdf('BINOMIAL', (y-y1), &p02, (n-n1));
          term2_p1   = pdf('BINOMIAL', x1, &p11, m1) * pdf('BINOMIAL', y1, &p12, n1)
                     * pdf('BINOMIAL', (x-x1), &p11, (m-m1)) * pdf('BINOMIAL', (y-y1), &p12, (n-n1));
          term2_p0p1 = pdf('BINOMIAL', x1, &p01, m1) * pdf('BINOMIAL', y1, &p12, n1)
                     * pdf('BINOMIAL', (x-x1), &p01, (m-m1)) * pdf('BINOMIAL', (y-y1), &p12, (n-n1));
          output;
        end;
      end;
    end;
  end;
run;


10-24-2012 12:49 PM

How big is your hard drive? Your first data step creates a file with 107,415 records. Just including the first 2 records from that data step in your second data step created a file that was over 22 MB in size. As such, if all of the other records produce approximately the same number of iterations, your resulting file would be approximately 2,246,014 MB in size.

I can't even test that because I don't have that much free space available on my machine.


10-24-2012 11:25 AM

Also, I need to do other calculations based on this data. I tried to split the whole dataset into several subsets, but that didn't work; SAS still can't handle it.


10-24-2012 01:08 PM

1. I ran the earlier code you posted, with some made-up values for the input data, and didn't have any problem. I don't see any heavy use of resources that should cause any problems; it's just a very long process because it does a lot.

2. I agree completely with Art; this job is producing a huge amount of output. I find it hard to imagine that you'll be able to do anything useful with it.

3. Exactly what diagnostic are you receiving? Is it possible to post a piece of the log that contains the message? Do you have any indication of which step is failing, and how many records had been processed when it failed?

4. If not, try to find out where the problem is. As Art was able to run the first step, I assume it's the second step that's the problem. If you insert the following line immediately after your set statement, the log should contain a line for every thousand records read.

if mod(_n_, 1000) = 0 then put _n_ =;

Tom
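The same progress-logging idea, sketched in Python rather than SAS (the record count and interval are illustrative): report every Nth record so the log shows how far a long job got before it died.

```python
# Progress logging: print a marker every `every` records, mirroring
# SAS's  if mod(_n_, 1000) = 0 then put _n_ =;
def process(records, every=1000):
    seen = 0
    for rec in records:
        seen += 1
        if seen % every == 0:
            print(f"_n_={seen}")
        # ... per-record work would go here ...
    return seen

process(range(2500), every=1000)  # prints _n_=1000 and _n_=2000
```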


10-24-2012 01:40 PM

Tom, I'm still going to bet it's the hard drive. At least 1.25 terabytes would be needed.


10-24-2012 02:37 PM

My hard drive is about 200 GB. This is the error in the log file, with userN = 50; userN needs to go up to 80.

ERROR: Write to WORK.S2_A.DATA failed. File is full and may be damaged.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 18914 observations read from the data set WORK.S1_B.
WARNING: The data set WORK.S2_A may be incomplete. When this step was stopped there were
         656636456 observations and 28 variables.
WARNING: Data set WORK.S2_A was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
      real time           3:29:41.39
      cpu time            36:00.33


10-24-2012 02:25 PM

There is no way to use all of the numbers you are generating; I assume you will want to summarize them in some way. You could include logic to do that in the data step so that only the summarized data is output. That should dramatically reduce the size of the dataset you actually need to store.

If you want to summarize using a SAS proc like MEANS/SUMMARY, you could code your data step as a view. That should prevent SAS from creating the giant dataset:

data v1 / view=v1;
  .....
run;

proc summary data=v1 ... ;
run;
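The view trick works because rows are generated on demand and consumed by the summary step without ever being written to disk. The same idea expressed in Python terms (illustrative, not SAS) is a generator feeding an aggregate:

```python
# Streaming aggregation: rows are produced lazily and summarized on the fly,
# so the full cross-product never exists in memory or on disk.
def rows():
    for n in range(1, 1001):
        for m in range(1, 1001):
            yield n * m   # stand-in for the per-row calculation

total = sum(rows())       # one pass; only one row exists at a time
print(total)
```

A million rows are summed here, but peak memory stays constant, which is exactly what the DATA step view buys you at a much larger scale.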


10-24-2012 03:39 PM

Thank you, Tom. I won't use proc summary; I need to calculate a cumulative sum for each observation within each group (same n, m, n1, m1, x, y, x1, y1), and I need to keep only the observations where the cumulative sum is less than a prespecified cutoff.
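That kind of filter can also be done in a streaming pass, so it does not require materializing the full dataset first. A minimal Python sketch with made-up rows and a hypothetical cutoff (in SAS this would be a BY-group data step with a retained running sum):

```python
from itertools import groupby
from operator import itemgetter

# Keep rows whose within-group cumulative sum stays below a cutoff.
# Rows and cutoff are made up; the string key stands in for the
# grouping variables (n, m, n1, m1, x, y, x1, y1).
rows = [("g1", 0.4), ("g1", 0.3), ("g1", 0.5), ("g2", 0.2), ("g2", 0.9)]
cutoff = 1.0

kept = []
for key, grp in groupby(rows, key=itemgetter(0)):  # rows must be sorted by key
    cum = 0.0
    for rec in grp:
        cum += rec[1]
        if cum < cutoff:
            kept.append(rec)

print(kept)
```

Because each group is processed independently with a single running total, this can run over a view or generator of rows and only ever keep the surviving observations.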