Hi Folks, I have a workspace issue.
I ran SAS overnight to merge Z.ALL_SITES (76.4 GB) and Z.DATA_LARGE (560 MB). SAS crashed with "out of memory" this morning. I cleaned the H and Z folders except for the datasets needed. Folder H has 1.96 TB of free space and folder Z has 490 GB of free space.
I'd appreciate any suggestions.
Thanks a zillion in advance. This has become a huge roadblock; I can't move forward on my project because of this memory issue.
DATA H.MERGED;
   MERGE
      Z.ALL_SITES (IN=A)
      Z.DATA_LARGE (IN=B);
   BY ID;
   IF A AND B;
RUN;
Maxim 3: Know Your Data. Inspect your input datasets for large columns, and check whether those datasets are stored with compress=yes.
If you are unsure, post the output of proc contents for both datasets here.
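For reference, that's just (a minimal sketch, assuming the Z library is already assigned):

proc contents data=z.all_sites;
run;

proc contents data=z.data_large;
run;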
Thanks a lot. Below is the proc contents output for both datasets. I only need ID, VAR1, and VAR2 from DATA_LARGE, and I'm tempted to keep only the variables I need. However, since I cannot remove the original DATA_LARGE from the folder, this would just reduce the free space in folder R.
Correct?
What do you think?
DATA R.DATA_LARGEA;
   SET R.DATA_LARGE(KEEP=ID VAR1 VAR2);
   WHERE VAR1 NE .;
RUN;
You can do all of the work in one step:
data h.merged (compress=yes);
   merge
      z.all_sites (in=a)
      z.data_large (
         in=b
         keep=id var1 var2
         where=(var1 ne .)
      )
   ;
   by id;
   if a and b;
run;
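One assumption worth making explicit: MERGE with a BY statement requires both inputs to be sorted (or indexed) by ID. If either input is unsorted, sort steps along these lines would be needed first (a sketch using the dataset names above; note that sorting the 76 GB table will need substantial utility space):

proc sort data=z.all_sites;
   by id;
run;

proc sort data=z.data_large;
   by id;
run;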
Please post the log from that step if it fails. Use the {i} button for posting logs.
PS: if you enable listing output in Enterprise Guide, you can copy/paste the text into a {i} window as well, so you don't need to make screenshots.
It's my favorite method of posting results.
@Cruise wrote:
Hi Folks, I have a workspace issue.
I ran SAS overnight to merge Z.ALL_SITES (76.4 GB) and Z.DATA_LARGE (560 MB). SAS crashed with "out of memory" this morning. I cleaned the H and Z folders except for the datasets needed. Folder H has 1.96 TB of free space and folder Z has 490 GB of free space.
A minor point here: running out of memory is not the same thing as not having enough free disk space. I think @Kurt_Bremser is on the right track to solving the problem.
The "out of memory" issue is kind of weird given the code you've posted.
The merge step only ever loads a single row at a time into memory (the PDV) and therefore does not require much memory.
The only reason I can think of that the code you've posted could throw an "out of memory" error is if the "tables" in the MERGE statement are actually views, and it's these views that consume a lot of memory while they execute (which happens as part of the merge).
...so: if my theory applies, the first step is to materialize the views as tables (with option compress=yes), and then, if necessary, rework the code behind those views to be more memory efficient.
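For illustration, materializing a view as a compressed table could look like this (a sketch only; z.all_sites_tbl is a hypothetical name, and z.all_sites stands in for whichever input is actually a view):

data z.all_sites_tbl (compress=yes);
   /* if z.all_sites is a view, this step executes it once
      and stores the result as a physical, compressed table */
   set z.all_sites;
run;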
Kurt, I'm running the code you suggested. Last time it crashed after a few hours when I ran it overnight, but this time I don't know how many hours it will run. I will post the log as soon as it's finished.
I'm concerned about Astounding's pointer that I might need more space on the H drive. The H drive is kept empty on purpose to hold the output MERGED dataset. I can't easily change the size of this folder because it's not my private computer.
It appears @Ksharp actually looked at your Proc Contents report and calculated the required memory for the variables you're after. If that's only around 1/2 GB, then you likely have sufficient memory and the data step hash lookup could work.
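If you want to sanity-check such an estimate yourself, here's a rough sizing sketch (assumptions on my part: ID, VAR1, and VAR2 are numeric at 8 bytes each, and hash per-item overhead is ignored, so treat the result as a lower bound):

proc sql;
   /* row count from dictionary.tables; 3 numeric variables at 8 bytes each */
   select nobs,
          nobs * (3 * 8) / 1024**3 as est_hash_gb format=8.2
   from dictionary.tables
   where libname = 'Z' and memname = 'DATA_LARGE';
quit;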
The code below is based on what you've posted, but of course it's not tested, as you haven't provided sample data. Give it a go and let us know if it works.
data h.merged;
   if _n_=1 then
      do;
         /* set up PDV host variables for the hash data items */
         if 0 then set z.data_large(keep=id var1 var2);
         declare hash h1 (dataset:'z.data_large(keep=id var1 var2)');
         h1.defineKey('id');
         h1.defineData('var1','var2');
         h1.defineDone();
      end;
   set z.all_sites;
   /* output only rows whose id is found in the hash (inner-join logic) */
   if h1.find()=0 then output;
run;
Thanks for the "like". Please use the latest (current) cut of my code for testing as I had to fix it since the initial post.
And just for clarity: the code I've posted assumes that the relationship between z.all_sites and z.data_large is N:1 or 1:1.
If the relationship is 1:N and you're after left-join logic, then the code I've posted would require some additional looping (using the hash DO_OVER method).
Hi Patrick,
This is a 1:N relationship. The 1 side has unique IDs with no duplicates. The N side is every claim a patient has ever made in their entire medical history, so 1:N can actually range from 1:1000 all the way to 1:5000, because the claims data captures every drug, diagnosis, or procedure across multiple chronic diseases.
What can be done in this context? The looping you mentioned? Can you extend the code for looping, if possible?
O.K., I'll post some code. Without sample data to test, though, I can't promise that it will be syntactically correct. It's getting late where I live, so others might need to take over after that.
I'll provide sample data ASAP.
Here you go:
data h.merged;
   if _n_=1 then
      do;
         /* set up PDV host variables for the hash data items */
         if 0 then set z.data_large(keep=id var1 var2);
         /* multidata:'y' allows multiple data items per key (the 1:N case) */
         declare hash h1 (dataset:'z.data_large(keep=id var1 var2)', multidata:'y');
         h1.defineKey('id');
         h1.defineData('var1','var2');
         h1.defineDone();
      end;
   set z.all_sites;
   /* do_over iterates over every hash entry for the current id,
      outputting one joined row per entry */
   do while(h1.do_over() eq 0);
      output;
   end;
run;