Is h1 taking its "h" from libname h? I'm trying to customize the code to my actual directories. Thanks.
h1 is the name of the hash object the code defines (DCL is short for DECLARE). You could use any name here. Conceptually it's the same as defining a libref in a LIBNAME statement.
The libref for your source table is used within the dataset option of the hash DECLARE statement.
Since Z.DATA_LARGE (560 MB) is very small, a hash table is the best way to do it.
Or try PROC FORMAT.
@Patrick I get this error.
254 data O.MERGED;
255 if _n_=1 then
256 do;
257 if 0 then set R.data_large(keep=(id var1 var2));
-
214
23
258 dcl h1 (dataset:'R.data_large(keep=(id var1 var2))';
--
565
ERROR: DATA STEP Component Object failure. Aborted during the COMPILATION phase.
ERROR 214-322: Variable name ( is not valid.
ERROR 23-7: Invalid value for the KEEP option.
ERROR 565-185: H1 is not a known class name.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
259 h1.defineKey('id');
260 h1.defineData('var1','var2');
261 h2.defineDone();
262 end;
263 set R.all_sites;
264 do while(h1.do_over() eq 0);
265 output;
266 end;
267 run;
That's why I normally need sample data. I tend to add "typos" to my code which I can normally spot and fix quickly when testing. Here: no brackets are required for the dataset KEEP option, and the DCL statement needs the class name, i.e. dcl hash h1.
data h.merged;
if _n_=1 then
do;
if 0 then set z.data_large(keep=id var1 var2);
dcl hash h1 (dataset:'z.data_large(keep=id var1 var2)');
h1.defineKey('id');
h1.defineData('var1','var2');
h1.defineDone();
end;
set z.all_sites;
do while(h1.do_over() eq 0);
output;
end;
run;
And @novinosrin is absolutely right: MULTIDATA is also required in your case. So here is the latest cut:
data h.merged;
if _n_=1 then
do;
if 0 then set z.data_large(keep=id var1 var2);
dcl hash h1 (dataset:'z.data_large(keep=id var1 var2)',multidata:'y');
h1.defineKey('id');
h1.defineData('var1','var2');
h1.defineDone();
end;
set z.all_sites;
do while(h1.do_over() eq 0);
output;
end;
run;
To whom it may concern,
I think the moment DO_OVER (the shorthand equivalent of the FIND/FIND_NEXT loop) is in use, and the lookup is 1:N, dcl hash h1 (dataset:'large',multidata:'y'); would be required.
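For readers new to MULTIDATA: the point is that one key can map to several data rows, and the DO_OVER loop retrieves all of them. A rough analogy in Python (not SAS; the sample values are made up to mirror the thread's data):

```python
# Hypothetical sample rows (id, code, date) standing in for data_large.
large = [
    ("AALLSS", "4019", 16636),
    ("AALLSS", "4019", 16840),
    ("ALLSS", "25000", 16840),
]

# Like DCL HASH ... multidata:'y': one key keeps every matching row.
lookup = {}
for key, code, date in large:
    lookup.setdefault(key, []).append((code, date))

# Like the do_over() loop: output one row per matching hash entry.
small = ["AALLSS"]
merged = [(key, code, date)
          for key in small
          for code, date in lookup.get(key, [])]
```

Without MULTIDATA (a plain dict of single values), only one of the two "AALLSS" rows would survive, which is why a 1:N lookup needs it.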
@Cruise That's just a syntax error
Here is the correction:
dcl hash h1 (dataset:'R.data_large(keep=id var1 var2)');
The code below got SAS running, but gave me the error shown when I used the latest cut from @Patrick, adding hash to the DCL statement as @novinosrin suggested. I'll post the sample data soon.
353 data k.merged;
354 if _n_=1 then
355 do;
356 if 0 then set R.data_large(keep=id var1 var2);
NOTE: Data file R.data_large is in a format that is native to another host, or the file encoding does
not match the session encoding. Cross Environment Data Access will be used, which might
require additional CPU resources and might reduce performance.
357 dcl hash h1 (dataset:'R.DX(keep=id var1 var2)',multidata:'y');
358 h1.defineKey('ID');
359 h1.defineData('var1','var2');
360 h1.defineDone();
361 end;
362 set R.all_sites;
363 do while(h1.do_over() eq 0);
364 output;
365 end;
366 run;
NOTE: Data file R.data_large.DATA is in a format that is native to another host, or the file encoding does
not match the session encoding. Cross Environment Data Access will be used, which might
require additional CPU resources and might reduce performance.
ERROR: Hash object added 22020080 items when memory failure occurred.
FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.
ERROR: The SAS System stopped processing this step because of insufficient memory.
WARNING: The data set K.MERGED may be incomplete. When this step was stopped there were 0
observations and 4 variables.
WARNING: Data set K.MERGED was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 1:31.52
cpu time 19.79 seconds
ERROR: Hash object added 22020080 items when memory failure occurred
That's now a totally different issue. A SAS Hash table resides in memory and it appears that the load into memory failed after 22M rows.
That's nothing we can fix unless you get more memory for your session. Maybe excluding all rows with a missing var1 (where=(var1 ne .)) also helps to keep the total row count below this 22M+ limit - something I should have added to the code from the start.
You can estimate the memory requirement by summing up the lengths of the 3 variables loaded into the hash, multiplied by the number of rows in your table (those with a non-missing var1).
You can get information about memory available via Proc Options group=memory;run;
You will need to talk to your SAS admin if you require more memory than allowed to you for a SAS session.
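As a back-of-the-envelope illustration of that estimate (done here in Python, not SAS; the 8-byte lengths are an assumption, so check your actual variable lengths with PROC CONTENTS):

```python
# Assumed lengths: three 8-byte numeric variables (id, var1, var2).
row_bytes = 8 + 8 + 8
rows = 22_020_080  # items the log says were added before the failure
approx_mb = row_bytes * rows / 1024 ** 2
print(f"roughly {approx_mb:.0f} MB for the raw variable values alone")
# The true footprint is larger: the hash object also stores key copies,
# pointers and padding per item, so budget a generous multiple of this.
```

That already lands around half a gigabyte before hash overhead, which is why the load can fail even when the source data set itself is only 560 MB on disk.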
data h.merged;
if _n_=1 then
do;
if 0 then set z.data_large(keep=id var1 var2);
dcl hash h1 (dataset:'z.data_large(keep=id var1 var2 where=(var1 ne .))',multidata:'y');
h1.defineKey('id');
h1.defineData('var1','var2');
h1.defineDone();
end;
set z.all_sites;
do while(h1.do_over() eq 0);
output;
end;
run;
PROC SORT NODUPKEY is memory intensive as well. At this point I'm wondering whether to use the hash table approach one site at a time using a WHERE statement. The all_sites data has 40 sites, and I can do (where site=1 through site=40) one site at a time.
1. Will the hash table save time compared with a MERGE in a DATA step?
2. Can you point out where I can insert the WHERE statement in the hash table code?
Thanks in advance.
Hi @Cruise Hmm, PROC SORT isn't working? I was about to recommend the below (which clears memory) with confidence, but now I'm recommending it without confidence.
The following assumes both datasets have the same IDs. The comments will give you an idea.
/*Creating samples- full deck N */
data large;
set sashelp.class;
run;
/*Creating small with unique Id's as you mentioned 1:N*/
proc sort data=sashelp.class out=small(keep=sex) nodupkey;
by sex;
run;
proc sort data=large ;
by sex;
run;
/*The following requires a sorted approach, hence the PROC SORT by the key*/
data want;
if _n_=1 then
do;
if 0 then set small large;
dcl hash h1 (multidata:'y');
h1.defineKey('sex');
h1.defineData('name','age','height','weight');
h1.defineDone();
end;
/*Read your large one BY group at a time and load
that group into the hash object*/
do until(last.sex);
set large;
by sex;
rc=h1.add();
end;
/*Read your small and look up against the large*/
do until(last.sex);
set small;
by sex;
do while(h1.do_over() eq 0);
output;
end;
end;
/*Clear the contents of hash freeing memory space after processing each id*/
h1.clear();
drop rc:;
run;
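The memory saver in the step above is that only one BY group at a time lives in the hash, and h1.clear() releases it before the next group loads. The same pattern sketched in Python (illustrative only; names and sample values are made up, using the sashelp.class example's sex key):

```python
from itertools import groupby
from operator import itemgetter

# Both inputs sorted by the key, mirroring the PROC SORT steps above.
large = [("F", "Alice"), ("F", "Barbara"), ("M", "Alfred")]
small = {"F", "M"}  # one row per key, like the NODUPKEY output

out = []
for key, rows in groupby(large, key=itemgetter(0)):
    # Load only the current BY group into the "hash" ...
    h1 = {key: [name for _, name in rows]}
    # ... look it up from the small side, one output row per match ...
    if key in small:
        for name in h1[key]:
            out.append((key, name))
    # ... then clear, so peak memory is one group, not the whole table.
    h1.clear()
```

Peak memory is bounded by the largest single BY group rather than the full 22M+ rows, which is the whole point of this variant.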
PROC SORT NODUPKEY by ID, CODES and DATE halved the data size. However, the code you provided output a WANT data set with 0 observations. Any idea what I might be doing wrong at my end? Below are my large and small data.
DATA LARGE;
INPUT ID $ CODES $ DATE;
CARDS;
ALLSS 04185 18127
ALLSS 04185 18129
ALLSS 04185 18130
ALLSS 25000 16840
ALLSS 25000 16874
ALLSS 25000 16972
ALLSS 25000 17063
ALLSS 25000 17184
ALLSS 25000 17188
AALLSS 25000 17231
AALLSS 25000 17321
AALLSS 25000 17458
AALLSS 25000 17604
AALLSS 25000 17867
AALLSS 25000 17979
AALLSS 25000 18078
AALLSS 25002 17374
AALLSS 2669 17545
AALLSS 2689 17374
AALLSS 2689 17601
AALLSS 2720 17184
AALLSS 2724 16840
AALLSS 2724 17063
AALLSS 2724 17175
AALLSS 2724 17231
AALLSS 2724 17321
AALLSS 2724 17374
AALLSS 2724 17601
AALLSS 2724 17867
AALLSS 2724 17979
AALLSS 2729 17188
AALLSS 2752 17184
AALLSS 2772 17825
AALLSS 27800 17184
AALLSS 2859 16552
AALLSS 2859 16664
AALLSS 2859 16852
AALLSS 2859 16909
AALLSS 2948 17184
AALLSS 30000 17545
AALLSS 311 16874
AALLSS 311 16972
AALLSS 311 17321
AALLSS 311 17458
AALLSS 311 18078
AALLSS 32723 17184
AALLSS 3310 17188
AALLSS 4019 16636
AALLSS 4019 16840
AALLSS 4019 16874
AALLSS 4019 16909
AALLSS 4019 16972
AALLSS 4019 17063
AALLSS 4019 17175
AALLSS 4019 17184
AALLSS 4019 17188
AALLSS 4019 17231
AALLSS 4019 17321
AALLSS 4019 17374
AALLSS 4019 17458
AALLSS 4019 17601
AALLSS 42760 17604
AALLSS 4293 17184
AALLSS 43491 17184
AALLSS 43491 17185
AALLSS 43491 17186
AALLSS 43491 17187
AALLSS 4660 17825
AALLSS 496 16840
AALLSS 496 16874
AALLSS 496 16905
AALLSS 496 16935
AALLSS 496 16966
AALLSS 496 16996
AALLSS 496 17027
AALLSS 496 17058
AALLSS 496 17088
AALLSS 496 17119
AALLSS 496 17149
AALLSS 496 17180
AALLSS 496 17211
AALLSS 496 17239
AALLSS 496 17270
AALLSS 496 17300
AALLSS 496 17331
AALLSS 496 17361
AALLSS 496 17392
AALLSS 496 17423
AALLSS 496 17453
AALLSS 496 17458
AALLSS 496 17484
AALLSS 496 17514
AALLSS 496 17545
AALLSS 496 17576
AALLSS 496 17605
AALLSS 496 17636
AALLSS 496 17666
AALLSS 496 17697
AALLSS 496 17727
AALLSS 496 17758
;
DATA SMALL;
MBR_ID='AALLSS';
RUN;
Hi @Cruise
Please clarify the below:
1. Do both datasets have the same set of IDs, or could they differ? /*This is very important, as you will have seen from the note in my previous post*/
2. Do the names of the IDs (in other words, the keys) differ?