Re: insufficient memory

sanjaymane7 · Posted 01-16-2024 05:02 PM

Hi, I am facing an insufficient memory error while executing the below code. I have 32 GB RAM, 500 SSD/ HD laptop. Please suggest.

data A;
input Employee_Id Date :ddmmyy. Plan $ ;
format Date date9. ;
cards;

1100589 03-02-2023 Silver
1100589 04-02-2023 Silver
1100589 05-02-2023 Silver
1100589 06-02-2023 Bronze
1100589 19-05-2023 Platinum
1100589 19-05-2023 Gold
1100589 23-05-2023 Bronze
1100589 25-05-2023 Bronze
1100589 27-05-2023 Bronze
1100589 06-07-2023 Gold
1100589 07-07-2023 Bronze
1100589 12-07-2023 Bronze
1100589 30-10-2023 Bronze
1100600 . .
;

data B;
input Employee_Id Date :ddmmyy. Plan $ Claim_amt ;
format Date date9. ;
cards;
1100589 01-02-2023 . 100
1100589 08-02-2023 . 200
1100589 21-04-2023 . 300
1100589 11-07-2023 . 400
1100589 30-10-2023 . 500
1100600 01-11-2023 . 700
;

data want;
if _n_=1 then do;
if 0 then set A;
declare hash h(dataset:'A');
h.definekey('Employee_Id','Date');
h.definedata('Plan');
h.definedone();
end;
set B;
do k=Date to Date+7 by +1;
if h.find(key:Employee_Id,key:k)=0 then leave;
end;
format k date9.;
run;

SASKiwi · Posted 01-16-2024 05:11 PM

Run this to confirm your SAS MEMSIZE setting and post the results:

proc options option = memsize;
run;

Patrick · Posted 01-16-2024 05:47 PM

Memory issues with the code and data you've posted? That shouldn't consume much memory. Here from a run on my laptop.

Your table A gets fully loaded into memory (only the key and data variables of course). In memory variables get fully expanded so should one of your data or key variables have a big length (like a description field) then memory consumption can grow quickly.

Also be aware that the memory available to a SAS session is per installation default limited to 2GB. You can check what applies for you by running:

proc options group=memory;
run;

Ksharp · Posted 01-16-2024 08:58 PM

If you were using stand alone (pc version) SAS ,you could make your sas have more memory .

Kurt_Bremser · Posted 01-17-2024 01:52 AM

What is the real size of your datasets (number of observations, number of variables used in the hash)?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

sanjaymane7 · Posted 01-17-2024 04:13 AM

I have 130 millions variables in dataset A. The same code works fine on sample record but it doesn't work on real dataset due to hug size.

Error log is -

WARNING: Multiple lengths were specified for the variable Employee_id by input data set(s). This can cause truncation of data.
ERROR: Hash object added 44040176 items when memory failure occurred.
FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.
ERROR: The SAS System stopped processing this step because of insufficient memory.
WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 0 observations and 12 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 1:00.24
cpu time 40.14 seconds

Patrick · Posted 01-17-2024 05:31 AM

@sanjaymane7 Based on the log you shared you would need around 3 times the memory available to your SAS session in order to load your table into the hash. You need to talk to your SAS admin if that's feasible.

Log message "WARNING: Multiple lengths were specified for the variable Employee_id by input data set(s). This can cause truncation of data." indicates that variable employee_id is of type character in your real data. If you could shorten the length of the variable prior to loading into the hash then that should also save some memory - because it gets loaded as CHAR with all the trailing blanks consuming memory up to the full length of the variable.
For the find() method you then could do something like: if h.find(key:strip(Employee_Id),key:k)=0 then leave;

...and...

Is one or both of your source tables by any chance already sorted by employee_id?

How many rows are in your real "table B"?

Quentin · Posted 01-17-2024 09:26 AM

@Patrick 's question about sort order is a good hint. If your data are sorted by Employee_ID, then this can be changed to an approach where you would hash in chunks. For each ID, you would load the hash table with data for that Employee_ID, then do the look-ups for that ID. With that approach, you only need enough memory to load the hash table for one ID.

Alternatively, if work.A is sorted by ID and date, and if work.B is not huge, you could always make a new work.B with seven records per ID (one per each date), and then just merge that against work.A to do the lookup. Part of the original genius of the DATA step is that it was built for computers with little memory, so MERGE doesn't care how many records you have.

Yet another option would be to try a SQL join and see how the optimizer approaches the problem. It would likely depend on the size of work.B.

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

Kurt_Bremser · Posted 01-17-2024 05:41 AM

What are the lengths of employee_id and plan in dataset "A"? What is your MEMSIZE setting?

And is your SAS running on your PC, or are you connecting to a server?

As a first step, make sure that employee_id is defined with the same length in all your datasets. Similarly, look at the contents of plan

proc sql;
select max(length(plan)) from a;
quit;

to see if plan is properly defined there.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

sanjaymane7 · Posted 01-17-2024 06:04 AM

Length of employee id is 10 char in both table.

I am using laptop with the following config - 32GB ram and 500 GB hard disk, processor - Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz 1.99 GHz, system type - 64-bit operating system, x64-based processor.

Kurt_Bremser · Posted 01-17-2024 07:40 AM

@sanjaymane7 wrote:

Length of employee id is 10 char in both table.

Positively not. From your log:

WARNING: Multiple lengths were specified for the variable Employee_id by input data set(s). This can cause truncation of data.

This message will only appear when the variable is defined differently in the datasets.

I am using laptop with the following config - 32GB ram and 500 GB hard disk, processor - Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz 1.99 GHz, system type - 64-bit operating system, x64-based processor.

That's what you use. But does your SAS use it (meaning that you have a fully local SAS installation)?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Patrick · Posted 01-17-2024 09:54 AM

@sanjaymane7 wrote:

Length of employee id is 10 char in both table.

I am using laptop with the following config - 32GB ram and 500 GB hard disk, processor - Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz 1.99 GHz, system type - 64-bit operating system, x64-based processor.

The relevant bit is where SAS is installed and executes (not EG/Studio but the server you connect to). The specs of your laptop are only relevant if you connect to a locally installed SAS.

IF you have a locally installed SAS then follow @Ksharp's recommendation and increase the max memsize that a SAS session can consume to something like 16GB.

FreelanceReinh · Posted 01-17-2024 05:36 AM

Hi @sanjaymane7,

I would expect the MEMSIZE adjustment suggested by others to solve the issue. In addition, you could reduce the memory requirement for the hash object, if needed: Your sample data (using 3*8=24 bytes plus overhead for the key and data items of one observation) suggest that Employee_ID and Date could be combined into a single 8-byte key. On my workstation the value of h.item_size drops from 64 to 48 by doing so. Abbreviating Plan to a 1-byte data item does not reduce this value further, though. What is the item size with your real data?

Tom · Posted 01-17-2024 10:28 AM

I doubt that HASH is the way to go for this problem (even if you could force all of the data into memory). That is because it looks like you want to find the plan that was in effect on a date. Normally the information on plans has START/STOP dates and not a record for every date in the interval. But a HASH object requires the keys to match exactly.

A more efficient method would be to interleave the two datasets and remember the most recent plan that was in effect at the event date.

So assuming that A represents when the subject started on a plan and B represents the event date (claim date) then something like this should work.

data want2;
  set a(in=inplan) b(in=inevent);
  by employee_id date ;
  if inplan then do;
    lastplan=plan;
    lastdate=date;
  end;
  retain lastplan lastdate;
  if inevent and not first.employee_id then do;
    if date - 7 <= lastdate <= date then plan=lastplan;
    output;
  end;
  format lastdate yymmdd10.;
run;

Which will produce slightly different results than your code which is looking for the NEXT plan they signed up for instead of the CURRENT plan they have.

Tom · Posted 01-17-2024 11:16 AM

And if you did want to search for the NEXT plan instead of the current plan then perhaps an ARRAY would be better.

Use the DATE as the index into the array. Make sure to define the array large enough for all possible event and plan dates (and don't forget about the extra 7 days).

data want3;
  do until(last.employee_id);
    set a(in=inplan where=(not missing(date))) b(in=inevent) ;
    by employee_id  ;
    array plans [%sysfunc(mdy(1,1,2023)):%sysfunc(mdy(1,31,2024))] $8 ;
    drop plans: ;
    if inplan then do;
      plans[date]=plan; 
    end;
    if inevent then do;
      do k=date to date+7 until(not missing(plan));
        plan=plans[k];
      end;
      output;
    end;
  end;
  format k yymmdd10.;
run;

Result: