Solved: what's the limit to how many elements(variables) a SAS array can hold?

novinosrin · Posted 09-27-2018 03:23 PM

What's the limit on how many elements (variables) a SAS array can hold?

1. If it is subject to memory, what is that limit?

2. Why are temporary arrays the fastest, i.e. what makes them incredibly fast?

3. Does processing efficiency differ between implicit and explicit arrays? The SAS docs discourage implicit arrays, but why?

4. Does processing efficiency differ between retained and non-retained arrays? If yes, why?

5. Following on from question 4, should I assume processing efficiency takes a hit when combining retained and non-retained, compile-time and execution-time variable arrays?

My initial guess: perhaps the physical addresses are contiguous in easily grouped arrays? Either way, my glass is not even close to half full 😞

Are there any clear (I mean really clear) docs? It's only fair that I don't take up your valuable time, so pointers to docs for my own research will suffice.

More than the what, I am keen to understand the why.

Best regards

s_lassen · Posted 09-28-2018 05:19 AM

My two cents:

Temporary arrays are faster because they are actual arrays, that is, contiguous blocks of memory. Variable arrays may also be in contiguous memory, but as they do not have to be (and character variable arrays can have elements of different lengths), SAS creates an address table to look the elements up. So, if you refer to a variable array, SAS first uses the offset to find an entry in the address table, and then uses that to find the variable. If you use a _TEMPORARY_ array, the elements are found simply by calculating the offset from the base address.

Implicit array references (do over) are not encouraged, because SAS Institute intends to do away with them (it has been an undocumented feature for years, now). So if you use them, you risk that your program will have to be rewritten when the next version of SAS comes out. Performance-wise, DO OVER is just the same as explicit array processing (a temporary variable named _I_ is created in the background and used to loop through the array).
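One rough way to check the speed claim on your own system is a timing sketch like the one below. The array size (1000), the repeat count (10000), and all variable names are arbitrary assumptions for illustration; the measured difference will vary with platform and SAS release.

```sas
/* Illustrative timing sketch: sizes and names are assumptions, not from the thread. */
data _null_;
   array v[1000];                 /* variable array: creates v1-v1000 */
   array t[1000] _temporary_;     /* temporary array: no variables are created */

   /* Time repeated assignment into the variable array. */
   start = datetime();
   do rep = 1 to 10000;
      do i = 1 to dim(v);
         v[i] = i;
      end;
   end;
   variable_secs = datetime() - start;

   /* Time the same work on the temporary array. */
   start = datetime();
   do rep = 1 to 10000;
      do i = 1 to dim(t);
         t[i] = i;
      end;
   end;
   temporary_secs = datetime() - start;

   put variable_secs= temporary_secs=;
run;
```

Raise the repeat count if both timings come out too small to compare.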


data_null__ · Posted 09-27-2018 04:16 PM

Here is a simple macro to investigate question 1.

options fullstimer;
%macro main(arg);
   %local i x;
   /* Try array sizes 10, 100, ..., 10**&arg and watch the log for memory use. */
   %do i = 1 %to &arg;
      %let x = %sysevalf(1e&i,integer);
      %put NOTE: &=i &=x;
      data _null_;
         array a[&x];
         run;
      %end;
   %mend main;
%main(5);

ballardw · Posted 09-27-2018 04:35 PM

The limit does seem to be memory-bound:

data work.junk;
   array z{10000000};
run;

ERROR: The SAS System stopped processing this step because of insufficient memory.
NOTE: DATA statement used (Total process time):
      real time           1:51.52
      cpu time            1:51.03

but whether that reflects my actual system or program-use limits, who knows.

There was no problem creating 1 million variables, while 10 million failed as shown above. So the limit is somewhere in between, on my system at least.

If by "implicit array" you mean accessing the elements using DO OVER I would be fairly certain that implicit arrays are somewhat slower because of finding limits and the manipulation in the background of whatever substitutes for an explicit index. Why not use implicit arrays? Try doing a 2 or higher order dimension array. Or using aligned values in two or more separate arrays.

By "retained array" do you mean all elements of an array appear on a RETAIN statement? Or an implicit retain such as a[I] +1? Or something else?

And as a general comment on "processing efficiency": there are a number of ways to measure "efficiency". Is it execution time, memory use, programming time, program maintenance time, or some combination of all of them? Is it actually efficient to spend 10 hours finding a process that shaves 0.0001 seconds from run time if the program is only ever going to run 10 times? Or 10,000,000 times?

Kurt_Bremser · Posted 09-28-2018 02:29 AM

I ran into the insufficient memory error much earlier (400000 array elements). Since the pure data size of this array (about 3.2 MB at 8 bytes per numeric element) is way below the MEMSIZE I have (256 MB), it has to be the allocation of names (32 bytes each) which causes the memory overflow. I could verify this by creating a temporary array, which happens noticeably faster, and overflows somewhere between 10 and 100 million elements.

Temporary arrays have no entries in the variable name table.
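A minimal sketch for repeating this comparison is shown below. The macro name `probe` and its `temp=` parameter are hypothetical helpers, not something from this thread, and the size of 100000 is an arbitrary starting point. With FULLSTIMER on, the log should report the memory used by each step.

```sas
/* Hypothetical helper: allocate an array of &n elements,
   either as variables or as a _TEMPORARY_ array. */
%macro probe(n, temp=NO);
   data _null_;
      %if %upcase(&temp) = YES %then %do;
         array a[&n] _temporary_;   /* no variable names are created */
      %end;
      %else %do;
         array a[&n];               /* creates variables a1-a&n */
      %end;
   run;
%mend probe;

options fullstimer;
%probe(100000)
%probe(100000, temp=YES)
```

Increase the size until one of the two steps fails to see where each kind of array hits the memory limit on your installation.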

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mkeintz · Posted 09-27-2018 05:56 PM

As other respondents to your questions have shown, every one of your questions (except 3, which is a "why") is answerable by relatively straightforward coding experiments - which I suspect can be an addictive part of programming. After all, in programming, it's usually easy to answer questions of resource limits.

As to why temporary arrays are faster, consider two fundamental attributes of temporary arrays:

1. The array values are automatically retained across iterations of the data step. There is no need to reset them to missing, or to replace them when a SET, MERGE, or UPDATE statement is encountered.
2. The values are not output to the resulting data set, so there is no need to put them into the output buffer.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ballardw · Posted 09-27-2018 07:36 PM

@mkeintz wrote:

As other respondents to your questions have shown, every one of your questions (except 3, which is a "why") is answerable by relatively straightforward coding experiments - which I suspect can be an addictive part of programming. After all, in programming, it's usually easy to answer questions of resource limits.

As to why temporary arrays are faster, consider two fundamental attributes of temporary arrays:

1. The array values are automatically retained across iterations of the data step. There is no need to reset them to missing, or to replace them when a SET, MERGE, or UPDATE statement is encountered.

2. The values are not output to the resulting data set, so there is no need to put them into the output buffer.

Better phrased than what came to me when I was flashing-back to Assembler coding of arrays ...

Kurt_Bremser · Posted 09-28-2018 02:36 AM

@mkeintz wrote:

As other respondents to your questions have shown, every one of your questions (except 3, which is a "why") is answerable by relatively straightforward coding experiments - which I suspect can be an addictive part of programming. After all, in programming, it's usually easy to answer questions of resource limits.

As to why temporary arrays are faster, consider two fundamental attributes of temporary arrays:

1. The array values are automatically retained across iterations of the data step. There is no need to reset them to missing, or to replace them when a SET, MERGE, or UPDATE statement is encountered.

2. The values are not output to the resulting data set, so there is no need to put them into the output buffer.

Add

3. no names for individual array elements need to be created (32 bytes each).

This sheds new light on the discussion about increasing the possible size of SAS variable names. Imagine the SAS system allowed 256 characters: a 100000-element array would need less than 1 MB for the numeric data, but about 25 MB just for the names!
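The arithmetic can be checked with a short DATA step. This sketch assumes, as the post does, that each element costs 8 bytes of numeric data plus one full-length name slot; the real per-variable overhead inside SAS may well differ.

```sas
/* Back-of-the-envelope check; the cost model is an assumption taken from the post. */
data _null_;
   n_elements     = 100000;
   data_bytes     = n_elements * 8;     /* 8 bytes per numeric value   */
   name_bytes_32  = n_elements * 32;    /* current 32-byte name limit  */
   name_bytes_256 = n_elements * 256;   /* hypothetical 256-byte names */
   put data_bytes= comma12. name_bytes_32= comma12. name_bytes_256= comma12.;
run;
```

Under that model, 100000 elements need 800,000 bytes of data, 3,200,000 bytes of 32-byte names, and 25,600,000 bytes of 256-byte names.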


novinosrin · Posted 09-28-2018 10:38 AM

Thank you @s_lassen. While all the responses are indeed great, the mechanics of lookup in contiguous/non-contiguous blocks of memory containing addresses is something I have been desperately seeking to understand. I think you have nailed what I am after. Perhaps playing around with ADDR/ADDRLONG and PEEK/PEEKLONG will eventually get me there. Very interesting and sound direction. Tak!!!!

data_null__ · Posted 11-28-2018 07:16 PM

Syntax note. Implicit array syntax has an optional index-variable parameter to define an index variable other than _I_.

data _null_;
   array s(j) s1-s4;
   do over s;
      s = j;
      put _all_;
      end;
   run;

j=1 s1=1 s2=. s3=. s4=. _ERROR_=0 _N_=1
j=2 s1=1 s2=2 s3=. s4=. _ERROR_=0 _N_=1
j=3 s1=1 s2=2 s3=3 s4=. _ERROR_=0 _N_=1
j=4 s1=1 s2=2 s3=3 s4=4 _ERROR_=0 _N_=1
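For comparison, here is a sketch of the same loop written with an explicit array reference. It fills s1-s4 the same way, although PUT _ALL_ may list the variables in a different order because j is created after s1-s4 here.

```sas
/* Explicit-array equivalent of the DO OVER example (illustrative sketch). */
data _null_;
   array s[4] s1-s4;
   do j = 1 to dim(s);
      s[j] = j;        /* explicit subscript instead of the implicit index */
      put _all_;
   end;
run;
```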

novinosrin · Posted 11-28-2018 08:34 PM

Guru @data_null__ Thank you. Very nice indeed.

mkeintz · Posted 11-29-2018 04:19 PM

In the case of non-temporary arrays, SAS appears to move the variables to make the array elements contiguous. For example, in the program below, I declared the array N with each of the six possible orderings of the variables age, height, and weight from sashelp.class. I took 24 bytes (peekclong) starting from the address of N{1}, and then assigned the first 8 bytes to variable new1, the second 8 bytes to new2, and bytes 17-24 to new3.

In each case new1, new2, new3 had identical values to the original variables assigned to the array - NO MATTER WHAT ORDER I USED. So in this case SAS moved variables to maintain an order compatible with the array declaration. BTW, the storage location of the array is quite different than the storage locations used when the variables are not assigned to an array - the variables are MOVED, not COPIED. So variables assigned to an array are re-arranged to improve performance in array usage. (I didn't examine the case of character variables).

Of course, if I used those variables in a different order in a second array, that additional array would not have the benefit of contiguously stored elements.

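data _null_;

  set sashelp.class (obs=1 );
  put (_numeric_) (=) //;

  /* Activate exactly one ARRAY statement to test a given ordering. */
  *array n {*} height weight age;
  *array n {*} weight age height;
  *array n {*} age height weight;
  *array n {*} age weight height ;
  array n {*} weight height age ;
  *array n {*} height age weight;

  /* Read 24 raw bytes starting at the address of the first element. */
  bytes24=peekclong(addrlong(n{1}),24);
  /* Interpret each 8-byte slice as a native floating-point number. */
  new1=input(substr(bytes24,01,8),rb8.);
  new2=input(substr(bytes24,09,8),rb8.);
  new3=input(substr(bytes24,17,8),rb8.);
  /* Compare each array element with the value read from memory. */
  put n{1}= @15 new1= /
      n{2}= @15 new2= /
      n{3}= @15 new3=;
run;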
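To probe the remark about a second array over the same variables, a sketch like the following prints the address of every element of two arrays declared in different orders. The names n1, n2, addr1, and addr2 are illustrative, and the $HEX16. display assumes a 64-bit session where ADDRLONG returns an 8-byte pointer; on little-endian hosts the bytes appear in reversed order.

```sas
/* Illustrative sketch: compare element addresses across two arrays. */
data _null_;
   set sashelp.class (obs=1);
   array n1 {*} weight height age;
   array n2 {*} age weight height;   /* same variables, different order */
   length addr1 addr2 $16;
   do i = 1 to dim(n1);
      addr1 = put(addrlong(n1{i}), $hex16.);   /* pointer shown as hex text */
      addr2 = put(addrlong(n2{i}), $hex16.);
      put i= addr1= addr2=;
   end;
run;
```

Each variable should show the same address whichever array refers to it, so at most one of the two arrays can have its elements stored in declaration order.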


novinosrin · Posted 11-29-2018 04:45 PM

Thank you, and nice work. I think the caveat here is when we include an element more than once in the same non-temporary, retained array coming from the source data set:

  array n {*} weight height age weight;

This is confusing to me, as I do not yet have a sound understanding of how the physical addresses of the two weight elements are generated.
