Historically, the SAS PDV only permitted about 32k variables. Has this changed in recent years? If so, what is the current maximum number of variables allowed by the PDV?
If you get close to running out of variable slots, you should seriously reconsider what you are doing and how you are doing it.
I just created an empty data set with 400,000 numeric variables.
I just created a data set with one million variables.
Good to know that the PDV permits more than 32K variables, but that still doesn't pin down the precise upper bound.
With all due respect, things have changed a lot in recent years wrt any 'norm' about the max number of features, e.g., algorithms with millions, even billions, of parameters are pretty routine in the ML community.
SAS Viya goes beyond some of the constraints of traditional SAS (SAS Release 9) also. If you want further guidance then tell us more about your use case.
@xtc283x wrote:
Good to know that the PDV permits more than 32K variables, but that still doesn't pin down the precise upper bound.
With all due respect, things have changed a lot in recent years wrt any 'norm' about the max number of features, e.g., algorithms with millions, even billions, of parameters are pretty routine in the ML community.
With all due respect, "parameters" are not the same as "variables". In Proc IML, a matrix could be considered a single "parameter" for some operations, with the matrix containing 1,000 rows of 1,000 variables (not stating that as any limit, just an example) for 1,000,000 values that something else might count as parameters. So the use case is somewhat important.
I killed my system when trying to create 4,000,000 variables (just for giggles) because after a few minutes it was still running. 1,000,000 variables took a bit over 4.5 seconds; 2,000,000 took about 1 minute and 50 seconds.
@ballardw Yes, the cost appears to be CPU-bound, and it grows exponentially rather than linearly, which might point to a defect in the underlying logic.
100k => 0.5 s
1m => 10 s
1.5m => 1 min
2m => 4 min
3m => 18 min
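For reference, a minimal sketch of how such a timing test can be run (the macro and data set names are my own illustration, not from the original posts; the log's real/cpu times give the measurement):

```sas
%macro makevars(n);
  data work.wide_&n;
    array x[&n];   /* declares &n numeric variables x1-x&n in the PDV */
    output;        /* write a single observation */
    stop;
  run;
%mend makevars;

%makevars(100000)  /* compare log timings for 100k, 1m, 2m, ... */
```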
For the mapping of variable names to PDV locations, the interpreter needs to build a search tree during data step compilation, and the cost of adding to such structures grows faster than linearly.
That's why temporary arrays are the fastest constructs in a data step: no individual names, addressing solely through index.
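To illustrate the temporary-array point, a minimal sketch (element count chosen arbitrarily): the elements have no individual names, so no name-to-PDV mapping is built for them.

```sas
data _null_;
  array t[1000000] _temporary_;  /* no named variables: pure index addressing */
  do i = 1 to dim(t);
    t[i] = i;                    /* fill each slot by index */
  end;
run;
```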
I asked tech support to have a look at this phenomenon. I'll report if anything interesting comes up.
Something interesting came out of the conversation with tech support.
The SAS interpreter guesses how many variables will be needed as output for the table when the output buffer (not the PDV: this data set output buffer creation logic is used by procedures as well) is created. The reason this estimation is necessary is that some procedures request way too many variables (as in: millions!).
When this estimated size is exceeded, the estimation process reruns for every variable added. There is no plan to fix this at the moment since this has never caused real-world issues. This explains the heavy CPU load.
Notes:
- In a data _null_ step, this issue does not appear, as no output buffer is created (but a PDV is).
- I created a table with 8 million variables; this takes about 8 hours and uses 4GB of RAM, which is the maximum I have access to.
- @SimonDawson can tell you more if you are interested.
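A sketch of the comparison implied by the first note, assuming an array is used simply to force the variable count (names here are my own illustration):

```sas
/* No output data set: only the PDV is built, so no output buffer estimation */
data _null_;
  array x[1000000];
  stop;
run;

/* With an output data set: the output buffer estimation logic kicks in */
data work.wide;
  array x[1000000];
  output;
  stop;
run;
```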
The key sentence here:
> since this has never caused real-world issues
Just saying...
Partly true; chicken and egg....
Having said that such large models are typically run in memory, so a fairer comparison would be to look at CAS data. Not too sure what's happening there.
I guess you are confusing the number of data items that a data or proc step can handle simultaneously with the need to define a name for each individual item.
Like other languages, SAS provides constructs for this. The fastest is a temporary array; hash objects are similarly quick, and are limited only by the memory available to the SAS session:
data _null_;
  length n value 8;
  declare hash h ();
  h.definekey("n");
  h.definedata("value");
  h.definedone();
  do n = 1 to 10000000;
    value = rand("uniform") * 1000;
    h.add();
  end;
run;
Log:
27   data _null_;
28   length n value 8;
29   declare hash h ();
30   h.definekey("n");
31   h.definedata("value");
32   h.definedone();
33   do n = 1 to 10000000;
34   value = rand("uniform") * 1000;
35   h.add();
36   end;
37   run;

NOTE: DATA statement used (total process time):
      real time           9.70 seconds
      cpu time            2.56 seconds
In just 10 seconds, the step created 20 million numeric values and built the search tree for one of them (the key).
This on a 2-core pSeries server with a MEMSIZE of 512M.
And there are many other ways of storing information that don't require data in that form, so there are further options, such as pairing SAS with Hadoop or other big data technologies. Data storage, analysis/processing, and modeling are not limited by the same things.