Historically, the SAS PDV only permitted about 32k variables. Has this changed in recent years? If so, what is the current maximum number of variables allowed by the PDV?
If you get close to running out of variable slots, you should seriously reconsider what you are doing and how you are doing it.
I just created an empty data set with 400,000 numeric variables.
I just created a data set with one million variables.
Good to know that the PDV permits more than 32K variables, but that still doesn't pin down the precise upper bound.
With all due respect, things have changed a lot in recent years wrt any 'norm' about the max number of features, e.g., algorithms with millions, even billions, of parameters are pretty routine in the ML community.
SAS Viya goes beyond some of the constraints of traditional SAS (SAS Release 9) also. If you want further guidance then tell us more about your use case.
@xtc283x wrote:
Good to know that the PDV permits more than 32K variables, but that still doesn't pin down the precise upper bound.
With all due respect, things have changed a lot in recent years wrt any 'norm' about the max number of features, e.g., algorithms with millions, even billions, of parameters are pretty routine in the ML community.
With all due respect, "parameters" are not the same as "variables". In Proc IML, a matrix could be considered a single "parameter" for some operations, with the matrix containing 1,000 rows of 1,000 variables (not stating that as any limit, just an example) for 1,000,000 values that something else might count as parameters. So the use case is somewhat important.
I killed my system when trying to create 4,000,000 variables (just for giggles) because after a few minutes it was still running. 1,000,000 variables took a bit over 4.5 seconds; 2,000,000 took about 1 minute and 50 seconds.
@ballardw Yes, the cost appears to be CPU-bound, and it grows exponentially rather than linearly, which might point to a defect in the underlying logic.
100k => 0.5 s
1m => 10 s
1.5m => 1 min
2m => 4 min
3m => 18 min
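For reference, a minimal sketch of how such a timing test can be run (the macro and data set names are my own illustration, not from the original posts; the log's real/cpu times give the measurement):

```sas
%macro makevars(n);
  data work.wide_&n;
    array x[&n];   /* declares &n numeric variables x1-x&n in the PDV */
    output;        /* write a single observation */
    stop;
  run;
%mend makevars;

%makevars(100000)  /* compare log timings for 100k, 1m, 2m, ... */
```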
For the mapping of variable names to PDV locations, the interpreter needs to build a search tree during data step compilation, and the cost of adding to such structures grows faster than linearly.
That's why temporary arrays are the fastest constructs in a data step: no individual names, addressing solely through index.
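To illustrate the temporary-array point, a minimal sketch (element count chosen arbitrarily): the elements have no individual names, so no name-to-PDV mapping is built for them.

```sas
data _null_;
  array t[1000000] _temporary_;  /* no named variables: pure index addressing */
  do i = 1 to dim(t);
    t[i] = i;                    /* fill each slot by index */
  end;
run;
```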
I asked tech support to have a look at this phenomenon. I'll report if anything interesting comes up.
Something interesting came out of the conversation with tech support.
The SAS interpreter guesses how many variables will be needed as output for the table when the output buffer (not the PDV: this data set output buffer creation logic is used by procedures as well) is created. The reason this estimation is necessary is that some procedures request way too many variables (as in: millions!).
When this estimated size is exceeded, the estimation process reruns for every variable added. There is no plan to fix this at the moment since this has never caused real-world issues. This explains the heavy CPU load.
Notes:
- In a data _null_ step, this issue does not appear, as no output buffer is created (but a PDV is).
- I created a table with 8 million variables; this takes about 8 hours and uses 4GB of RAM, which is the maximum I have access to.
- @SimonDawson can tell you more if you are interested.
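A sketch of the comparison implied by the first note, assuming an array is used simply to force the variable count (names here are my own illustration):

```sas
/* No output data set: only the PDV is built, so no output buffer estimation */
data _null_;
  array x[1000000];
  stop;
run;

/* With an output data set: the output buffer estimation logic kicks in */
data work.wide;
  array x[1000000];
  output;
  stop;
run;
```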
The key sentence here:
> since this has never caused real-world issues
Just saying...
Partly true; chicken and egg....
Having said that such large models are typically run in memory, so a fairer comparison would be to look at CAS data. Not too sure what's happening there.
I guess you are confusing the number of data items that a data or proc step can handle simultaneously with the need to define a name for each individual item.
Like other languages, SAS provides constructs for this. The fastest is a temporary array; hash objects are similarly quick, and are limited only by the memory available to the SAS session:
data _null_;
  length n value 8;
  declare hash h ();
  h.definekey("n");
  h.definedata("value");
  h.definedone();
  do n = 1 to 10000000;
    value = rand("uniform") * 1000;
    h.add();
  end;
run;
Log:
27   data _null_;
28   length n value 8;
29   declare hash h ();
30   h.definekey("n");
31   h.definedata("value");
32   h.definedone();
33   do n = 1 to 10000000;
34   value = rand("uniform") * 1000;
35   h.add();
36   end;
37   run;

NOTE: DATA statement used (total process time):
      real time           9.70 seconds
      cpu time            2.56 seconds
In just 10 seconds, the step created 20 million numeric values and built the search tree for one of them (the key).
This on a 2-core pSeries server with a MEMSIZE of 512M.
And there are many other ways of storing information that don't require data in that form, so there are further options, such as pairing SAS with Hadoop or other big data technologies. Data storage, analysis/processing, and modeling are not limited by the same things.