xtc283x
Quartz | Level 8

Historically, the SAS PDV only permitted about 32k variables. Has this changed in recent years? If so, what is the current maximum number of variables allowed by the PDV?

ballardw
Super User

If you get close to running out of variable allocations, you should seriously reconsider what you are doing and how you are doing it.

 

I just created an empty data set with 400,000 numeric variables.
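For what it's worth, a minimal sketch of how such an empty data set might be created (the data set name WANT and the array name X are just placeholders):

data want;
  array x(400000);        /* creates numeric variables x1-x400000 */
  call missing(of x(*));  /* initialise them so the log is not flooded with uninitialized notes */
  stop;                   /* stop before the implicit OUTPUT, so zero observations are written */
run;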

ChrisNZ
Tourmaline | Level 20

I just created a data set with one million variables.

xtc283x
Quartz | Level 8

Good to know that the PDV permits more than 32k variables, but that still doesn't pin down the precise upper bound.

With all due respect, things have changed a lot in recent years with respect to any 'norm' about the maximum number of features; e.g., algorithms with millions, even billions, of parameters are pretty routine in the ML community.

SASKiwi
PROC Star

SAS Viya also goes beyond some of the constraints of traditional SAS (SAS 9). If you want further guidance, tell us more about your use case.

ballardw
Super User

@xtc283x wrote:

Good to know that the PDV permits more than 32k variables, but that still doesn't pin down the precise upper bound.

With all due respect, things have changed a lot in recent years with respect to any 'norm' about the maximum number of features; e.g., algorithms with millions, even billions, of parameters are pretty routine in the ML community.


 With all due respect "parameters" are not the same as "variables". Proc Iml and matrix could be considered a single "parameter" for some operations, with the matrix containing 1000 rows of 1000 variables (not stating that as any limit but an example) for 1,000,000 values that could be considered parameters by something else. So the use case is somewhat important.

 

I killed my system when trying to create 4,000,000 variables (just for giggles) because after a few minutes it was still running. 1,000,000 variables took a bit over 4.5 seconds; 2,000,000 took about 1 minute and 50 seconds.

ChrisNZ
Tourmaline | Level 20

@ballardw Yes, the cost seems to be CPU-bound and to grow exponentially rather than linearly, which might point to a defect in the underlying logic.

- 100k => 0.5 s
- 1m => 10 s
- 1.5m => 1 min
- 2m => 4 min
- 3m => 18 min
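The exact step behind these timings isn't shown; a minimal way to reproduce the pattern could be a small macro along these lines (the macro name MAKE_VARS, the data set prefix TEST and the variable prefix V are made up):

%macro make_vars(n);
  data test&n.;
    array v(&n.);          /* creates numeric variables v1-v&n */
    call missing(of v(*));
    stop;                  /* keep the data set empty; only the creation time matters here */
  run;
%mend make_vars;

%make_vars(100000)
%make_vars(1000000)        /* compare the real/cpu times reported in the log for each size */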

Kurt_Bremser
Super User

For the mapping of variable names to PDV locations, the interpreter needs to build a search tree during data step compilation, and the cost of adding to such a structure always grows exponentially.

 

That's why temporary arrays are the fastest constructs in a data step: no individual variable names, and addressing is done solely through the index.
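A minimal sketch of that contrast (names are arbitrary): a temporary array allocates its elements in memory without creating named PDV variables, so there are no names to add to the lookup structure.

data _null_;
  array t(1000000) _temporary_;  /* one million slots, no variable names created */
  do i = 1 to dim(t);
    t(i) = i;                    /* elements are addressed purely by index */
  end;
run;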

ChrisNZ
Tourmaline | Level 20

I asked tech support to have a look at this phenomenon. I'll report if anything interesting comes up.

ChrisNZ
Tourmaline | Level 20

Something interesting came out of the conversation with tech support.

 

When the output buffer is created (not the PDV: this data set output buffer creation logic is used by procedures as well), the SAS interpreter estimates how many variables the output table will need. This estimation is necessary because some procedures request far too many variables (as in: millions!).

 

When this estimated size is exceeded, the estimation process reruns for every variable added. There is no plan to fix this at the moment since this has never caused real-world issues. This explains the heavy CPU load.

 

Notes:

- In a data _null_ step, this issue does not appear, as no output buffer is created (though a PDV still is); see the sketch after these notes.

- I created a table with 8 million variables; this takes about 8 hours and uses 4GB of RAM, which is the maximum I have access to.

- @SimonDawson can tell you more if you are interested.
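A rough way to see the difference described in the first note (the data set and array names are arbitrary, and the exact figures will depend on the machine):

data _null_;                 /* only a PDV is built: comparatively quick */
  array v(2000000);
  call missing(of v(*));
run;

data lots_of_vars;           /* an output buffer must be sized for 2,000,000 variables: much slower */
  array v(2000000);
  call missing(of v(*));
  stop;
run;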

 

 

xtc283x
Quartz | Level 8
The absence of 'real-world issues' in the SAS world may be less about a lack of demand for millions (or billions) of features, parameters, or variables, since Google engineers routinely run algorithms of that magnitude, and more about SAS's inability to deliver and perform at that level.
Just saying...
ChrisNZ
Tourmaline | Level 20

> Just saying...

Partly true, chicken and eggs....

Having said that, such large models are typically run in memory, so a fairer comparison would be to look at CAS data. I'm not too sure what's happening there.

 

Kurt_Bremser
Super User

I think you are confusing the number of data items that a data or proc step can handle simultaneously with the need to define a name for each individual item.

Like other languages, SAS provides constructs for this. The fastest is a temporary array; hash objects are similarly quick and limited only by the memory available to the SAS session:

data _null_;
length n value 8;
declare hash h ();
h.definekey("n");
h.definedata("value");
h.definedone();
do n = 1 to 10000000;
  value = rand("uniform") * 1000;
  h.add();
end;
run;

Log:

27         data _null_;
28         length n value 8;
29         declare hash h ();
30         h.definekey("n");
31         h.definedata("value");
32         h.definedone();
33         do n = 1 to 10000000;
34           value = rand("uniform") * 1000;
35           h.add();
36         end;
37         run;

NOTE: DATA statement used (Total process time):
      real time           9.70 seconds
      cpu time            2.56 seconds
      

In just 10 seconds, the step created 20 million numeric values and built the search tree over one of the two variables (the key).

This was on a 2-core pSeries server with a MEMSIZE of 512M.

 

Reeza
Super User

And there are many other ways of storing information that don't require data in that form, so there are other options, such as pairing SAS with Hadoop or other big-data technologies. Data storage, analysis/processing, and modeling are not limited by the same things.

 


@xtc283x wrote:

Good to know that the PDV permits more than 32k variables, but that still doesn't pin down the precise upper bound.

With all due respect, things have changed a lot in recent years with respect to any 'norm' about the maximum number of features; e.g., algorithms with millions, even billions, of parameters are pretty routine in the ML community.


 

