BookmarkSubscribeRSS Feed
gabriel_k
Obsidian | Level 7

Hello, 

 

I have large sas tables (100 mln lines) and up to 200 columns in each one. Could someone share a code that would produce a summary statistics on both categorical and numeric variables. 

 

I need something similar but more stats (count, min, max, avg, std, q1, q3...etc):

 

https://communities.sas.com/t5/SAS-Procedures/Count-missing-values-of-100-variables-in-with-column-o...

 

Thanks in advance. 

10 REPLIES 10
gabriel_k
Obsidian | Level 7

Is there other option? I get the insufficient memory error.

 

 

andreas_lds
Jade | Level 19

Afaik both procs are optimized to do the job, so you will hardly find anything more efficient.

 

EDIT: But, if you would post example input data and the required result, another approach could exist.

gabriel_k
Obsidian | Level 7

Just usual dataset with both numeric and character variables. The problem is that the tables are probably to large to run on a server set up. 

Kurt_Bremser
Super User

If you run out of memory, you need to sort the tables first by the class variables, so you can use "by" instead of "class" in proc summary. Running statistics for one group at a time (that's what by-processing does) needs next to no memory.

 

ballardw
Super User

@gabriel_k wrote:

Is there other option? I get the insufficient memory error.

 

 


1) Show your code, we don't have a clue what you actually did

 

2) Proc freq has the possibility of creating a LOT of output and the tables are built in memory for display in the results window.

    If there are variables that you really aren't interested such as a unique identifier variable, which is going to create one row of output for each value in proc freq, you might consider dropping them from the table by using (drop=variables) data set option with the proc statement data= .

 

You might want to send the output direct to a different ODS destination than results;

/* turn o

ods html close;

ods rtf file="<path>\summary.rtf";

proc freq data=have;

run;

proc means data=have;

run;

ods rtf close;

mkeintz
PROC Star

If you're getting a shortfall in memory, it's more likely due to frequency tabulations than the calculation of parametric statistics.  If so, then just do fewer variables at a time.

 

Since you have yet to show us the requested code that produced the problem, I will assume you are asking the stat procedure(s) to process a maximum number of variables, which is why I made the suggestion above.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
AhmedAl_Attar
Ammonite | Level 13

Hi @gabriel_k

 

SAS Enterprise Guide has that feature within the Characterize Data Wizard!
The Attached macro is a wrapped version the code generated by EG.

Here is how you can run it in your SAS session 😊:

%_EG_CHARACT_WRAPPER_ (p_inDsName=<LIB.>DataSet, p_inCatMaxLimit=30);  /* p_inCatMaxLimit: Maximum count for unique category value */

Note: For processing large SAS data sets, you can specify the following options on the SAS Command

-MEMSIZE 4G -SORTSIZE 2G

 

Enjoy,
Ahmed

ChrisNZ
Tourmaline | Level 20

What information do you want for the character variables?

ChrisNZ
Tourmaline | Level 20

Some statistics are very hungry when the cardinality is high. That's just the way it is.

Q1, median, etc potentially need all the values to be evaluated at the same time.

So consider whether you really need them.

 

Likewise for the number of distinct values. 

 

If you need them, then you just have to bite the bullet and spend the resources.

This may mean deriving them with smaller variable batches, and making more runs.

That's expensive as I said.

 

Giving SAS as much RAM as possible helps (options REALMEMSIZE for the SAS session, SUMSIZE for proc means, the confusingly named option UBUFSIZE -or BUFFERSIZE as it used to be called- for proc sql) if you have access to them. 

MEMSIZE includes paging space, which you have to be careful not to use for your purpose. It's much faster to run several batches than to page data.

 

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 2969 views
  • 2 likes
  • 7 in conversation