data_null__
Jade | Level 19

Maybe you saw this sasCommunity.org tip: HASH sort (sasCommunity.org tip of the day, 08OCT2013).

I applied CALL VNEXT to generalize it somewhat.


Peter_C
Rhodochrosite | Level 12

It is the roll-up of GBs of data that prompted my suggestion of running proc summary on blocks or subsets. Proc summary works largely in memory and reduces I/O on output, so it offers a trade-off, but its memory demand is an issue to manage. Of course a hash approach would be best if the result set could fit in available memory.
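
As a rough sketch of what such a hash roll-up could look like (not code from this thread): the example below uses sashelp.class as a stand-in table, with sex as the roll-up key and height as the variable being summed. The names sum_height and work.rollup are made up for illustration. It only pays off if one hash entry per key combination fits in memory.

```sas
/* Sketch only: roll up sashelp.class (stand-in data) by sex in one pass. */
data _null_ ;
  length sex $1 sum_height 8 ;
  declare hash h( ordered: 'a' ) ;       /* keep keys in ascending order */
  h.defineKey( 'sex' ) ;
  h.defineData( 'sex', 'sum_height' ) ;
  h.defineDone() ;
  do until ( eof ) ;
    set sashelp.class( keep= sex height ) end= eof ;
    if h.find() ne 0 then sum_height = 0 ;   /* first time this key is seen */
    sum_height = sum( sum_height, height ) ; /* sum() ignores missing values */
    h.replace() ;                            /* store the updated running total */
  end ;
  h.output( dataset: 'work.rollup' ) ;   /* hypothetical output name */
  stop ;
run ;
```

No sort is needed here, which is the attraction, but the whole result set lives in memory until the final output call.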

  

To start exploring the potential of all these alternatives, we would run an NLEVELS analysis of the sort keys or roll-up variables with proc freq.
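
A minimal sketch of that NLEVELS check, assuming the keys are the divcode and division variables and the table is the tlarge test data set from the code further down this thread. The ODS output name key_levels is made up for illustration.

```sas
/* Count the distinct values of each candidate key.                  */
/* NOPRINT on the TABLES statement suppresses the frequency tables,   */
/* so only the NLevels table is produced.                             */
ods output nlevels= key_levels ;   /* key_levels is a hypothetical name */
proc freq data= tlarge nlevels ;
  tables divcode division / noprint ;
run ;

proc print data= key_levels ;
run ;
```

Low level counts suggest the rolled-up result is small enough for a memory-based approach; counts close to the number of rows suggest it is not.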

peter

Peter_C
Rhodochrosite | Level 12

Here is some code to demonstrate the blocks suggestion I was making: a test run creates a 6GB SAS dataset, then sorts it in blocks, then does a roll-up (proc summary) in blocks of increasing size.

data tlarge ;
  set sampsio.empinfo ; * should be available, just try it ;
  do _n_ = 1 to 1e5 ; * write each input row 100,000 times ;
    output ;
  end ;
run ;

%let byv = divcode division ;

* one-level names now use USER, a concatenation of ./ and work ;
libname user ( './' work ) ;

* generate one proc sort per block of rows ;
data _null_ ;
  retain start 1 ;
  do exp = 4 to 8 ;
    block = 10 ** exp ; * last row of this block ;
    call execute( 'proc sort data= tlarge( firstobs=' !! put( start, 9.-L ) ) ;
    call execute( ' obs = ' !! put( block, 9.-L ) ) ;
    call execute( ') out=tlarg_' !! put( start, 9.-L ) ) ;
    call execute( "; by &byv ; run ;" ) ;
    start = block + 1 ; * next block starts after this one ;
  end ;
run ;

* generate one proc summary per block of rows ;
data _null_ ;
  retain start 1 ;
  do exp = 4 to 8 ;
    block = 10 ** exp ; * last row of this block ;
    call execute( 'proc summary data= tlarge( firstobs=' !! put( start, 9.-L ) ) ;
    call execute( ' obs = ' !! put( block, 9.-L ) ) ;
    call execute( ') missing noprint nway ; ' ) ;
    call execute( "class &byv ; var _numeric_ ;" ) ;
    call execute( 'output sum=  out= tlasum' !! put( start, 9.-L ) ) ;
    call execute( "; run ;" ) ;
    start = block + 1 ; * next block starts after this one ;
  end ;
run ;

jakarman
Barite | Level 11

The best approach has always been to make your data smaller without losing information, and that will remain true whenever you go beyond the comfort zone of your machine.

No machine will ever have unlimited speed or unlimited storage. Once you hit those limits, you need to think harder and find a cleverer approach to the data or your analyses.

@Gergely, your proposal comes back to that question. It is still an issue when the assumptions are not met. It also depends on how often it has to be executed.

A Hadoop-style approach demands a lot of storage, even more than just sorting. It is designed for mostly read-only access, with many duplicate copies of the data spread around.

---->-- ja karman --<-----
