BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
sebster24
Quartz | Level 8

Hello,

I am currently trying to compute Min/ Max using HASH table after using a "grouping" key. The reason for the adoption of hash table is because of the following reasons:

1) The dataset is not sorted.

2) It is a huge dataset (180 million observations);

3) Proc tabulate / proc means gives error stating "insufficient memory". It also takes a lot of time if the dataset is split.

 

The problem I am facing is as follows:

1) The program runs perfectly when defining "every" variable. For example, in this following code, I used table : "SASHELP.demographics" and used variable "TOTALFR" for min max calculation. I used key : "REGION". This provides the correct min max values:

***********************************************************************************************************;
/* THIS CODE WORKS; BUT ONLY ONE VARIABLE AT A TIME : THIS MEANS I WOULD NEED TO CREATE A LOOP */
***********************************************************************************************************;
data _null_;
     set sashelp.demographics end = eof;
     if _n_ = 1 then do;
          call missing(max_val,min_val);
          declare hash h(ordered:'a');
      	   rc = h.definekey('REGION');
     	    rc = h.definedata('REGION','variable','max_val','min_val');
			rc = h.definedone();
     end;
     rc = h.find();	 
variable="TOTALFR";

     if rc = 0 then do;
          if TOTAL_FR > max_val then do;
                max_val = TOTALFR;
                rc = h.replace();
          end;
          if TOTAL_FR < min_val then do;
                min_val = TOTALFR;
                rc = h.replace();
          end;
     end;
     else do;
          max_val = TOTALFR;
          min_val = TOTALFR;
          rc = h.add();
     end;
     if eof then
          rc = h.output(dataset:'WANT_THIS_WORKS');
run;

 

Now, if I trying to modify this same code to make sure that MIN MAX values are populated for all NUMERIC variables using arrays, the results yields incorrect results. What am I doing wrong? I used the following code:

***********************************************************************************************************;
/* THIS CODE DOES NOT WORK . THIS USES HASH TABLES BUT LOOPS THROUGH ALL NUMERIC VARIABLES USING ARRAYS */
***********************************************************************************************************;
data _null_;
    set sashelp.demographics end = eof;
    attrib variable format=$32.;
    if _n_ = 1 then do;
         call missing(max_val,min_val);
         declare hash h(ordered:'a');
         rc = h.definekey('REGION','variable');
         rc = h.definedata('REGION','variable','max_val','min_val');
         rc = h.definedone();
    end;
    rc = h.find();
    array Numvals _NUMERIC_;
    do i=1 to dim(numvals);
    variable=vname(numvals[i]);
         if rc = 0 then do;   
              if numvals[i] > max_val then do;
                    max_val = numvals[i];
                    rc = h.replace();
              end;
              if numvals[i] < min_val then do;
                    min_val = numvals[i];
                    rc = h.replace();
              end;

         end;
         else do;
              max_val = numvals[i];
              min_val = numvals[i];
              rc = h.add();
         end;
         if eof then do;      
              rc = h.output(dataset:'WANT_DOES_NOT_WORK (where=(upcase(strip(variable)) not IN ("MIN_VAL","MAX_VAL","RC"))');
         end;
    end;

run;

 

Please help.

regards,

S.

1 ACCEPTED SOLUTION

Accepted Solutions
KachiM
Rhodochrosite | Level 12

I do not have sashelp.demographics in my University Edition. Try sashelp.shoes which has 4 numeic variables. I just

adjusted the statement to fit into the right place. See whether this is what you need?

data _null_;
    set sashelp.shoes end = eof;
    array Numvals _NUMERIC_;
    attrib variable format=$32.;
    if _n_ = 1 then do;
         call missing(max_val,min_val);
         declare hash h(ordered:'a');
         rc = h.definekey('REGION','variable');
         rc = h.definedata('REGION','variable','max_val','min_val');
         rc = h.definedone();
    end;
    *rc = h.find();
    *array Numvals _NUMERIC_;
    do i=1 to dim(numvals);
    	
    variable=vname(numvals[i]);
        rc = h.find();
         if rc = 0 then do;   
              if numvals[i] > max_val then do;
                    max_val = numvals[i];
                    rc = h.replace();
              end;
              if numvals[i] < min_val then do;
                    min_val = numvals[i];
                    rc = h.replace();
              end;

         end;
         else do;
              max_val = numvals[i];
              min_val = numvals[i];
              rc = h.add();
         end;
         if eof then do;
         	rc = h.output(dataset:'out_01');    
              *rc = h.output(dataset:'"WANT_DOES_NOT_WORK||REGION" (where=(upcase(strip(variable)) not IN ("MIN_VAL","MAX_VAL","RC"))');
            stop;
         end;
    end;

run;
proc print data = out_01;
run;

View solution in original post

7 REPLIES 7
Kurt_Bremser
Super User

I actually guess that if you run out of memory with proc means and class, the same will happen with your hash table, as the memory requirements for both operations are roughly the same.

sebster24
Quartz | Level 8
I tried running the hash statement on the whole 180 million dataset using 3 definekey/grouping variables. No memory issues/ errors were noted and it ran much faster than proc tabulate/ means.
sebster24
Quartz | Level 8

Hello Kurt,

Thank you for your quick response.

The definekey variable can be 1 or 5. Depending on the dataset.

I tried earlier defining the variable as an error prior to executing the find() statement, but was not able to solve the problem. My code is as below:

data _null_;
	set sashelp.demographics(obs=10)end = eof;
	attrib variable format=$32.;
	if _n_ = 1 then do;
		call missing(max_val,min_val);
		declare hash h(ordered:'a');
		rc = h.definekey('REGION','variable');
		rc = h.definedata('REGION','variable','max_val','min_val');
		rc = h.definedone();
	end;
	array Numvals [*] TOTALFR popurban;
	do j=1 to dim(numvals);
		variable=vname(numvals[j]);
		rc = h.find();
		do i=1 to dim(numvals);
			if rc = 0 then do;
				if numvals[i] > max_val then do;
					max_val = numvals[i];
					rc = h.replace();
				end;
				if numvals[i] < min_val then do;
					min_val = numvals[i];
					rc = h.replace();
				end;
			end;
			else do;
				max_val = numvals[i];
				min_val = numvals[i];
				rc = h.add();
			end;
		end;
		if eof then do;
			rc = h.output(dataset:'WANT_DOES_NOT_WORK (where=(upcase(strip(variable)) not IN ("MIN_VAL","MAX_VAL","RC"))');
		end;
	end;
run;

 

I even defined "variable" as a separate array prior to running find().

 

KachiM
Rhodochrosite | Level 12

I do not have sashelp.demographics in my University Edition. Try sashelp.shoes which has 4 numeic variables. I just

adjusted the statement to fit into the right place. See whether this is what you need?

data _null_;
    set sashelp.shoes end = eof;
    array Numvals _NUMERIC_;
    attrib variable format=$32.;
    if _n_ = 1 then do;
         call missing(max_val,min_val);
         declare hash h(ordered:'a');
         rc = h.definekey('REGION','variable');
         rc = h.definedata('REGION','variable','max_val','min_val');
         rc = h.definedone();
    end;
    *rc = h.find();
    *array Numvals _NUMERIC_;
    do i=1 to dim(numvals);
    	
    variable=vname(numvals[i]);
        rc = h.find();
         if rc = 0 then do;   
              if numvals[i] > max_val then do;
                    max_val = numvals[i];
                    rc = h.replace();
              end;
              if numvals[i] < min_val then do;
                    min_val = numvals[i];
                    rc = h.replace();
              end;

         end;
         else do;
              max_val = numvals[i];
              min_val = numvals[i];
              rc = h.add();
         end;
         if eof then do;
         	rc = h.output(dataset:'out_01');    
              *rc = h.output(dataset:'"WANT_DOES_NOT_WORK||REGION" (where=(upcase(strip(variable)) not IN ("MIN_VAL","MAX_VAL","RC"))');
            stop;
         end;
    end;

run;
proc print data = out_01;
run;
rogerjdeangelis
Barite | Level 11
Hash vs Summary Min and max for 4 variables by region for 189,000,000 Obs

HASH            237 seconds  (does not use threads)
Summary(class)   55 seconds
Summary(by)      46 seconds

I did not make sure caches were empty and I did not run batch.
I did make mutiple runs in different order.

Terribly I/O bound?

May be able to cut in half (23 seconds)

HAVE
====

Up to 40 obs from wre.wre180 total obs=179,725,000

Obs    REGION  STORES   SALES INVENTORY RETURNS

  1    Africa      12   29761    191821     769
  2    Africa      12   29761    191821     769
  3    Africa      12   29761    191821     769
  4    Africa      12   29761    191821     769
  5    Africa      12   29761    191821     769

REGION
-------------------------
Africa
Asia
Canada
Central America/Caribbean
Eastern Europe
Middle East
Pacific
South America
United States
Western Europe


WANT

Up to 40 obs from WORK.OUT_01 total obs=40

Obs    REGION   VARIABLE      MAX_VAL  MIN_VAL

  1    Africa   INVENTORY     1063251     3247
  2    Africa   RETURNS         10124       29
  3    Africa   SALES          360209      801
  4    Africa   STORES             25        1

  5    Asia     INVENTORY      469007      455
  6    Asia     RETURNS          2941       10
  7    Asia     SALES          149013      937
  8    Asia     STORES             21        1
....


* create 180 million observations;
data wre.wre180(bufsize=128k bufno=500 keep=region inventory returns sales stores);
  set sashelp.shoes;
  do i=1 to 455000;
    output;
  end;
run;quit;

*____
| __ ) _   _
|  _ \| | | |
| |_) | |_| |
|____/ \__, |
       |___/
;

proc summary data=wre.wre180 max min nway;
by region;
var inventory returns sales stores;
output out=maxmin(drop=_:)
    max=max_inventory max_returns max_sales max_stores
    min=min_inventory min_returns min_sales stores
;
run;quit;

NOTE: There were 179725000 observations read from the data set WRE.WRE180.
NOTE: The data set WORK.MAXMIN has 10 observations and 9 variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           46.31 seconds
      user cpu time       30.45 seconds
      system cpu time     15.78 seconds
      memory              5245.25k
      OS Memory           69824.00k
      Timestamp           02/17/2017 08:04:25 PM
      Step Count                        54  Switch Count  1

* ____ _
 / ___| | __ _ ___ ___
| |   | |/ _` / __/ __|
| |___| | (_| \__ \__ \
 \____|_|\__,_|___/___/

;

proc summary data=wre.wre180 max min nway;
class region;
var inventory returns sales stores;
output out=maxmin(drop=_:)
    max=max_inventory max_returns max_sales max_stores
    min=min_inventory min_returns min_sales stores
;
run;quit;

NOTE: There were 179725000 observations read from the data set WRE.WRE180.
NOTE: The data set WORK.MAXMIN has 11 observations and 9 variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           55.32 seconds
      user cpu time       1:44.84
      system cpu time     21.73 seconds
      memory              10471.93k
      OS Memory           72648.00k
      Timestamp           02/17/2017 07:39:39 PM
      Step Count                        37  Switch Count  0

*_   _           _
| | | | __ _ ___| |__
| |_| |/ _` / __| '_ \
|  _  | (_| \__ \ | | |
|_| |_|\__,_|___/_| |_|

;


data _null_;
    set wre.wre180 end = eof;
    array Numvals _NUMERIC_;
    attrib variable format=$32.;
    if _n_ = 1 then do;
         call missing(max_val,min_val);
         declare hash h(ordered:'a');
         rc = h.definekey('REGION','variable');
         rc = h.definedata('REGION','variable','max_val','min_val');
         rc = h.definedone();
    end;
    *rc = h.find();
    *array Numvals _NUMERIC_;
    do i=1 to dim(numvals);

    variable=vname(numvals[i]);
        rc = h.find();
         if rc = 0 then do;
              if numvals[i] > max_val then do;
                    max_val = numvals[i];
                    rc = h.replace();
              end;
              if numvals[i] < min_val then do;
                    min_val = numvals[i];
                    rc = h.replace();
              end;

         end;
         else do;
              max_val = numvals[i];
              min_val = numvals[i];
              rc = h.add();
         end;
         if eof then do;
               rc = h.output(dataset:'out_01');
              *rc = h.output(dataset:'"WANT_DOES_NOT_WORK||REGION" (where=(upcase(strip(variable)) not IN ("MIN_VAL","MAX_VAL","RC"))');
            stop;

         end;
    end;

run;
proc print data = out_01;
run;

NOTE: The data set WORK.OUT_01 has 40 observations and 4 variables.
NOTE: There were 179725000 observations read from the data set WRE.WRE180.
NOTE: DATA statement used (Total process time):
      real time           3:57.20
      user cpu time       3:41.56
      system cpu time     15.46 seconds
      memory              4190.84k
      OS Memory           67508.00k
      Timestamp           02/17/2017 07:45:22 PM
      Step Count                        38  Switch Count  8


sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 3099 views
  • 2 likes
  • 4 in conversation