DATA Step, Macro, Functions and more

Using HASH tables to find min max after applying multiple grouping/key

Accepted Solution Solved
Reply
Contributor
Posts: 44
Accepted Solution

Using HASH tables to find min max after applying multiple grouping/key

Hello,

I am currently trying to compute Min/ Max using HASH table after using a "grouping" key. The reason for the adoption of hash table is because of the following reasons:

1) The dataset is not sorted.

2) It is a huge dataset (180 million observations);

3) Proc tabulate / proc means gives error stating "insufficient memory". It also takes a lot of time if the dataset is split.

 

The problem I am facing is as follows:

1) The program runs perfectly when defining "every" variable. For example, in this following code, I used table : "SASHELP.demographics" and used variable "TOTALFR" for min max calculation. I used key : "REGION". This provides the correct min max values:

***********************************************************************************************************;
/* THIS CODE WORKS; BUT ONLY ONE VARIABLE AT A TIME : THIS MEANS I WOULD NEED TO CREATE A LOOP */
***********************************************************************************************************;
data _null_;
     set sashelp.demographics end = eof;
     if _n_ = 1 then do;
          call missing(max_val,min_val);
          declare hash h(ordered:'a');
      	   rc = h.definekey('REGION');
     	    rc = h.definedata('REGION','variable','max_val','min_val');
			rc = h.definedone();
     end;
     rc = h.find();	 
variable="TOTALFR";

     if rc = 0 then do;
          if TOTAL_FR > max_val then do;
                max_val = TOTALFR;
                rc = h.replace();
          end;
          if TOTAL_FR < min_val then do;
                min_val = TOTALFR;
                rc = h.replace();
          end;
     end;
     else do;
          max_val = TOTALFR;
          min_val = TOTALFR;
          rc = h.add();
     end;
     if eof then
          rc = h.output(dataset:'WANT_THIS_WORKS');
run;

 

Now, if I trying to modify this same code to make sure that MIN MAX values are populated for all NUMERIC variables using arrays, the results yields incorrect results. What am I doing wrong? I used the following code:

***********************************************************************************************************;
/* THIS CODE DOES NOT WORK . THIS USES HASH TABLES BUT LOOPS THROUGH ALL NUMERIC VARIABLES USING ARRAYS */
***********************************************************************************************************;
data _null_;
    set sashelp.demographics end = eof;
    attrib variable format=$32.;
    if _n_ = 1 then do;
         call missing(max_val,min_val);
         declare hash h(ordered:'a');
         rc = h.definekey('REGION','variable');
         rc = h.definedata('REGION','variable','max_val','min_val');
         rc = h.definedone();
    end;
    rc = h.find();
    array Numvals _NUMERIC_;
    do i=1 to dim(numvals);
    variable=vname(numvals[i]);
         if rc = 0 then do;   
              if numvals[i] > max_val then do;
                    max_val = numvals[i];
                    rc = h.replace();
              end;
              if numvals[i] < min_val then do;
                    min_val = numvals[i];
                    rc = h.replace();
              end;

         end;
         else do;
              max_val = numvals[i];
              min_val = numvals[i];
              rc = h.add();
         end;
         if eof then do;      
              rc = h.output(dataset:'WANT_DOES_NOT_WORK (where=(upcase(strip(variable)) not IN ("MIN_VAL","MAX_VAL","RC"))');
         end;
    end;

run;

 

Please help.

regards,

S.


Accepted Solutions
Solution
‎02-17-2017 08:05 AM
Super Contributor
Posts: 254

Re: Using HASH tables to find min max after applying multiple grouping/key

I do not have sashelp.demographics in my University Edition. Try sashelp.shoes which has 4 numeic variables. I just

adjusted the statement to fit into the right place. See whether this is what you need?

data _null_;
    set sashelp.shoes end = eof;
    array Numvals _NUMERIC_;
    attrib variable format=$32.;
    if _n_ = 1 then do;
         call missing(max_val,min_val);
         declare hash h(ordered:'a');
         rc = h.definekey('REGION','variable');
         rc = h.definedata('REGION','variable','max_val','min_val');
         rc = h.definedone();
    end;
    *rc = h.find();
    *array Numvals _NUMERIC_;
    do i=1 to dim(numvals);
    	
    variable=vname(numvals[i]);
        rc = h.find();
         if rc = 0 then do;   
              if numvals[i] > max_val then do;
                    max_val = numvals[i];
                    rc = h.replace();
              end;
              if numvals[i] < min_val then do;
                    min_val = numvals[i];
                    rc = h.replace();
              end;

         end;
         else do;
              max_val = numvals[i];
              min_val = numvals[i];
              rc = h.add();
         end;
         if eof then do;
         	rc = h.output(dataset:'out_01');    
              *rc = h.output(dataset:'"WANT_DOES_NOT_WORK||REGION" (where=(upcase(strip(variable)) not IN ("MIN_VAL","MAX_VAL","RC"))');
            stop;
         end;
    end;

run;
proc print data = out_01;
run;

View solution in original post


All Replies
Super User
Posts: 6,933

Re: Using HASH tables to find min max after applying multiple grouping/key

[ Edited ]

You do your find() before you set variable=

 

How many "regions" (or whatever is the group key in your original dataset) do you have?

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Posts: 6,933

Re: Using HASH tables to find min max after applying multiple grouping/key

I actually guess that if you run out of memory with proc means and class, the same will happen with your hash table, as the memory requirements for both operations are roughly the same.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Contributor
Posts: 44

Re: Using HASH tables to find min max after applying multiple grouping/key

I tried running the hash statement on the whole 180 million dataset using 3 definekey/grouping variables. No memory issues/ errors were noted and it ran much faster than proc tabulate/ means.
Contributor
Posts: 44

Re: Using HASH tables to find min max after applying multiple grouping/key

[ Edited ]

Hello Kurt,

Thank you for your quick response.

The definekey variable can be 1 or 5. Depending on the dataset.

I tried earlier defining the variable as an error prior to executing the find() statement, but was not able to solve the problem. My code is as below:

data _null_;
	set sashelp.demographics(obs=10)end = eof;
	attrib variable format=$32.;
	if _n_ = 1 then do;
		call missing(max_val,min_val);
		declare hash h(ordered:'a');
		rc = h.definekey('REGION','variable');
		rc = h.definedata('REGION','variable','max_val','min_val');
		rc = h.definedone();
	end;
	array Numvals [*] TOTALFR popurban;
	do j=1 to dim(numvals);
		variable=vname(numvals[j]);
		rc = h.find();
		do i=1 to dim(numvals);
			if rc = 0 then do;
				if numvals[i] > max_val then do;
					max_val = numvals[i];
					rc = h.replace();
				end;
				if numvals[i] < min_val then do;
					min_val = numvals[i];
					rc = h.replace();
				end;
			end;
			else do;
				max_val = numvals[i];
				min_val = numvals[i];
				rc = h.add();
			end;
		end;
		if eof then do;
			rc = h.output(dataset:'WANT_DOES_NOT_WORK (where=(upcase(strip(variable)) not IN ("MIN_VAL","MAX_VAL","RC"))');
		end;
	end;
run;

 

I even defined "variable" as a separate array prior to running find().

 

Solution
‎02-17-2017 08:05 AM
Super Contributor
Posts: 254

Re: Using HASH tables to find min max after applying multiple grouping/key

I do not have sashelp.demographics in my University Edition. Try sashelp.shoes which has 4 numeic variables. I just

adjusted the statement to fit into the right place. See whether this is what you need?

data _null_;
    set sashelp.shoes end = eof;
    array Numvals _NUMERIC_;
    attrib variable format=$32.;
    if _n_ = 1 then do;
         call missing(max_val,min_val);
         declare hash h(ordered:'a');
         rc = h.definekey('REGION','variable');
         rc = h.definedata('REGION','variable','max_val','min_val');
         rc = h.definedone();
    end;
    *rc = h.find();
    *array Numvals _NUMERIC_;
    do i=1 to dim(numvals);
    	
    variable=vname(numvals[i]);
        rc = h.find();
         if rc = 0 then do;   
              if numvals[i] > max_val then do;
                    max_val = numvals[i];
                    rc = h.replace();
              end;
              if numvals[i] < min_val then do;
                    min_val = numvals[i];
                    rc = h.replace();
              end;

         end;
         else do;
              max_val = numvals[i];
              min_val = numvals[i];
              rc = h.add();
         end;
         if eof then do;
         	rc = h.output(dataset:'out_01');    
              *rc = h.output(dataset:'"WANT_DOES_NOT_WORK||REGION" (where=(upcase(strip(variable)) not IN ("MIN_VAL","MAX_VAL","RC"))');
            stop;
         end;
    end;

run;
proc print data = out_01;
run;
Valued Guide
Posts: 505

Re: Using HASH tables to find min max after applying multiple grouping/key

Hash vs Summary Min and max for 4 variables by region for 189,000,000 Obs

HASH            237 seconds  (does not use threads)
Summary(class)   55 seconds
Summary(by)      46 seconds

I did not make sure caches were empty and I did not run batch.
I did make mutiple runs in different order.

Terribly I/O bound?

May be able to cut in half (23 seconds)

HAVE
====

Up to 40 obs from wre.wre180 total obs=179,725,000

Obs    REGION  STORES   SALES INVENTORY RETURNS

  1    Africa      12   29761    191821     769
  2    Africa      12   29761    191821     769
  3    Africa      12   29761    191821     769
  4    Africa      12   29761    191821     769
  5    Africa      12   29761    191821     769

REGION
-------------------------
Africa
Asia
Canada
Central America/Caribbean
Eastern Europe
Middle East
Pacific
South America
United States
Western Europe


WANT

Up to 40 obs from WORK.OUT_01 total obs=40

Obs    REGION   VARIABLE      MAX_VAL  MIN_VAL

  1    Africa   INVENTORY     1063251     3247
  2    Africa   RETURNS         10124       29
  3    Africa   SALES          360209      801
  4    Africa   STORES             25        1

  5    Asia     INVENTORY      469007      455
  6    Asia     RETURNS          2941       10
  7    Asia     SALES          149013      937
  8    Asia     STORES             21        1
....


* create 180 million observations;
data wre.wre180(bufsize=128k bufno=500 keep=region inventory returns sales stores);
  set sashelp.shoes;
  do i=1 to 455000;
    output;
  end;
run;quit;

*____
| __ ) _   _
|  _ \| | | |
| |_) | |_| |
|____/ \__, |
       |___/
;

proc summary data=wre.wre180 max min nway;
by region;
var inventory returns sales stores;
output out=maxmin(drop=_:)
    max=max_inventory max_returns max_sales max_stores
    min=min_inventory min_returns min_sales stores
;
run;quit;

NOTE: There were 179725000 observations read from the data set WRE.WRE180.
NOTE: The data set WORK.MAXMIN has 10 observations and 9 variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           46.31 seconds
      user cpu time       30.45 seconds
      system cpu time     15.78 seconds
      memory              5245.25k
      OS Memory           69824.00k
      Timestamp           02/17/2017 08:04:25 PM
      Step Count                        54  Switch Count  1

* ____ _
 / ___| | __ _ ___ ___
| |   | |/ _` / __/ __|
| |___| | (_| \__ \__ \
 \____|_|\__,_|___/___/

;

proc summary data=wre.wre180 max min nway;
class region;
var inventory returns sales stores;
output out=maxmin(drop=_:)
    max=max_inventory max_returns max_sales max_stores
    min=min_inventory min_returns min_sales stores
;
run;quit;

NOTE: There were 179725000 observations read from the data set WRE.WRE180.
NOTE: The data set WORK.MAXMIN has 11 observations and 9 variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           55.32 seconds
      user cpu time       1:44.84
      system cpu time     21.73 seconds
      memory              10471.93k
      OS Memory           72648.00k
      Timestamp           02/17/2017 07:39:39 PM
      Step Count                        37  Switch Count  0

*_   _           _
| | | | __ _ ___| |__
| |_| |/ _` / __| '_ \
|  _  | (_| \__ \ | | |
|_| |_|\__,_|___/_| |_|

;


data _null_;
    set wre.wre180 end = eof;
    array Numvals _NUMERIC_;
    attrib variable format=$32.;
    if _n_ = 1 then do;
         call missing(max_val,min_val);
         declare hash h(ordered:'a');
         rc = h.definekey('REGION','variable');
         rc = h.definedata('REGION','variable','max_val','min_val');
         rc = h.definedone();
    end;
    *rc = h.find();
    *array Numvals _NUMERIC_;
    do i=1 to dim(numvals);

    variable=vname(numvals[i]);
        rc = h.find();
         if rc = 0 then do;
              if numvals[i] > max_val then do;
                    max_val = numvals[i];
                    rc = h.replace();
              end;
              if numvals[i] < min_val then do;
                    min_val = numvals[i];
                    rc = h.replace();
              end;

         end;
         else do;
              max_val = numvals[i];
              min_val = numvals[i];
              rc = h.add();
         end;
         if eof then do;
               rc = h.output(dataset:'out_01');
              *rc = h.output(dataset:'"WANT_DOES_NOT_WORK||REGION" (where=(upcase(strip(variable)) not IN ("MIN_VAL","MAX_VAL","RC"))');
            stop;

         end;
    end;

run;
proc print data = out_01;
run;

NOTE: The data set WORK.OUT_01 has 40 observations and 4 variables.
NOTE: There were 179725000 observations read from the data set WRE.WRE180.
NOTE: DATA statement used (Total process time):
      real time           3:57.20
      user cpu time       3:41.56
      system cpu time     15.46 seconds
      memory              4190.84k
      OS Memory           67508.00k
      Timestamp           02/17/2017 07:45:22 PM
      Step Count                        38  Switch Count  8


Super User
Posts: 6,933

Re: Using HASH tables to find min max after applying multiple grouping/key

Very nice work, @rogerjdeangelis!

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 414 views
  • 2 likes
  • 4 in conversation