Hello,
I am currently trying to compute Min/ Max using HASH table after using a "grouping" key. The reason for the adoption of hash table is because of the following reasons:
1) The dataset is not sorted.
2) It is a huge dataset (180 million observations);
3) Proc tabulate / proc means gives error stating "insufficient memory". It also takes a lot of time if the dataset is split.
The problem I am facing is as follows:
1) The program runs perfectly when defining "every" variable. For example, in this following code, I used table : "SASHELP.demographics" and used variable "TOTALFR" for min max calculation. I used key : "REGION". This provides the correct min max values:
***********************************************************************************************************; /* THIS CODE WORKS; BUT ONLY ONE VARIABLE AT A TIME : THIS MEANS I WOULD NEED TO CREATE A LOOP */ ***********************************************************************************************************; data _null_; set sashelp.demographics end = eof; if _n_ = 1 then do; call missing(max_val,min_val); declare hash h(ordered:'a'); rc = h.definekey('REGION'); rc = h.definedata('REGION','variable','max_val','min_val'); rc = h.definedone(); end; rc = h.find(); variable="TOTALFR"; if rc = 0 then do; if TOTAL_FR > max_val then do; max_val = TOTALFR; rc = h.replace(); end; if TOTAL_FR < min_val then do; min_val = TOTALFR; rc = h.replace(); end; end; else do; max_val = TOTALFR; min_val = TOTALFR; rc = h.add(); end; if eof then rc = h.output(dataset:'WANT_THIS_WORKS'); run;
Now, if I trying to modify this same code to make sure that MIN MAX values are populated for all NUMERIC variables using arrays, the results yields incorrect results. What am I doing wrong? I used the following code:
***********************************************************************************************************; /* THIS CODE DOES NOT WORK . THIS USES HASH TABLES BUT LOOPS THROUGH ALL NUMERIC VARIABLES USING ARRAYS */ ***********************************************************************************************************; data _null_; set sashelp.demographics end = eof; attrib variable format=$32.; if _n_ = 1 then do; call missing(max_val,min_val); declare hash h(ordered:'a'); rc = h.definekey('REGION','variable'); rc = h.definedata('REGION','variable','max_val','min_val'); rc = h.definedone(); end; rc = h.find(); array Numvals _NUMERIC_; do i=1 to dim(numvals); variable=vname(numvals[i]); if rc = 0 then do; if numvals[i] > max_val then do; max_val = numvals[i]; rc = h.replace(); end; if numvals[i] < min_val then do; min_val = numvals[i]; rc = h.replace(); end; end; else do; max_val = numvals[i]; min_val = numvals[i]; rc = h.add(); end; if eof then do; rc = h.output(dataset:'WANT_DOES_NOT_WORK (where=(upcase(strip(variable)) not IN ("MIN_VAL","MAX_VAL","RC"))'); end; end; run;
Please help.
regards,
S.
I do not have sashelp.demographics in my University Edition. Try sashelp.shoes which has 4 numeic variables. I just
adjusted the statement to fit into the right place. See whether this is what you need?
data _null_;
set sashelp.shoes end = eof;
array Numvals _NUMERIC_;
attrib variable format=$32.;
if _n_ = 1 then do;
call missing(max_val,min_val);
declare hash h(ordered:'a');
rc = h.definekey('REGION','variable');
rc = h.definedata('REGION','variable','max_val','min_val');
rc = h.definedone();
end;
*rc = h.find();
*array Numvals _NUMERIC_;
do i=1 to dim(numvals);
variable=vname(numvals[i]);
rc = h.find();
if rc = 0 then do;
if numvals[i] > max_val then do;
max_val = numvals[i];
rc = h.replace();
end;
if numvals[i] < min_val then do;
min_val = numvals[i];
rc = h.replace();
end;
end;
else do;
max_val = numvals[i];
min_val = numvals[i];
rc = h.add();
end;
if eof then do;
rc = h.output(dataset:'out_01');
*rc = h.output(dataset:'"WANT_DOES_NOT_WORK||REGION" (where=(upcase(strip(variable)) not IN ("MIN_VAL","MAX_VAL","RC"))');
stop;
end;
end;
run;
proc print data = out_01;
run;
You do your find() before you set variable=
How many "regions" (or whatever is the group key in your original dataset) do you have?
I actually guess that if you run out of memory with proc means and class, the same will happen with your hash table, as the memory requirements for both operations are roughly the same.
Hello Kurt,
Thank you for your quick response.
The definekey variable can be 1 or 5. Depending on the dataset.
I tried earlier defining the variable as an error prior to executing the find() statement, but was not able to solve the problem. My code is as below:
data _null_;
set sashelp.demographics(obs=10)end = eof;
attrib variable format=$32.;
if _n_ = 1 then do;
call missing(max_val,min_val);
declare hash h(ordered:'a');
rc = h.definekey('REGION','variable');
rc = h.definedata('REGION','variable','max_val','min_val');
rc = h.definedone();
end;
array Numvals [*] TOTALFR popurban;
do j=1 to dim(numvals);
variable=vname(numvals[j]);
rc = h.find();
do i=1 to dim(numvals);
if rc = 0 then do;
if numvals[i] > max_val then do;
max_val = numvals[i];
rc = h.replace();
end;
if numvals[i] < min_val then do;
min_val = numvals[i];
rc = h.replace();
end;
end;
else do;
max_val = numvals[i];
min_val = numvals[i];
rc = h.add();
end;
end;
if eof then do;
rc = h.output(dataset:'WANT_DOES_NOT_WORK (where=(upcase(strip(variable)) not IN ("MIN_VAL","MAX_VAL","RC"))');
end;
end;
run;
I even defined "variable" as a separate array prior to running find().
I do not have sashelp.demographics in my University Edition. Try sashelp.shoes which has 4 numeic variables. I just
adjusted the statement to fit into the right place. See whether this is what you need?
data _null_;
set sashelp.shoes end = eof;
array Numvals _NUMERIC_;
attrib variable format=$32.;
if _n_ = 1 then do;
call missing(max_val,min_val);
declare hash h(ordered:'a');
rc = h.definekey('REGION','variable');
rc = h.definedata('REGION','variable','max_val','min_val');
rc = h.definedone();
end;
*rc = h.find();
*array Numvals _NUMERIC_;
do i=1 to dim(numvals);
variable=vname(numvals[i]);
rc = h.find();
if rc = 0 then do;
if numvals[i] > max_val then do;
max_val = numvals[i];
rc = h.replace();
end;
if numvals[i] < min_val then do;
min_val = numvals[i];
rc = h.replace();
end;
end;
else do;
max_val = numvals[i];
min_val = numvals[i];
rc = h.add();
end;
if eof then do;
rc = h.output(dataset:'out_01');
*rc = h.output(dataset:'"WANT_DOES_NOT_WORK||REGION" (where=(upcase(strip(variable)) not IN ("MIN_VAL","MAX_VAL","RC"))');
stop;
end;
end;
run;
proc print data = out_01;
run;
Hash vs Summary Min and max for 4 variables by region for 189,000,000 Obs
HASH 237 seconds (does not use threads)
Summary(class) 55 seconds
Summary(by) 46 seconds
I did not make sure caches were empty and I did not run batch.
I did make mutiple runs in different order.
Terribly I/O bound?
May be able to cut in half (23 seconds)
HAVE
====
Up to 40 obs from wre.wre180 total obs=179,725,000
Obs REGION STORES SALES INVENTORY RETURNS
1 Africa 12 29761 191821 769
2 Africa 12 29761 191821 769
3 Africa 12 29761 191821 769
4 Africa 12 29761 191821 769
5 Africa 12 29761 191821 769
REGION
-------------------------
Africa
Asia
Canada
Central America/Caribbean
Eastern Europe
Middle East
Pacific
South America
United States
Western Europe
WANT
Up to 40 obs from WORK.OUT_01 total obs=40
Obs REGION VARIABLE MAX_VAL MIN_VAL
1 Africa INVENTORY 1063251 3247
2 Africa RETURNS 10124 29
3 Africa SALES 360209 801
4 Africa STORES 25 1
5 Asia INVENTORY 469007 455
6 Asia RETURNS 2941 10
7 Asia SALES 149013 937
8 Asia STORES 21 1
....
* create 180 million observations;
data wre.wre180(bufsize=128k bufno=500 keep=region inventory returns sales stores);
set sashelp.shoes;
do i=1 to 455000;
output;
end;
run;quit;
*____
| __ ) _ _
| _ \| | | |
| |_) | |_| |
|____/ \__, |
|___/
;
proc summary data=wre.wre180 max min nway;
by region;
var inventory returns sales stores;
output out=maxmin(drop=_:)
max=max_inventory max_returns max_sales max_stores
min=min_inventory min_returns min_sales stores
;
run;quit;
NOTE: There were 179725000 observations read from the data set WRE.WRE180.
NOTE: The data set WORK.MAXMIN has 10 observations and 9 variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 46.31 seconds
user cpu time 30.45 seconds
system cpu time 15.78 seconds
memory 5245.25k
OS Memory 69824.00k
Timestamp 02/17/2017 08:04:25 PM
Step Count 54 Switch Count 1
* ____ _
/ ___| | __ _ ___ ___
| | | |/ _` / __/ __|
| |___| | (_| \__ \__ \
\____|_|\__,_|___/___/
;
proc summary data=wre.wre180 max min nway;
class region;
var inventory returns sales stores;
output out=maxmin(drop=_:)
max=max_inventory max_returns max_sales max_stores
min=min_inventory min_returns min_sales stores
;
run;quit;
NOTE: There were 179725000 observations read from the data set WRE.WRE180.
NOTE: The data set WORK.MAXMIN has 11 observations and 9 variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 55.32 seconds
user cpu time 1:44.84
system cpu time 21.73 seconds
memory 10471.93k
OS Memory 72648.00k
Timestamp 02/17/2017 07:39:39 PM
Step Count 37 Switch Count 0
*_ _ _
| | | | __ _ ___| |__
| |_| |/ _` / __| '_ \
| _ | (_| \__ \ | | |
|_| |_|\__,_|___/_| |_|
;
data _null_;
set wre.wre180 end = eof;
array Numvals _NUMERIC_;
attrib variable format=$32.;
if _n_ = 1 then do;
call missing(max_val,min_val);
declare hash h(ordered:'a');
rc = h.definekey('REGION','variable');
rc = h.definedata('REGION','variable','max_val','min_val');
rc = h.definedone();
end;
*rc = h.find();
*array Numvals _NUMERIC_;
do i=1 to dim(numvals);
variable=vname(numvals[i]);
rc = h.find();
if rc = 0 then do;
if numvals[i] > max_val then do;
max_val = numvals[i];
rc = h.replace();
end;
if numvals[i] < min_val then do;
min_val = numvals[i];
rc = h.replace();
end;
end;
else do;
max_val = numvals[i];
min_val = numvals[i];
rc = h.add();
end;
if eof then do;
rc = h.output(dataset:'out_01');
*rc = h.output(dataset:'"WANT_DOES_NOT_WORK||REGION" (where=(upcase(strip(variable)) not IN ("MIN_VAL","MAX_VAL","RC"))');
stop;
end;
end;
run;
proc print data = out_01;
run;
NOTE: The data set WORK.OUT_01 has 40 observations and 4 variables.
NOTE: There were 179725000 observations read from the data set WRE.WRE180.
NOTE: DATA statement used (Total process time):
real time 3:57.20
user cpu time 3:41.56
system cpu time 15.46 seconds
memory 4190.84k
OS Memory 67508.00k
Timestamp 02/17/2017 07:45:22 PM
Step Count 38 Switch Count 8
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.