topic Aggregation using hashing: Repeat summarized results without the use of a hash iterator in SAS Programming

Aggregation using hashing: Repeat summarized results without the use of a hash iterator

left — Fri, 20 Mar 2020 12:20:47 GMT

This is follow-up question of Aggregation using hashing: Repeat summarized results for each observation in by-group.

Would it be possible to implement a solution using hashing without relying on a hash iterator (or an additional DoW-Loop), but by a further hash table lookup?

I wonder if the smart combination of aggravated summary result in hash table [out] and the detailed data in hash table [detail] can be used for a look-up to achieve this?

Here is the underlying code where a possible solution could be done or explained why it cannot be done without a hash iterator or a further DoW-Loop.

/* 01 */ data _null_;
/* 01a */    * Hash table [out] for aggregated display (no duplicate keys);
/* 02 */     declare hash out(ordered: "a");
/* 03 */     out.defineKey ("smoking_status", "bp_status", "sex");
/* 04 */     out.defineData("smoking_status", "bp_status", "sex", 
/* 05 */                    "_unq", "_sum", "_mean", "_sum_alv", "_n");
/* 06 */     out.defineDone();
/* 07 */ 
/* 07a */    * Hash table [unq] used for counting unique occurrences of value [systolic] within the key of hash table [h];
/* 08 */     declare hash unq();
/* 09 */     unq.defineKey("smoking_status", "bp_status", "sex", "systolic");
/* 10 */     unq.defineDone();

/* 10a */    * Hash table [detail] for detailed data on patients status (alive/ dead) and systolic blood pressure;
/* 10b */    dcl hash detail (ordered: "a", multidata: "y");
/* 10c */    detail.defineKey("smoking_status", "bp_status", "sex");
/* 10d */    detail.defineData("smoking_status", "bp_status", "sex", "systolic", "status");
/* 10e */    dcl hiter h_detail ("detail");     * Iterator to be able to walk through detail hash table [detail];
/* 10f */    detail.defineDone();

/* 10g */    * Hash table [det_out] for output of detailed data (updated with aggregated values from hash table [h];
/* 10h */    dcl hash det_out(ordered: "a", multidata: "y");
/* 10i */    det_out.defineKey("smoking_status", "bp_status", "sex");
/* 10j */    det_out.defineData("smoking_status", "bp_status", "sex", "systolic", "_unq", "_sum", "_mean", "status", "_sum_alv", "_n");
/* 10k */    det_out.defineDone();
/* 11 */ 
/* 12 */     do until (end);
/* 13 */         set _heart_src end=end;
/* 13a */        detail.add();  * fill hash table [detail] for each observation;
/* 14 */         if (out.find() ne 0) then call missing(_sum, _unq, _n, _sum_alv);
/* 15 */         _sum+systolic;             * _sum=sum(_sum, systolic);
/* 16 */         _n+1;                      * _n=sum(_n, 1);
/* 16a */        _sum_alv=sum(_sum_alv, status =: "A");      * Count only patients alive;
/* 16b */        * Check if value is unique within its "by-group"/ key;
/* 17 */         if (unq.check() ne 0) then do;
/* 18 */             _unq+1;                * _unq=sum(_unq, 1);
/* 19 */             unq.add();
/* 20 */         end;
/* 21 */         _mean=_sum/_n;
/* 21a */        * Update hash table [h] (aggregated values only);
/* 22 */         out.replace();
/* 23 */     end;

/* 23a */    rc=h_detail.first();    * Place "pointer" to first item in hash table [detail];
/* 23b */    do while (rc=0);
/* 23c */        * Write data portion of hash table [h] into PDV with "h.find()" (for each matching key);
/* 23d */        if out.find()=0 then det_out.add();
/* 23e */        rc=h_detail.next();
/* 23f */    end;

/* 23g */    * Output of final hash tables;
/* 24 */     out.output    (dataset: "_hash_agg");      * [h]   aggravated dataset (1 entry for each key);
/* 24a */    det_out.output(dataset: "_hash_agg_all");  * [all] aggravated values for each observation (duplicate keys);

/* 25 */     format _mean 8.1;
/* 26 */     label  _mean="Mean (systolic)"
/* 27 */            _sum="Sum (systolic)"
/* 28 */            _unq="Unique values (systolic)"
/* 29 */            _n="N (# of subjects)"
/* 29a */           _sum_alv="Patients alive";
/* 29b */    stop;
/* 30 */ run;

Re: Aggregation using hashing: Repeat summarized results without the use of a hash iterator

RichardDeVen — Fri, 20 Mar 2020 21:58:35 GMT

Can you describe the data set _heart_src ? Is there one row per person ? per person bp reading ? Do you have some sample data ?

Re: Aggregation using hashing: Repeat summarized results without the use of a hash iterator

left — Fri, 20 Mar 2020 23:07:07 GMT

Thank you for your reply, @RichardDeVen. The sample data is an excerpt from [sashelp.heart]:

data _heart_src;
    set sashelp.heart (keep=smoking_status bp_status sex systolic status);
    label   sex="Sex"
            status="Survival status"
            systolic="Systolic";
run;

The data set [_heart_src] has one row per subject as shown in this screenshot:

Sample data

My mentioned code gives me the following result:

Resulting output from sample data after data step with hashing + hash iterator

So this is what I want, but without using the hash iterator. I look for a solution with a further hash table facilitating a clever table-lookup instead of using the hash iterator or another DoW-Loop.If this is not possible I am glad to hear why not.

Re: Aggregation using hashing: Repeat summarized results without the use of a hash iterator

ChrisNZ — Sat, 21 Mar 2020 02:06:52 GMT

Hash tables can only access one entry at at time.

In order to calculate sum, average. etc, you need to read all these values one by one. So you need to iterate somehow.

What's bothering you?

Note that this code can be made much clearer. For example:

proc sql ; 
   create table WANT as 
   select SMOKING_STATUS, BP_STATUS, SEX, SYSTOLIC, STATUS
        , count(unique SYSTOLIC) as _UNQ
        , sum(SYSTOLIC)          as _SUM
        , mean(SYSTOLIC)         as _MEAN
        , sum(STATUS='Alive')    as _SUM_ALV
        , count(*)               as _N
   from HAVE
   group by SMOKING_STATUS, BP_STATUS, SEX
   order by SMOKING_STATUS, BP_STATUS, SEX;
quit;

You can use the options _method and magic to monitor and (hope to) influence the SQL parser

Re: Aggregation using hashing: Repeat summarized results without the use of a hash iterator

RichardDeVen — Sat, 21 Mar 2020 02:38:45 GMT

There is no 'clever lookup'. In order to output the aggregates, associated with the details, you need to loop over the details.

A better long term solution might be to use an SQL query that is more clearly understood by more people than those that can decipher the hashing code.

Example of SQL solving the same problem. This query relies on the SAS Proc SQL feature that automatically re-merges summary statistics back with the original data.

data have;
  call streaminit(123);

  do response = 1 to 1e3;
    category1 = ceil(rand('uniform', 10));
    category2 = ceil(rand('uniform', 5));
    category3 = ceil(rand('uniform', 2));

    continuous_measurement = 100 + floor(rand('uniform', 100));
    binary_categorical_response = rand('uniform') < 0.09;

    output;
  end;
run;

proc sql;
  create table want as
  select
    category1, category2, category3, continuous_measurement, binary_categorical_response
    , count(distinct continuous_measurement) as _unq
    , sum(continuous_measurement) as _sum
    , mean(continuous_measurement) as _mean format = 5.1
    , sum(binary_categorical_response) as _count_binary_assert_1
    , count(*) as _n
  from
    have
  group by 
    category1, category2, category3
  order by
    category1, category2, category3, continuous_measurement desc ;