Solved: Aggregation using hashing: counting distinct occurrences based on a st...

left · Posted 03-17-2020 11:19 AM

Being new to the concept of "hashing" I want to achieve the following with the dataset [sashelp.heart]:

Define the by-group/ key-variables as [smoking_status, bp_status, sex]
Aggregate a numeric variable [systolic]
Count the number of items in each by-group/ key in new variable [_n]
Count the number of distinct values for [systolic] within each by-group/ key in new variable [_unq]
Additionally: Get the number of patients alive from dichotomous character variable [Status], i.e. Status="Alive" for each by-group/ key in new variable [_n_alive]

What I already have thanks to @hashman and @DonH (Data Aggregation using the SAS Hash Object) is points 1 - 4 with the following:

<-- Load the sample data -->

data _heart_src;
    set sashelp.heart (keep=smoking_status bp_status sex systolic status);
    label   sex="Sex"
            status="Survival status"
            systolic="Systolic";
run;

Dataset at start

<-- Aggregation using hashing -->

/* 01 */ data _null_;
/* 02 */     declare hash out(ordered: "a");
/* 03 */     out.defineKey ("smoking_status", "bp_status", "sex");
/* 04 */     out.defineData("smoking_status", "bp_status", "sex", 
/* 05 */                    "_sum", "_unq", "_n", "_mean");
/* 06 */     out.defineDone();
/* 07 */ 
/* 08 */     declare hash unq();
/* 09 */     unq.defineKey("smoking_status", "bp_status", "sex", "systolic");
/* 10 */     unq.defineDone();
/* 11 */ 
/* 12 */     do until (end);
/* 13 */         set _heart_src end=end;
/* 14 */         if (out.find() ne 0) then call missing(_sum, _unq, _n);
/* 15 */         _sum=sum(_sum, systolic);
/* 16 */         _n=sum(_n, 1);
/* 17 */         if (unq.check() ne 0) then do;
/* 18 */             _unq=sum(_unq, 1);
/* 19 */             unq.add();
/* 20 */         end;
/* 21 */         _mean=_sum/_n;
/* 22 */         out.replace();
/* 23 */     end;
/* 24 */     out.output(DATASET: "_hash_agg");
/* 25 */     format _mean 8.1;
/* 26 */     label  _mean="Mean (systolic)"
/* 27 */            _sum="Sum (systolic)"
/* 28 */            _unq="Unique values (systolic)"
/* 29 */            _n="N (# of subjects)";
/* 30 */ run;

Note: I added line numbers for later references (if needed).

This results in:

I don't know how to implement point 5 (number of patients alive for each by-group/ key).

I think another hash table and maybe a hash iterator is needed.

I can solve point 5 with a DoW-Loop (no hashing, previous sorting necessary). I guess my code could surely be optimized (please add your comments). Maybe counting variables could be handled with "suminc" and *.sum method?

Another shortcoming in my understanding is: When is the hash table [out] populated as no dataset is directly assigned in the definition? At the moment the first [out.find()] method is called? Is it then, that the whole dataset [_heart_src] is loaded?

PeterClemmensen · Posted 03-17-2020 12:09 PM

@left Hi and welcome to the SAS Communities 🙂

First off, I like your approach. This is a nice task to get a grasp of the hash object and it seems you're well on your way.

As to your question on when the hash object out is populated.. This happens in the out.replace() line. Not on the first find() call. As the doc says: The Replace method "Replaces the data that is associated with the specified key with new data.". Read more in the Replace Method Documentation.

It seems that you have found a way to solve point 5. Though you would still have to add the new variable to the hash object.

The DATA to DATA Step Macro
Blog: SASnrd

View solution in original post

PeterClemmensen · Posted 03-17-2020 12:09 PM

@left Hi and welcome to the SAS Communities 🙂

First off, I like your approach. This is a nice task to get a grasp of the hash object and it seems you're well on your way.

As to your question on when the hash object out is populated.. This happens in the out.replace() line. Not on the first find() call. As the doc says: The Replace method "Replaces the data that is associated with the specified key with new data.". Read more in the Replace Method Documentation.

It seems that you have found a way to solve point 5. Though you would still have to add the new variable to the hash object.

The DATA to DATA Step Macro
Blog: SASnrd

left · Posted 03-17-2020 12:24 PM

@PeterClemmensenThank you for your quick reply and further explanations!

It seems to me as a mighty tool. I guess it needs time to get acquainted with all the uses and how things work in concert.

So just to have the final solution fixed (added new variable [_sum_alv] to hash object in line 05 and added new code as line 20a):

/* 01 */ data _null_;
/* 02 */     declare hash out(ordered: "a");
/* 03 */     out.defineKey ("smoking_status", "bp_status", "sex");
/* 04 */     out.defineData("smoking_status", "bp_status", "sex", 
/* 05 */                    "_sum", "_unq", "_sum_alv", "_n", "_mean");
/* 06 */     out.defineDone();
/* 07 */ 
/* 08 */     declare hash unq();
/* 09 */     unq.defineKey("smoking_status", "bp_status", "sex", "systolic");
/* 10 */     unq.defineDone();
/* 11 */ 
/* 12 */     do until (end);
/* 13 */         set _heart_src end=end;
/* 14 */         if (out.find() ne 0) then call missing(_sum, _unq, _n);
/* 15 */         _sum=sum(_sum, systolic);
/* 16 */         _n=sum(_n, 1);
/* 17 */         if (unq.check() ne 0) then do;
/* 18 */             _unq=sum(_unq, 1);
/* 19 */             unq.add();
/* 20 */         end;

/* 20a */        if lowcase(status)="alive" then _sum_alv=sum(_sum_alv, 1);

/* 21 */         _mean=_sum/_n;
/* 22 */         out.replace();
/* 23 */     end;
/* 24 */     out.output(DATASET: "_hash_agg");
/* 25 */     format _mean 8.1;
/* 26 */     label  _mean="Mean (systolic)"
/* 27 */            _sum="Sum (systolic)"
/* 28 */            _unq="Unique values (systolic)"
/* 29 */            _n="N (# of subjects)";
/* 30 */ run;

FreelanceReinh · Posted 03-17-2020 12:37 PM

Hello @left,

First of all, thank you for creating your post in exemplary style.

To implement point 5 you can simply

insert "_n_alive" into the list of data items in line 5,
insert _n_alive into the argument of CALL MISSING in line 14,
insert a sum statement like
```
_n_alive+(status='Alive');
```
between lines 16 and 17,
add a label for _n_alive in the label statement after line 29 (and move the semicolon accordingly).

Your code is good. To ensure that the DECLARE statements are executed only once, I would insert a STOP statement at the end of the DATA step (before the RUN statement). As a minor simplification you can replace the assignments of the form "var=sum(var, increment);" by sum statements of the form "var+increment;" (as suggested for _n_alive above).

Using suminc would be possible, but I don't see a benefit in this case.

PeterClemmensen · Posted 03-17-2020 12:49 PM

As an additional comment, the great Data Management Solutions Using SAS Hash Table Operations: A Business Intelligence Case Study by the same two authors you mention has an entire chapter on aggretates.

The DATA to DATA Step Macro
Blog: SASnrd

hashman · Posted 03-17-2020 01:42 PM

@left : Just add _n_alive to the data portion of OUT and and CALL MISSING and also add:

 _n_alive = sum (_n_alive, Status =: "A") ;

in the same vein as _n and _sum.

Kind regards

Paul D.

left · Posted 03-18-2020 06:39 AM

Thank you for all of your feedback, @PeterClemmensen ,@FreelanceReinh and @hashman!

I've updated the code, but would like to give one more enhancement to it: Instead of having only the summarized data (n=36 by-groups) I want to create the dataset with the full detail information along with the summarized data without creating another datastep.

Can it be done without passing through the data once again by creating further hash tables?

What I think how it should work (but cannot code it):

Create another hash object [detail] and add entries from first DoW-Loop
Aggregated data are populated in hash table [out] (already implemented).

Update hash table [detail] using [multidata: "yes"] with matching keys from [out] with [detail.ref()].

    dcl hash detail (ordered: "a", multidata: "y");
    detail.defineKey("smoking_status", "bp_status", "sex");
    detail.defineData("smoking_status", "bp_status", "sex", "systolic", "status", 
                      "_sum", "_mean", "_usys", "_n", "_sum_alv");
    detail.defineDone();

Output of hash table [detail]

So using pictures - this is what I have:

Have

This is what I want:

Want

Updated code (what I have so far)

/* 01 */ data _null_;
/* 02 */     declare hash out(ordered: "a");
/* 03 */     out.defineKey ("smoking_status", "bp_status", "sex");
/* 04 */     out.defineData("smoking_status", "bp_status", "sex", 
/* 05 */                    "_sum", "_unq", "_sum_alv", "_n", "_mean");
/* 06 */     out.defineDone();
/* 07 */ 
/* 08 */     declare hash unq();
/* 09 */     unq.defineKey("smoking_status", "bp_status", "sex", "systolic");
/* 10 */     unq.defineDone();
/* 11 */ 
/* 12 */     do until (end);
/* 13 */         set _heart_src end=end;
/* 14 */         if (out.find() ne 0) then call missing(_sum, _unq, _n);
/* 15 */         _sum+systolic;             * _sum=sum(_sum, systolic);
/* 16 */         _n+1;                      * _n=sum(_n, 1);
/* 17 */         if (unq.check() ne 0) then do;
/* 18 */             _unq+1;                * _unq=sum(_unq, 1);
/* 19 */             unq.add();
/* 20 */         end;

/* 20a */        _sum_alv=sum(_sum_alv, status =: "A");     * _sum_alv+(status =: "A");

/* 21 */         _mean=_sum/_n;
/* 22 */         out.replace();
/* 23 */     end;
/* 24 */     out.output(DATASET: "_hash_agg");
/* 25 */     format _mean 8.1;
/* 26 */     label  _mean="Mean (systolic)"
/* 27 */            _sum="Sum (systolic)"
/* 28 */            _unq="Unique values (systolic)"
/* 29 */            _n="N (# of subjects)"

/* 29a */           _sum_alv="Patients alive";

/* 30 */ run;

So my questions:

How can this be done with hashing?
If yes, is a further DoW-Loop required or can it be done in a single run?

Aggregation using hashing: counting distinct occurrences based on a string

Re: Aggregation using hashing: counting distinct occurrences based on a string

Re: Aggregation using hashing: counting distinct occurrences based on a string

Re: Aggregation using hashing: counting distinct occurrences based on a string

Re: Aggregation using hashing: counting distinct occurrences based on a string

Re: Aggregation using hashing: counting distinct occurrences based on a string

Re: Aggregation using hashing: counting distinct occurrences based on a string

Re: Aggregation using hashing: counting distinct occurrences based on a string

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!