BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
left
Obsidian | Level 7

Being new to the concept of "hashing" I want to achieve the following with the dataset [sashelp.heart]: 

  1. Define the by-group/ key-variables as [smoking_status, bp_status, sex]
  2. Aggregate a numeric variable [systolic]
  3. Count the number of items in each by-group/ key in new variable [_n]
  4. Count the number of distinct values for [systolic] within each by-group/ key in new variable [_unq]
  5. Additionally: Get the number of patients alive from dichotomous character variable [Status], i.e. Status="Alive" for each by-group/ key in new variable [_n_alive]

 

What I already have thanks to @hashman and @DonH (Data Aggregation using the SAS Hash Object) is points 1 - 4 with the following:

 

<-- Load the sample data -->

 

data _heart_src;
    set sashelp.heart (keep=smoking_status bp_status sex systolic status);
    label   sex="Sex"
            status="Survival status"
            systolic="Systolic";
run;

 

Dataset at startDataset at start

<-- Aggregation using hashing -->

 

 

/* 01 */ data _null_;
/* 02 */     declare hash out(ordered: "a");
/* 03 */     out.defineKey ("smoking_status", "bp_status", "sex");
/* 04 */     out.defineData("smoking_status", "bp_status", "sex", 
/* 05 */                    "_sum", "_unq", "_n", "_mean");
/* 06 */     out.defineDone();
/* 07 */ 
/* 08 */     declare hash unq();
/* 09 */     unq.defineKey("smoking_status", "bp_status", "sex", "systolic");
/* 10 */     unq.defineDone();
/* 11 */ 
/* 12 */     do until (end);
/* 13 */         set _heart_src end=end;
/* 14 */         if (out.find() ne 0) then call missing(_sum, _unq, _n);
/* 15 */         _sum=sum(_sum, systolic);
/* 16 */         _n=sum(_n, 1);
/* 17 */         if (unq.check() ne 0) then do;
/* 18 */             _unq=sum(_unq, 1);
/* 19 */             unq.add();
/* 20 */         end;
/* 21 */         _mean=_sum/_n;
/* 22 */         out.replace();
/* 23 */     end;
/* 24 */     out.output(DATASET: "_hash_agg");
/* 25 */     format _mean 8.1;
/* 26 */     label  _mean="Mean (systolic)"
/* 27 */            _sum="Sum (systolic)"
/* 28 */            _unq="Unique values (systolic)"
/* 29 */            _n="N (# of subjects)";
/* 30 */ run;

 Note: I added line numbers for later references (if needed).

 

 

This results in:

hashing_aggregated_result.png

 

I don't know how to implement point 5 (number of patients alive for each by-group/ key).

I think another hash table and maybe a hash iterator is needed. 

 

I can solve point 5 with a DoW-Loop (no hashing, previous sorting necessary). I guess my code could surely be optimized (please add your comments). Maybe counting variables could be handled with "suminc" and *.sum method? 

 

Another shortcoming in my understanding is: When is the hash table [out] populated as no dataset is directly assigned in the definition? At the moment the first [out.find()] method is called? Is it then, that the whole dataset [_heart_src] is loaded? 

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

@left Hi and welcome to the SAS Communities 🙂

 

First off, I like your approach. This is a nice task to get a grasp of the hash object and it seems you're well on your way.

 

As to your question on when the hash object out is populated.. This happens in the out.replace() line. Not on the first find() call. As the doc says: The Replace method "Replaces the data that is associated with the specified key with new data.". Read more in the Replace Method Documentation.

 

It seems that you have found a way to solve point 5. Though you would still have to add the new variable to the hash object.

View solution in original post

6 REPLIES 6
PeterClemmensen
Tourmaline | Level 20

@left Hi and welcome to the SAS Communities 🙂

 

First off, I like your approach. This is a nice task to get a grasp of the hash object and it seems you're well on your way.

 

As to your question on when the hash object out is populated.. This happens in the out.replace() line. Not on the first find() call. As the doc says: The Replace method "Replaces the data that is associated with the specified key with new data.". Read more in the Replace Method Documentation.

 

It seems that you have found a way to solve point 5. Though you would still have to add the new variable to the hash object.

left
Obsidian | Level 7

@PeterClemmensenThank you for your quick reply and further explanations! 

 

It seems to me as a mighty tool. I guess it needs time to get acquainted with all the uses and how things work in concert. 

 

So just to have the final solution fixed (added new variable [_sum_alv] to hash object in line 05 and added new code as line 20a):

/* 01 */ data _null_;
/* 02 */     declare hash out(ordered: "a");
/* 03 */     out.defineKey ("smoking_status", "bp_status", "sex");
/* 04 */     out.defineData("smoking_status", "bp_status", "sex", 
/* 05 */                    "_sum", "_unq", "_sum_alv", "_n", "_mean");
/* 06 */     out.defineDone();
/* 07 */ 
/* 08 */     declare hash unq();
/* 09 */     unq.defineKey("smoking_status", "bp_status", "sex", "systolic");
/* 10 */     unq.defineDone();
/* 11 */ 
/* 12 */     do until (end);
/* 13 */         set _heart_src end=end;
/* 14 */         if (out.find() ne 0) then call missing(_sum, _unq, _n);
/* 15 */         _sum=sum(_sum, systolic);
/* 16 */         _n=sum(_n, 1);
/* 17 */         if (unq.check() ne 0) then do;
/* 18 */             _unq=sum(_unq, 1);
/* 19 */             unq.add();
/* 20 */         end;

/* 20a */        if lowcase(status)="alive" then _sum_alv=sum(_sum_alv, 1);

/* 21 */         _mean=_sum/_n;
/* 22 */         out.replace();
/* 23 */     end;
/* 24 */     out.output(DATASET: "_hash_agg");
/* 25 */     format _mean 8.1;
/* 26 */     label  _mean="Mean (systolic)"
/* 27 */            _sum="Sum (systolic)"
/* 28 */            _unq="Unique values (systolic)"
/* 29 */            _n="N (# of subjects)";
/* 30 */ run;
FreelanceReinh
Jade | Level 19

Hello @left,

 

First of all, thank you for creating your post in exemplary style.

 

To implement point 5 you can simply

  1. insert "_n_alive" into the list of data items in line 5,
  2. insert _n_alive into the argument of CALL MISSING in line 14,
  3. insert a sum statement like
    _n_alive+(status='Alive');
    between lines 16 and 17,
  4. add a label for _n_alive in the label statement after line 29 (and move the semicolon accordingly).

 

Your code is good. To ensure that the DECLARE statements are executed only once, I would insert a STOP statement at the end of the DATA step (before the RUN statement). As a minor simplification you can replace the assignments of the form "var=sum(var, increment);" by sum statements of the form "var+increment;" (as suggested for _n_alive above).

 

Using suminc would be possible, but I don't see a benefit in this case.

PeterClemmensen
Tourmaline | Level 20

As an additional comment, the great Data Management Solutions Using SAS Hash Table Operations: A Business Intelligence Case Study by the same two authors you mention has an entire chapter on aggretates. 

hashman
Ammonite | Level 13

@left : Just add _n_alive to the data portion of OUT and and CALL MISSING and also add: 

 

 _n_alive = sum (_n_alive, Status =: "A") ;

 in the same vein as _n and _sum.

 

Kind regards

Paul D.

left
Obsidian | Level 7

Thank you for all of your feedback, @PeterClemmensen ,@FreelanceReinh and @hashman

 

I've updated the code, but would like to give one more enhancement to it: Instead of having only the summarized data (n=36 by-groups) I want to create the dataset with the full detail information along with the summarized data without creating another datastep. 

Can it be done without passing through the data once again by creating further hash tables? 

 

What I think how it should work (but cannot code it):

  1. Create another hash object [detail] and add entries from first DoW-Loop
  2. Aggregated data are populated in hash table [out] (already implemented).
  3. Update hash table [detail] using [multidata: "yes"] with matching keys from [out] with [detail.ref()].
        dcl hash detail (ordered: "a", multidata: "y");
        detail.defineKey("smoking_status", "bp_status", "sex");
        detail.defineData("smoking_status", "bp_status", "sex", "systolic", "status", 
                          "_sum", "_mean", "_usys", "_n", "_sum_alv");
        detail.defineDone();
  4. Output of hash table [detail]

 

So using pictures - this is what I have:

HaveHave

 

This is what I want:

WantWant

 

Updated code (what I have so far)

/* 01 */ data _null_;
/* 02 */     declare hash out(ordered: "a");
/* 03 */     out.defineKey ("smoking_status", "bp_status", "sex");
/* 04 */     out.defineData("smoking_status", "bp_status", "sex", 
/* 05 */                    "_sum", "_unq", "_sum_alv", "_n", "_mean");
/* 06 */     out.defineDone();
/* 07 */ 
/* 08 */     declare hash unq();
/* 09 */     unq.defineKey("smoking_status", "bp_status", "sex", "systolic");
/* 10 */     unq.defineDone();
/* 11 */ 
/* 12 */     do until (end);
/* 13 */         set _heart_src end=end;
/* 14 */         if (out.find() ne 0) then call missing(_sum, _unq, _n);
/* 15 */         _sum+systolic;             * _sum=sum(_sum, systolic);
/* 16 */         _n+1;                      * _n=sum(_n, 1);
/* 17 */         if (unq.check() ne 0) then do;
/* 18 */             _unq+1;                * _unq=sum(_unq, 1);
/* 19 */             unq.add();
/* 20 */         end;

/* 20a */        _sum_alv=sum(_sum_alv, status =: "A");     * _sum_alv+(status =: "A");

/* 21 */         _mean=_sum/_n;
/* 22 */         out.replace();
/* 23 */     end;
/* 24 */     out.output(DATASET: "_hash_agg");
/* 25 */     format _mean 8.1;
/* 26 */     label  _mean="Mean (systolic)"
/* 27 */            _sum="Sum (systolic)"
/* 28 */            _unq="Unique values (systolic)"
/* 29 */            _n="N (# of subjects)"

/* 29a */           _sum_alv="Patients alive";

/* 30 */ run;

So my questions:

  • How can this be done with hashing?
  • If yes, is a further DoW-Loop required or can it be done in a single run? 

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 2195 views
  • 6 likes
  • 4 in conversation