Being new to the concept of "hashing" I want to achieve the following with the dataset [sashelp.heart]:
What I already have thanks to @hashman and @DonH (Data Aggregation using the SAS Hash Object) is points 1 - 4 with the following:
<-- Load the sample data -->
data _heart_src;
set sashelp.heart (keep=smoking_status bp_status sex systolic status);
label sex="Sex"
status="Survival status"
systolic="Systolic";
run;
<-- Aggregation using hashing -->
/* 01 */ data _null_;
/* 02 */ declare hash out(ordered: "a");
/* 03 */ out.defineKey ("smoking_status", "bp_status", "sex");
/* 04 */ out.defineData("smoking_status", "bp_status", "sex",
/* 05 */ "_sum", "_unq", "_n", "_mean");
/* 06 */ out.defineDone();
/* 07 */
/* 08 */ declare hash unq();
/* 09 */ unq.defineKey("smoking_status", "bp_status", "sex", "systolic");
/* 10 */ unq.defineDone();
/* 11 */
/* 12 */ do until (end);
/* 13 */ set _heart_src end=end;
/* 14 */ if (out.find() ne 0) then call missing(_sum, _unq, _n);
/* 15 */ _sum=sum(_sum, systolic);
/* 16 */ _n=sum(_n, 1);
/* 17 */ if (unq.check() ne 0) then do;
/* 18 */ _unq=sum(_unq, 1);
/* 19 */ unq.add();
/* 20 */ end;
/* 21 */ _mean=_sum/_n;
/* 22 */ out.replace();
/* 23 */ end;
/* 24 */ out.output(DATASET: "_hash_agg");
/* 25 */ format _mean 8.1;
/* 26 */ label _mean="Mean (systolic)"
/* 27 */ _sum="Sum (systolic)"
/* 28 */ _unq="Unique values (systolic)"
/* 29 */ _n="N (# of subjects)";
/* 30 */ run;
Note: I added line numbers for later references (if needed).
This results in:
I don't know how to implement point 5 (number of patients alive for each by-group/ key).
I think another hash table and maybe a hash iterator is needed.
I can solve point 5 with a DoW-Loop (no hashing, previous sorting necessary). I guess my code could surely be optimized (please add your comments). Maybe counting variables could be handled with "suminc" and *.sum method?
Another shortcoming in my understanding is: When is the hash table [out] populated as no dataset is directly assigned in the definition? At the moment the first [out.find()] method is called? Is it then, that the whole dataset [_heart_src] is loaded?
@left Hi and welcome to the SAS Communities 🙂
First off, I like your approach. This is a nice task to get a grasp of the hash object and it seems you're well on your way.
As to your question on when the hash object out is populated.. This happens in the out.replace() line. Not on the first find() call. As the doc says: The Replace method "Replaces the data that is associated with the specified key with new data.". Read more in the Replace Method Documentation.
It seems that you have found a way to solve point 5. Though you would still have to add the new variable to the hash object.
@left Hi and welcome to the SAS Communities 🙂
First off, I like your approach. This is a nice task to get a grasp of the hash object and it seems you're well on your way.
As to your question on when the hash object out is populated.. This happens in the out.replace() line. Not on the first find() call. As the doc says: The Replace method "Replaces the data that is associated with the specified key with new data.". Read more in the Replace Method Documentation.
It seems that you have found a way to solve point 5. Though you would still have to add the new variable to the hash object.
@PeterClemmensenThank you for your quick reply and further explanations!
It seems to me as a mighty tool. I guess it needs time to get acquainted with all the uses and how things work in concert.
So just to have the final solution fixed (added new variable [_sum_alv] to hash object in line 05 and added new code as line 20a):
/* 01 */ data _null_;
/* 02 */ declare hash out(ordered: "a");
/* 03 */ out.defineKey ("smoking_status", "bp_status", "sex");
/* 04 */ out.defineData("smoking_status", "bp_status", "sex",
/* 05 */ "_sum", "_unq", "_sum_alv", "_n", "_mean");
/* 06 */ out.defineDone();
/* 07 */
/* 08 */ declare hash unq();
/* 09 */ unq.defineKey("smoking_status", "bp_status", "sex", "systolic");
/* 10 */ unq.defineDone();
/* 11 */
/* 12 */ do until (end);
/* 13 */ set _heart_src end=end;
/* 14 */ if (out.find() ne 0) then call missing(_sum, _unq, _n);
/* 15 */ _sum=sum(_sum, systolic);
/* 16 */ _n=sum(_n, 1);
/* 17 */ if (unq.check() ne 0) then do;
/* 18 */ _unq=sum(_unq, 1);
/* 19 */ unq.add();
/* 20 */ end;
/* 20a */ if lowcase(status)="alive" then _sum_alv=sum(_sum_alv, 1);
/* 21 */ _mean=_sum/_n;
/* 22 */ out.replace();
/* 23 */ end;
/* 24 */ out.output(DATASET: "_hash_agg");
/* 25 */ format _mean 8.1;
/* 26 */ label _mean="Mean (systolic)"
/* 27 */ _sum="Sum (systolic)"
/* 28 */ _unq="Unique values (systolic)"
/* 29 */ _n="N (# of subjects)";
/* 30 */ run;
Hello @left,
First of all, thank you for creating your post in exemplary style.
To implement point 5 you can simply
_n_alive+(status='Alive');
between lines 16 and 17,
Your code is good. To ensure that the DECLARE statements are executed only once, I would insert a STOP statement at the end of the DATA step (before the RUN statement). As a minor simplification you can replace the assignments of the form "var=sum(var, increment);" by sum statements of the form "var+increment;" (as suggested for _n_alive above).
Using suminc would be possible, but I don't see a benefit in this case.
As an additional comment, the great Data Management Solutions Using SAS Hash Table Operations: A Business Intelligence Case Study by the same two authors you mention has an entire chapter on aggretates.
@left : Just add _n_alive to the data portion of OUT and and CALL MISSING and also add:
_n_alive = sum (_n_alive, Status =: "A") ;
in the same vein as _n and _sum.
Kind regards
Paul D.
Thank you for all of your feedback, @PeterClemmensen ,@FreelanceReinh and @hashman!
I've updated the code, but would like to give one more enhancement to it: Instead of having only the summarized data (n=36 by-groups) I want to create the dataset with the full detail information along with the summarized data without creating another datastep.
Can it be done without passing through the data once again by creating further hash tables?
What I think how it should work (but cannot code it):
dcl hash detail (ordered: "a", multidata: "y");
detail.defineKey("smoking_status", "bp_status", "sex");
detail.defineData("smoking_status", "bp_status", "sex", "systolic", "status",
"_sum", "_mean", "_usys", "_n", "_sum_alv");
detail.defineDone();
So using pictures - this is what I have:
This is what I want:
Updated code (what I have so far)
/* 01 */ data _null_;
/* 02 */ declare hash out(ordered: "a");
/* 03 */ out.defineKey ("smoking_status", "bp_status", "sex");
/* 04 */ out.defineData("smoking_status", "bp_status", "sex",
/* 05 */ "_sum", "_unq", "_sum_alv", "_n", "_mean");
/* 06 */ out.defineDone();
/* 07 */
/* 08 */ declare hash unq();
/* 09 */ unq.defineKey("smoking_status", "bp_status", "sex", "systolic");
/* 10 */ unq.defineDone();
/* 11 */
/* 12 */ do until (end);
/* 13 */ set _heart_src end=end;
/* 14 */ if (out.find() ne 0) then call missing(_sum, _unq, _n);
/* 15 */ _sum+systolic; * _sum=sum(_sum, systolic);
/* 16 */ _n+1; * _n=sum(_n, 1);
/* 17 */ if (unq.check() ne 0) then do;
/* 18 */ _unq+1; * _unq=sum(_unq, 1);
/* 19 */ unq.add();
/* 20 */ end;
/* 20a */ _sum_alv=sum(_sum_alv, status =: "A"); * _sum_alv+(status =: "A");
/* 21 */ _mean=_sum/_n;
/* 22 */ out.replace();
/* 23 */ end;
/* 24 */ out.output(DATASET: "_hash_agg");
/* 25 */ format _mean 8.1;
/* 26 */ label _mean="Mean (systolic)"
/* 27 */ _sum="Sum (systolic)"
/* 28 */ _unq="Unique values (systolic)"
/* 29 */ _n="N (# of subjects)"
/* 29a */ _sum_alv="Patients alive";
/* 30 */ run;
So my questions:
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.