BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
left
Obsidian | Level 7

This is a follow-up question on Aggregation using hashing: counting distinct occurrences based on a string (solved and helpful comments thanks to @draycut ,@FreelanceReinhard and @hashman!).

 

Data for this scenario:

 

data _heart_src;
    set sashelp.heart (keep=smoking_status bp_status sex systolic status);
    label   sex="Sex"
            status="Survival status"
            systolic="Systolic";
run;

 

 

I've updated the code for the hashing based solution (see Aggregation using hashing: counting distinct occurrences based on a string), but would like to give one more enhancement to it:

  • Instead of having only the summarized data (n=36 by-groups) I want to create a dataset with the full detail information from the input dataset together with the summarized data
  • Solution shall be done using hashing

Can it be done without passing through the data once again (no additional DoW-Loop), e.g. by only creating further hash tables? 

 

What I think how it should work (but cannot code it):

  1. Create another hash object [detail] and add entries from first DoW-Loop
  2. Aggregated data are populated in hash table [out] (already implemented).
  3. Update hash table [detail] using [multidata: "yes"] with matching keys from [out] with [detail.ref()].

 

    dcl hash detail (ordered: "a", multidata: "y");
    detail.defineKey("smoking_status", "bp_status", "sex");
    detail.defineData("smoking_status", "bp_status", "sex", "systolic", "status", 
                      "_sum", "_mean", "_usys", "_n", "_sum_alv");
    detail.defineDone();

 

  • Output of hash table [detail]

 

So using pictures - this is what I have:

HaveHave

 

This is what I want:

WantWant

 

Updated code (what I have so far)

 

 

/* 01 */ data _null_;
/* 02 */     declare hash out(ordered: "a");
/* 03 */     out.defineKey ("smoking_status", "bp_status", "sex");
/* 04 */     out.defineData("smoking_status", "bp_status", "sex", 
/* 05 */                    "_sum", "_unq", "_sum_alv", "_n", "_mean");
/* 06 */     out.defineDone();
/* 07 */ 
/* 08 */     declare hash unq();
/* 09 */     unq.defineKey("smoking_status", "bp_status", "sex", "systolic");
/* 10 */     unq.defineDone();
/* 11 */ 
/* 12 */     do until (end);
/* 13 */         set _heart_src end=end;
/* 14 */         if (out.find() ne 0) then call missing(_sum, _unq, _n);
/* 15 */         _sum+systolic;             * _sum=sum(_sum, systolic);
/* 16 */         _n+1;                      * _n=sum(_n, 1);
/* 17 */         if (unq.check() ne 0) then do;
/* 18 */             _unq+1;                * _unq=sum(_unq, 1);
/* 19 */             unq.add();
/* 20 */         end;

/* 20a */        _sum_alv=sum(_sum_alv, status =: "A");     * _sum_alv+(status ="Alive");

/* 21 */         _mean=_sum/_n;
/* 22 */         out.replace();
/* 23 */     end;
/* 24 */     out.output(DATASET: "_hash_agg");
/* 25 */     format _mean 8.1;
/* 26 */     label  _mean="Mean (systolic)"
/* 27 */            _sum="Sum (systolic)"
/* 28 */            _unq="Unique values (systolic)"
/* 29 */            _n="N (# of subjects)"

/* 29a */           _sum_alv="Patients alive";

/* 30 */ run;

So my questions:

  • How can this be done with hashing?
  • If yes, can it be done in a single run (or is a further DoW-Loop required)?
1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hi @left,

 

Your "have" and "want" pictures are currently a bit inconsistent: some statistics from "have" (e.g. _unc) do not occur in "want", some from "want" (e.g. sum_wt_norm) cannot be based on _heart_src and some variable names differ (_sum vs. sum_sys). Most importantly, your "updated code" contains a mistake: _sum_alv is not reset to missing in the CALL MISSING routine, resulting in incorrect values for this summary statistic. (Also the STOP statement is missing.)

 

That said, here's one way to combine detail rows and summary statistics using hash object (and hash iterator) methods:

  • Yes, you can define a multidata hash object "detail", but I wouldn't want to create much redundancy by copying all summary statistics to each of its entries. Instead, I would restrict the list of data items to the detail variables needed:
    dcl hash detail(ordered: "a", multidata: "y");
    detail.defineKey("smoking_status", "bp_status", "sex");
    detail.defineData("systolic", "status");
    detail.defineDone();
  • Hash object "detail" can be populated using the ADD method after the SET statement:
    detail.add();
  • A hash iterator object could be used to traverse the hash object containing the summary statistics.
    Declaration:
    dcl hiter iout('out');
    Usage (after the DoW loop, replacing the call of the OUTPUT method):
    do while(iout.next()=0);
      do while(detail.do_over()=0);
        output;
      end;
    end;
    Of course, the OUTPUT statement requires a change to the DATA statement:
    data want;

View solution in original post

2 REPLIES 2
FreelanceReinh
Jade | Level 19

Hi @left,

 

Your "have" and "want" pictures are currently a bit inconsistent: some statistics from "have" (e.g. _unc) do not occur in "want", some from "want" (e.g. sum_wt_norm) cannot be based on _heart_src and some variable names differ (_sum vs. sum_sys). Most importantly, your "updated code" contains a mistake: _sum_alv is not reset to missing in the CALL MISSING routine, resulting in incorrect values for this summary statistic. (Also the STOP statement is missing.)

 

That said, here's one way to combine detail rows and summary statistics using hash object (and hash iterator) methods:

  • Yes, you can define a multidata hash object "detail", but I wouldn't want to create much redundancy by copying all summary statistics to each of its entries. Instead, I would restrict the list of data items to the detail variables needed:
    dcl hash detail(ordered: "a", multidata: "y");
    detail.defineKey("smoking_status", "bp_status", "sex");
    detail.defineData("systolic", "status");
    detail.defineDone();
  • Hash object "detail" can be populated using the ADD method after the SET statement:
    detail.add();
  • A hash iterator object could be used to traverse the hash object containing the summary statistics.
    Declaration:
    dcl hiter iout('out');
    Usage (after the DoW loop, replacing the call of the OUTPUT method):
    do while(iout.next()=0);
      do while(detail.do_over()=0);
        output;
      end;
    end;
    Of course, the OUTPUT statement requires a change to the DATA statement:
    data want;
left
Obsidian | Level 7

Thank you @FreelanceReinh for the corrections in my code and for providing a solution! My apologize for the inconsistencies in the above code and the screenshots. The second screenshot was from a "double DoW-Loop" solution with slightly different naming for the variables. 

 

I have updated the code, but used a slight variation of your solution due to the fact that I run SAS v9.2 which does not support the "do_over()" method. I intended to use another hash table for output instead of the dataset [want] just for the sake of finding my way.

 

/* 01 */ data _null_;
/* 01a */    * Hash table [out] for aggregated display (no duplicate keys);
/* 02 */     declare hash out(ordered: "a");
/* 03 */     out.defineKey ("smoking_status", "bp_status", "sex");
/* 04 */     out.defineData("smoking_status", "bp_status", "sex", 
/* 05 */                    "_unq", "_sum", "_mean", "_sum_alv", "_n");
/* 06 */     out.defineDone();
/* 07 */ 
/* 07a */    * Hash table [unq] used for counting unique occurrences of value [systolic] within the key of hash table [h];
/* 08 */     declare hash unq();
/* 09 */     unq.defineKey("smoking_status", "bp_status", "sex", "systolic");
/* 10 */     unq.defineDone();

/* 10a */    * Hash table [detail] for detailed data on patients status (alive/ dead) and systolic blood pressure;
/* 10b */    dcl hash detail (ordered: "a", multidata: "y");
/* 10c */    detail.defineKey("smoking_status", "bp_status", "sex");
/* 10d */    detail.defineData("smoking_status", "bp_status", "sex", "systolic", "status");
/* 10e */    dcl hiter h_detail ("detail");     * Iterator to be able to walk through detail hash table [detail];
/* 10f */    detail.defineDone();

/* 10g */    * Hash table [det_out] for output of detailed data (updated with aggregated values from hash table [h];
/* 10h */    dcl hash det_out(ordered: "a", multidata: "y");
/* 10i */    det_out.defineKey("smoking_status", "bp_status", "sex");
/* 10j */    det_out.defineData("smoking_status", "bp_status", "sex", "systolic", "_unq", "_sum", "_mean", "status", "_sum_alv", "_n");
/* 10k */    det_out.defineDone();
/* 11 */ 
/* 12 */     do until (end);
/* 13 */         set _heart_src end=end;
/* 13a */        detail.add();  * fill hash table [detail] for each observation;
/* 14 */         if (out.find() ne 0) then call missing(_sum, _unq, _n, _sum_alv);
/* 15 */         _sum+systolic;             * _sum=sum(_sum, systolic);
/* 16 */         _n+1;                      * _n=sum(_n, 1);
/* 16a */        _sum_alv=sum(_sum_alv, status =: "A");      * Count only patients alive;
/* 16b */        * Check if value is unique within its "by-group"/ key;
/* 17 */         if (unq.check() ne 0) then do;
/* 18 */             _unq+1;                * _unq=sum(_unq, 1);
/* 19 */             unq.add();
/* 20 */         end;
/* 21 */         _mean=_sum/_n;
/* 21a */        * Update hash table [h] (aggregated values only);
/* 22 */         out.replace();
/* 23 */     end;

/* 23a */    rc=h_detail.first();    * Place "pointer" to first item in hash table [detail];
/* 23b */    do while (rc=0);
/* 23c */        * Write data portion of hash table [h] into PDV with "h.find()" (for each matching key);
/* 23d */        if out.find()=0 then det_out.add();
/* 23e */        rc=h_detail.next();
/* 23f */    end;

/* 23g */    * Output of final hash tables;
/* 24 */     out.output    (dataset: "_hash_agg");      * [h]   aggravated dataset (1 entry for each key);
/* 24a */    det_out.output(dataset: "_hash_agg_all");  * [all] aggravated values for each observation (duplicate keys);

/* 25 */     format _mean 8.1;
/* 26 */     label  _mean="Mean (systolic)"
/* 27 */            _sum="Sum (systolic)"
/* 28 */            _unq="Unique values (systolic)"
/* 29 */            _n="N (# of subjects)"
/* 29a */           _sum_alv="Patients alive";
/* 29b */    stop;
/* 30 */ run;

 

The final output comes to this (to have the post complete):

Want (Aggregated data repeated for each observation)Want (Aggregated data repeated for each observation)

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 653 views
  • 0 likes
  • 2 in conversation