Hashing: Left join on subsets of identical dataset (improve working example)

left — Wed, 24 Jun 2020 09:47:46 GMT

Being still novel to hashing I wonder if anyone can make my attempt for a left-join more elegant and/ or efficient.

Here is what I have (first generating sample data as preparation called [have_source], then calculate needed structure in dataset [have]) :

* Generate sample data;
data have_source (drop=i);
    do id=1 to 1000;
        do i=65 to 65+6;                            * Generate letters A-G;
            do arefpce=1 to 9;                      * [arefpce] = episode counter (episode can be x days);
                paramcd=byte(i);
                aval=rand("integer", 0, 15);        * Random number of events within one episode of type "paramcd";
                output;
            end;
        end;
    end;
run;

Create needed structure using data [have_source]:

data have (drop= aval_:);
    do until (last.paramcd);
        set have_source;
        by id paramcd arefpce;
        * Summarize number of events in episodes 2-3;
        if (arefpce >= 2 AND arefpce <= 3) then do;
            aval_23=sum(aval_23, aval);
        end;
        * Summarize number of events in episodes 4-9;
        if (arefpce >= 4 AND arefpce <= 9) then do;
            aval_49=sum(aval_49, aval);
        end;
        * Summarize number of events in episodes 7-9;
        if (arefpce >= 7 AND arefpce <= 9) then do;
            aval_79=sum(aval_79, aval);
        end;
    end;
    * Print summarized results for each time frame ([arefpce]=2-3, 4-9, 7-9) separately to output;
    arefpce=203;
    aval=aval_23;
    output;
    arefpce=409;
    aval=aval_49;
    output;
    arefpce=709;
    aval=aval_79;
    output;
    * Run through dataset again to output "detail" data where calculations were based on;
    do until (last.paramcd);
        set have;
        by id paramcd arefpce;
        output;
    end;
run;

Sort it:

* Sort dataset to have the detail data before the summarized data;
proc sort data=have;
    by id paramcd arefpce;
run;

The first few lines look like this:

Dataset "have" (first few lines)

------------------

Now the "want" part: A left join using subsets of the [have] dataset.

First the SQL solution (serving as reference):

proc sql;
    create table want_sql as
           select l.id
                , l.paramcd
                , l.aval    as aval23
                , r49.aval  as aval49
                , r79.aval  as aval79
           from           have (where=(arefpce eq 203)) as l
                left join have (where=(arefpce eq 409)) as r49 on l.id eq r49.id and l.paramcd eq r49.paramcd
                left join have (where=(arefpce eq 709)) as r79 on l.id eq r79.id and l.paramcd eq r79.paramcd;
quit;

The same using base SAS:

data want_base (drop=arefpce:);
    merge have (where=(arefpce_23 = 203) rename=(aval=aval_23 arefpce=arefpce_23) in=h_23)
          have (where=(arefpce_49 = 409) rename=(aval=aval_49 arefpce=arefpce_49) in=h_49)
          have (where=(arefpce_79 = 709) rename=(aval=aval_79 arefpce=arefpce_79) in=h_79);
    by id paramcd;
    if h_23;
run;

The output looks like this:

Dataset [want] (first few lines)

--------------------------------

My attempt on hashing works, but is lacking clarity/ efficiency and also not so concise than I imagine it could be:

* Want --> Left join using hashing (not SQL) in an elegant and/ or efficient manner;
data _null_;
    if 0 then set have (keep=id paramcd /*aval*/)
                  have (keep=id aval rename=(aval=aval23))
                  have (keep=id aval rename=(aval=aval49))
                  have (keep=id aval rename=(aval=aval79));
    length table  $41
           _where $200;
    if _n_=1 then do;
        declare hash h_out(ordered: "yes");
        h_out.defineKey("id", "paramcd");
        h_out.defineData("id", "paramcd", "aval23", "aval49", "aval79");
        h_out.defineDone();

        table="have";
        _where="(where=(arefpce=203) rename=(aval=aval23))";
        declare hash h23(dataset: cats(table, _where));
        h23.defineKey("id", "paramcd");
        h23.defineData("aval23");
        h23.defineDone();

        _where="(where=(arefpce=409) rename=(aval=aval49))";
        declare hash h49(dataset: cats(table, _where));
        h49.defineKey("id", "paramcd");
        h49.defineData("aval49");
        h49.defineDone();

        _where="(where=(arefpce=709) rename=(aval=aval79))";
        declare hash h79(dataset: cats(table, _where));
        h79.defineKey("id", "paramcd");
        h79.defineData("aval79");
        h79.defineDone();

        call missing(aval23, aval49, aval79);
    end;
    do until (done);
        set have (where=(arefpce=203)) end=done;
        rc=h23.find();
        if (rc=0) then h_out.replace();
        if (rc=0 AND h49.find()=0) then h_out.replace();
        else do;
            call missing(aval49);
            h_out.replace();
        end;
        if (rc=0 AND h79.find()=0) then h_out.replace();
        else do;
            call missing(aval79);
            h_out.replace();
        end;
    end;
    h_out.output(dataset: "want");
run;

Obviously solutions can emphasize several aspects (brevity, readability, flexibility, elegance, etc.) and therefore be rather cryptic, verbatim, more complex (hash of hashes), etc. So I'm happy to learn some more options on this example.

Re: Hashing: Left join on subsets of identical dataset (improve working example)

PeterClemmensen — Wed, 24 Jun 2020 11:30:40 GMT

Here is a simple hash "left join"

data want_hash (drop=rc);
    if _N_ = 1 then do;
        dcl hash h49 (dataset : "have(where=(arefpce=409) rename=(aval=aval49))");
        h49.defineKey("id", "paramcd");
        h49.defineData("aval49");
        h49.defineDone();
        dcl hash h79 (dataset : "have(where=(arefpce=709) rename=(aval=aval79))");
        h79.defineKey("id", "paramcd");
        h79.defineData("aval79");
        h79.defineDone();
    end;

    set have(where=(arefpce=203) rename=(aval=aval23));
    call missing (aval49, aval79);

    rc = h49.find();
    rc = h79.find();
run;

Re: Hashing: Left join on subsets of identical dataset (improve working example)

PeterClemmensen — Wed, 24 Jun 2020 11:45:49 GMT

Also, you could do:

data want_hash (drop=aval rc);

    dcl hash h49 ();
    h49.defineKey("id", "paramcd");
    h49.defineData("aval");
    h49.defineDone();
    dcl hash h79 ();
    h79.defineKey("id", "paramcd");
    h79.defineData("aval");
    h79.defineDone();

    do until (lr1);
        set have (where=(arefpce in (409, 709))) end=lr1 ;
        if   arefpce=409 then h49.add();
        else                  h79.add();
    end;

    do until (lr2);
        set have (where=(arefpce = 203)) end=lr2;
        aval23 = aval;
        rc = h49.find(); aval49 = aval;
        rc = h79.find(); aval79 = aval;
        output;
    end;

run;

Re: Hashing: Left join on subsets of identical dataset (improve working example)

s_lassen — Wed, 24 Jun 2020 11:48:08 GMT

You create a hash for the AVAL23 subset, but you also read the data directly. Why? And there is no reason to put the output in a hash. That is mostly used when you want to create a number of tables dynamically.

I do not think hashes are called for in this case, but if you must do it, just use them to get at the two additional tables in the left join:

data want;
  set have (where=(arefpce=203) rename=(aval=aval23));
  if _N_=1 then do;
    declare hash h49(dataset: 'have(where=(arefpce=409)');
    h49.defineKey("id", "paramcd");
    h49.defineData("aval");
    h49.defineDone();
    
    declare hash h79(dataset: 'have(where=(arefpce=709)');
    h79.defineKey("id", "paramcd");
    h79.defineData("aval");
    h79.defineDone();
    end;
  if h49.find()=0 then
    aval49=aval;
  if h79.find()=0 then
    aval79=aval;
  keep id paramcd aval23 aval49 aval79;
run;

topic Re: Hashing: Left join on subsets of identical dataset (improve working example) in SAS Programming

Hashing: Left join on subsets of identical dataset (improve working example)

Re: Hashing: Left join on subsets of identical dataset (improve working example)

Re: Hashing: Left join on subsets of identical dataset (improve working example)

Re: Hashing: Left join on subsets of identical dataset (improve working example)