Haikuo
Onyx | Level 15

I was quite puzzled by your comments regarding "I/O" issues until I saw this post of yours. I was thinking the OP wanted either 'nice' or 'nicer', but not both, hence my solution. If both are wanted, I would opt for using my /*Want2*/ with a little tweak, such as using the removedup() method instead of remove(), and then do a subset using "obs=3" to get /*want1*/, paying little to no extra 'I/O'. The benefits are obvious: 1. It avoids strenuous hash coding with multiple hash objects. 2. The remove() method is not quite ready for duplicate keys in this case; try the following data and you will see what I mean.

Data have;
  Input Name $ Score;
Datalines;
A.B. 11
C.D. 9
C.D. 9
C.D. 9
C.D. 9
E.F. 5
G.H. 3
M.N 3
I.J. 10
I.J. 10
I.J. 10
I.J. 10
;
Run;
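In case it helps other readers see what Haikuo means here: on a multidata hash, the remove() method deletes every item that shares the key, whereas removedup() deletes only the item most recently retrieved by find()/find_next(). A minimal, hypothetical sketch (object and variable names are illustrative, not from either poster's code):

Data _null_;
    Length Name $8 Score 8;
    Declare Hash h(Multidata:'y');
    h.DefineKey('Name');
    h.DefineData('Name','Score');
    h.DefineDone();
    Name='C.D.'; Score=9; rc=h.Add();
    Name='C.D.'; Score=9; rc=h.Add();
    Name='C.D.'; Score=9; rc=h.Add();
    rc=h.Remove(Key:'C.D.');    /* Remove() drops ALL items sharing the key */
    n=h.Num_Items; Put 'After Remove(): ' n=;
    Name='I.J.'; Score=10; rc=h.Add();
    Name='I.J.'; Score=10; rc=h.Add();
    rc=h.Find(Key:'I.J.');      /* set the current item ...                 */
    rc=h.RemoveDup();           /* ... and delete just that one duplicate   */
    n=h.Num_Items; Put 'After RemoveDup(): ' n=;
Run;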

Haikuo

Vince28_Statcan
Quartz | Level 8

I did indeed simply assume that he might want to achieve both, not either/or, hence programming both within a single data step to save the longest processing element, which is reading the data through the SET statement.

I don't think the output method supports dataset options like OBS=3, so you'd need to replace data _null_ with data want1 and use the iterator on the 2nd hash object to retrieve just the top three values. In fact, this would also solve the bug you've pointed out with my want1 output. However, I still do think there's a lot to learn playing around with hashing and the 2-hash solution for want2. It also solves the bug revealed by your own dataset for your want2 output, detailed below.
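(For what it's worth, that suggestion could look roughly like the fragment below; hi2 and done are placeholder names for the iterator on the second, ordered hash and the END= flag from the original step, which is not shown here.)

Data want1;
    /* ... same hash setup and loading as before ... */
    If done Then Do i=1 To 3;           /* top three via the iterator */
        If hi2.Next() ne 0 Then Leave;  /* fewer than three items     */
        Output;
    End;
Run;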

The issue with not using a double hash object for WANT2 is that there is no NUM_KEYS property on hash objects for multidata. The dataset you've exemplified above also fails for your want2, because the threshold for adding new keys is _N_=3, so if there are only 1 or 2 distinct keys across the first 3 lines of the input dataset, the output will only have one or two keys as well (and however many records share those keys). It means that you have to consider the odd cases of your data in the "initialization phase" of the object. For most scenarios, and in particular the one exemplified here, it is fairly simple: you can just create a new counter variable, increment it if find() fails on the hash prior to addition, and then replace _N_ with that new counter variable instead.
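(A minimal sketch of that counter fix, with illustrative names - ht stands for the single multidata hash of the want2 approach:)

If ht.Find() ne 0 Then Do;       /* find() fails: a brand-new key        */
    n_keys + 1;                  /* sum statement, so n_keys is retained */
    If n_keys <= 3 Then rc=ht.Add();  /* replaces the _N_ threshold      */
End;
Else rc=ht.Add();                /* existing key: keep adding duplicates */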

The main reason I went with the other approach is that it is completely independent of initialization. It uses NUM_ITEMS on the "KT" (key table) hash as a means to mimic the missing NUM_KEYS property on the object I really care about. If a NUM_KEYS property existed, it could have been done in a single hash as well, still using the sorting power of the hash to avoid manually coding a bunch of additional variables to keep track of what to add/remove. You simply add to your hash object systematically and remove the last key (and all data associated with that key) whenever NUM_KEYS goes above your set threshold.

This is also one of the beauties of the hash object if you ever have to manipulate timestamp data from a system (like a voice recognition system, an electronic questionnaire, transaction records, etc.). You can use a hash of hashes to rebuild the sequence of events regardless of any existing sort order; the internal instances of the hashes act as BY-groups sorted by the timestamp, so you can iterate through, go back and forth if need be for decision logic, and output a single row, all within a single data step instead of a couple of sorts.
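(Sketched out, the KT idea looks roughly like the fragment below - HT is the multidata hash of interest, KT a second, non-multidata hash declared Ordered:'a' with Score as both key and data, so KT.NUM_ITEMS counts HT's distinct keys; the names and the &n_top. threshold are illustrative, not taken from the actual posted code.)

rc=ht.Add();                            /* always store the record        */
If kt.Check() ne 0 Then Do;             /* brand-new key                  */
    rc=kt.Add();
    If kt.Num_Items > &n_top. Then Do;  /* one distinct key too many      */
        rc=kt_hit.First();              /* lowest key, given Ordered:'a'  */
        drop_key=Score;                 /* remember which key to evict    */
        rc=kt_hit.Next();               /* move the iterator off that key */
        rc=kt.Remove(Key:drop_key);
        rc=ht.Remove(Key:drop_key);     /* drops all duplicate items too  */
    End;
End;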

Anyway, I'm digressing, because I did have to work with oddly sorted, massive call-system data not so long ago, and the hash knowledge I have now would have made my life so much easier back then (and I still have a lot of twists to learn).

But for all I know, even though the hash object has significantly improved in 9.2 according to all the documentation I could find, it is still missing quite a few tools:

1. A way to clear the hash iterator pointer so as to not lock remove() from happening

2. A hash iterator tool that removes the current element (not the entire key, just the current data item - similar to removedup() but less tedious to use - it would also be extremely useful for traversing the hash object and removing rows after sorting or other constructs have happened)

3. A NUM_KEYS property

Some might exist and are undocumented but I haven't found any paper discussing any such issues.

Nonetheless, very interesting discussion so far

Georg_UPB
Fluorite | Level 6

It turned out that we actually needed just the n highest scores ("want2"), but we needed them for every name (or ID). This is the solution I came up with. Unfortunately, it needs not one but two additional tables, but it works and it's really fast... Any suggestions are always welcome, though!

Thanks again for your help!

%Let n_highest=3;

Data have;
    Input ID $ Score;
Datalines;
A.B. 10
A.B. 10
A.B. 5
A.B. 8
A.B. 10
A.B. 7
K.L. 9
K.L. 2
K.L. 9
;
Run;

Data _null_;
    If 0 Then Set have;
    If _n_=1 Then Do;
        Declare Hash extra_tab1(Ordered:'a');
        extra_tab1.DefineKey('ID');
        extra_tab1.DefineData('ID','lowest','no');
        extra_tab1.DefineDone();
        Declare Hash extra_tab2(Ordered:'a');
        extra_tab2.DefineKey('ID','Score');
        extra_tab2.DefineData('ID','Score');
        extra_tab2.DefineDone();
        Declare Hiter extra_tab2_hit('extra_tab2');
        Declare Hash tab(Ordered:'a',Multidata:'y');
        tab.DefineKey('ID','Score');
        tab.DefineData('ID','Score');
        tab.DefineDone();
    End;
    Set have End=Done;
    If extra_tab1.Find() ne 0 Then Do;
        no=1;
        lowest=Score;
        extra_tab1.Add();
        extra_tab2.Add();
        tab.Add();
    End;
    Else If no<&n_highest. Then Do;
        If extra_tab2.Find() ne 0 Then Do;
            no=no+1;
            lowest=Min(lowest,Score);
            extra_tab1.Replace();
            extra_tab2.Add();
        End;
        tab.Add();
    End;
    Else If Score>=lowest Then Do;
        tab.Add();
        If extra_tab2.Find() ne 0 Then Do;
            extra_tab2.Add();
            extra_tab2_hit.SetCur(Key:ID,Key:lowest);
            extra_tab2_hit.Next();
            _lowest_new=Score;
            extra_tab2.Remove(Key:ID,Key:lowest);
            tab.Remove(Key:ID,Key:lowest);
            lowest=_lowest_new;
            extra_tab1.Replace();
        End;
    End;
    If Done Then Do;
        extra_tab1.Output(Dataset:"extra_tab1");
        extra_tab2.Output(Dataset:"extra_tab2");
        tab.Output(Dataset:"want2");
    End;
Run;

Vince28_Statcan
Quartz | Level 8

I believe it could be reduced in the number of hashes, statements, and logic by using a hash of hashes, but that adds an additional layer of complexity for code transferability (e.g. if you leave your current position and someone has to pick up from it, and he's not already familiar with the hash object, he will struggle far more to figure out what the code does with a hash of hashes than with the current implementation).

I'll gladly take a stab at it and provide the code if you request it. But otherwise I'll assume it is not needed. I don't think it would be significantly faster than the current implementation either.

I was actually amazed when I first read Black Belt Hashigana by Paul Dorfman about hash of hashes.

Vince

*edit* In practice, there would actually be n+1 hash objects, where n is the number of distinct IDs, but programmatically speaking they would just be new instances of a single hash, so only 2 hash objects need to be defined. One uses the current approach of instantiating upon declaration (followed by (), empty or with options like dataset: or ordered:). The other instances would be built through a loop. The conditional logic would also be simpler with a hash of hashes. Only the gist of understanding how multiple instances of a hash object, all with the same name, are handled is what would make the code difficult to understand.

*edit* Furthermore, if you want to use hash objects with complex conditional searches and/or with reduced memory usage, a hash of hashes allows you to keep a by-variable in memory only once instead of in however many multidata rows exist. It is really only worthwhile if you have big multidata groups and/or may want to search on different keys depending on the success of the first key search.

Georg_UPB
Fluorite | Level 6

I'm always on the lookout and would highly appreciate seeing another solution!

Vince28_Statcan
Quartz | Level 8

Well, I thought this would take me forever, but I somehow nailed it on the first pass, although I might be overlooking something. Anyway, here's the solution - sorry for the copy/paste clutter. This may not be the absolute best case to demonstrate a hash of hashes, but it worked out alright nonetheless.

In the event that your data is already sorted by ID but not subsequently by SCORE (if it were sorted by both, first. logic and a retained counter would be far more efficient than hashing), it is possible to simplify this a lot by constantly reusing the same inner/hinner pair, emptying it with .clear() (or with .delete() and a new declare) for each by-group, managed with first. and last.. That would obviously reduce memory consumption significantly, since you only keep a small subset of the data in the hash and output it through a loop with the hiterator upon last.ID-type logic.
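(Under those assumptions - HAVE sorted by ID - the reusable-inner-hash variant might be sketched like this; it keeps the &n_highest. highest distinct scores per ID and, for brevity, ignores the duplicate-record handling of the full solution below:)

Data want2;
    If _n_=1 Then Do;
        Declare Hash inner(Ordered:'d');   /* scores descending   */
        inner.DefineKey('Score');
        inner.DefineData('Score');
        inner.DefineDone();
        Declare Hiter hi('inner');
    End;
    Set have;
    By ID;                       /* requires HAVE sorted by ID    */
    rc=inner.Replace();          /* collect this ID's scores      */
    If last.ID Then Do;
        n=0;
        Do While (hi.Next()=0 And n<&n_highest.);
            n+1;
            Output;              /* the top scores, descending    */
        End;
        rc=hi.First();           /* park the pointer ...          */
        rc=hi.Prev();            /* ... then release it, so that  */
        rc=inner.Clear();        /* Clear() can empty the hash    */
    End;
    Drop rc n;
Run;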

As for the rationale for why it can be worthwhile to learn hashing hashes: if you have data with many data points where a large subset of the variables retain the same values, a hash of hashes lets you allocate memory only once for all of that data in the outer hash, and then only the varying portion, however small, n times in the inner hash object.

For instance, suppose you had pacemaker data (bear with me here, I don't work with health statistics, but it's the first thing that came to mind) with millions of timestamps and some data elements about vital signs provided by the pacemaker at each timestamp, but you also needed to carry tons of invariants or near-invariants (age, sex, marital status, working status, etc.) as well as some slightly more variable data, like medical visits and the additional health measures taken at each visit. You could play with 3 hashes: the outer one holding the age/sex/etc., the middle one the medical visit data, and the inner one the timestamps plus pacemaker data. So you might have 1M timestamps in total for a single person, but you only use memory once for all of his invariants, say 12 times (once a year) for his near-invariants, and then the 1M timestamps with just a few variables taking up the bulk of the memory.

On top of that, this allows you to do some funky data manipulation, as you have 3 layers of searches. For instance, if you have twins that share a specific twin_id, you might be interested in adding a 4th layer of hashing with just the twin ID as key and data, looping over twin pairs to track events, and then using the fourth (innermost) hash object to see whether the twin had a similar event shortly before or after the timestamp where it occurred for the first twin. Anyway, I'm digressing here, but even though hashes of hashes are quite complex to code, and especially hard to transfer or exchange with colleagues since hashing is still foreign ground for most SAS end-users, there are many niches left to be explored where hash objects, and in particular the capacity to store other objects as data in hash objects, can significantly improve quality of life.

%Let n_highest=3;
%let hashexp=%sysfunc(ceil(%sysfunc(sqrt(&n_highest.))));

Data have;
    Input ID $ Score;
Datalines;
A.B. 10
A.B. 10
A.B. 5
A.B. 8
A.B. 10
A.B. 7
A.B. 23
A.B. 10
K.L. 9
K.L. 12
K.L. 11
K.L. 11
K.L. 11
K.L. 2
K.L. 9
K.L. 7
;
Run;

data want2;
    length id $8 score 8 counter 8;
    if _n_=1 then do;
        declare hash inner;   /* not yet instantiated */
        declare hiter hinner; /* not yet instantiated */

        declare hash outer(ordered:'a', multidata:'n');
        declare hiter houter('outer'); /* hash iterator declared on the outer hash object */
        outer.DefineKey('ID');
        outer.DefineData('ID','inner','hinner');
        outer.DefineDone();
    end;

    set have end=last;

    if outer.find() NE 0 then do; /* ID does not exist: create its own new hash object and iterator to track scores */
        inner = _new_ hash(ordered:'y', multidata:'n', hashexp:&hashexp.); /* instantiating; multidata:'n' is the default, but stated for clarity - we use the counter variable instead of an additional hash to mimic a NUM_KEYS attribute */
        hinner = _new_ hiter('inner'); /* instantiating */
        inner.definekey('score');
        inner.definedata('score','counter');
        inner.definedone();

        outer.add();  /* add the ID and the related inner objects to the outer hash object */
        counter=1;    /* initialize the inner counter variable */
        inner.add();  /* add the score and counter to the inner object */
    end;
    else do; /* an inner object already exists for this ID */
        if inner.find()=0 then do; /* the score exists: increment its counter variable and replace */
            counter=counter+1;
            inner.replace();
        end;
        else do; /* otherwise, add it and handle the possibility of 4 distinct scores now existing in the object */
            counter=1;
            inner.add();
            if inner.num_items>&n_highest. then do;
                hinner.first();    /* set the pointer to the first, i.e. lowest, score, given the sort order */
                rc=hinner.prev();  /* clear the pointer in a crude way so that the lowest key can be removed - this way the inner hiterators will also have null pointers for the output at the end, so there is no need to handle .first() instead of .next() specially to reinitialize */
                rc=inner.remove();
            end;
        end;
    end;

    /* The hash logic is built; now we need a method to output all of this data. */
    if last then do;
        do while(houter.next()=0);          /* loop over all IDs */
            do while(hinner.next()=0);      /* loop over all scores for that ID */
                do i=1 to counter;          /* use our counter variable to recreate the appropriate number of records for an ID/Score pair */
                    output;
                end;
            end;
        end;
    end;

    drop rc i counter;
run;

Cheers and wish you great holidays!

Vince

Georg_UPB
Fluorite | Level 6

I really do appreciate your help and time! Once again I learnt a lot (using a hash of hashes, clearing the pointer) and got some helpful suggestions. Also, a hash of hashes allows certain columns to be sorted in different orders (e.g., having the output table sorted by "ID" ascending and by "Score" descending).
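(Concretely, that last point just means choosing different Ordered: values for the outer declaration and the inner instances, e.g. mirroring the names in Vince's step:)

declare hash outer(ordered:'a');   /* IDs ascending                  */
/* ... key/data definitions as before ... */
inner = _new_ hash(ordered:'d');   /* Scores descending within an ID */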

Enjoy your holidays as well!
