Solved: Re: Keeping only highest values in hash table

Georg_UPB · Posted 12-17-2013 11:32 AM

Hi everyone!

When going through dataset "have" shown in the example below I only want to keep the three highest scores:

"want1": three entries with the highest scores would be nice

"want2": entries with the three highest scores would be even nicer

As soon as I try to accomplish this my code gets pretty complicated and its not really working...

As I'm new to hash tables, I have a feeling that there is an easy solution that I'm just missing... Any help is appreciated!!

Data have;

Input Name $ Score;

Datalines;

A.B. 6

C.D. 10

E.F. 5

G.H. 3

I.J. 10

K.L. 9

M.N 3

Run;

Data _Null_;

Set have End=done;

If _n_=1 Then Do;

Declare Hash ht(Ordered:'d');

ht.DefineKey('Score');

ht.DefineData('Name','Score');

ht.DefineDone();

End;

If ht.Num_Items<3 Then Do;

ht.Add();

End;

Else Do;

/* Need something like this:

If ht.Last/Score < Score Then Do;

ht.RemoveLast;

ht.Add;

End;

*/

End;

If done Then ht.Output(Dataset:'So_far');

Run;

Data want1;

Input Name $ Score;

Datalines;

C.D. 10

I.J. 10

K.L. 9

Run;

Data want2;

Input Name $ Score;

Datalines;

C.D. 10

I.J. 10

K.L. 9

A.B. 6

Run;

Haikuo · Posted 12-17-2013 04:05 PM

I concur 's sentiment and comment. From the nature of a Hash table, as long as your RAM can hold it, the bigger the table, the better benefits you get from it. Only a few highest scores in the a Hash table seems to be an overkill of using it. But without knowing the whole picture, it is too soon to call. Technical speaking it is of course doable using Hash table:

Data have;

Input Name $ Score;

Datalines;

A.B. 6
C.D. 10
E.F. 5
G.H. 3
I.J. 10
K.L. 9
M.N 3

Run;

/*Want1*/

Data _Null_;

Set have End=done;

If _n_=1 Then Do;

Declare Hash ht(Ordered:'d', multidata:'y');

ht.DefineKey('_Score');

ht.DefineData('_Name','_Score');

ht.DefineDone();

declare hiter hi('ht');

End;

if _n_ <= 3 then do; _name=name;_score=score; rc=ht.add();end;

else do;

rc=hi.last();

if score > _score then do;

_ss=_score;

_score=score;

_name=name;

rc=ht.add();

rc=hi.first();

rc=ht.remove(key:_ss);

end;

If done Then ht.Output(Dataset:'ties_count');

Run;

/*Want2*/

Data _Null_;

Set have End=done;

If _n_=1 Then Do;

Declare Hash ht(Ordered:'d', multidata:'y');

ht.DefineKey('_Score');

ht.DefineData('_Name','_Score');

ht.DefineDone();

declare hiter hi('ht');

End;

if _n_ <= 3 then do; _name=name;_score=score; rc=ht.add();end;

else do;

if ht.find(key:score) = 0 then do; _name=name;_score=score; rc=ht.add();end;

else do;

rc=hi.last();

if score > _score then do;

_ss=_score;

_score=score;

_name=name;

rc=ht.add();

rc=hi.first();

rc=ht.remove(key:_ss);

end;

If done Then ht.Output(Dataset:'ties_not_count');

Run;

Haikuo

View solution in original post

AhmedAl_Attar · Posted 12-17-2013 12:19 PM

Do you really need hash for this?

I ask, because you can easily do it using Proc Rank.

proc rank data=have out=want descending ties=dense;

var score;

ranks scoreRank;

run;

proc sort data=want;

where scoreRank <= 3;

by descending score ;

run;

proc print data=want; run;

Georg_UPB · Posted 12-17-2013 12:23 PM

Actually, dataset "have" is so large, that I'm going through it via "File" & "Input" without creating output.

So, unfortunately, "Proc Rank" is not an option...

AhmedAl_Attar · Posted 12-17-2013 01:27 PM

How about this?

Data have/view=have;

infile "My Documents/sample_have.txt";

Input Name $ Score;

Run;

proc rank data=have out=want(where=(scoreRank <= 3)) descending ties=dense;

var score;

ranks scoreRank;

run;

Georg_UPB · Posted 12-17-2013 01:45 PM

Thank you very much for this solution!

Since I became really fond of hash tables, I'll just wait a little longer to see if there's also a nice solution that takes advantage of them before I mark your answer and close the topic.

The real dataset "have" requires some more grooming that can be easily done using hash tables... There's only the problem of extracting data with certain features (highest values etc.) which I thought could be accomplished by having the hash table sortered in descendant order.

Astounding · Posted 12-17-2013 02:04 PM

The way you have described it, it seems like you need to find the 3 highest scores, and the names associated with those scores. I would think that arrays would be a much better bet. For example:

data want;

array scores {3} score1-score3;

array names {3} $8 name1-name3;

do until (done);

set have end=done;

if score > score1 then do;

score3=score2;

score2=score1;

score1=score;

name3=name2;

name2=name1;

name1=name;

end;

else if score > score2 then do;

score3=score2;

score2=score;

name3=name2;

name2=name;

end;

else if score > score3 then do;

score3=score;

name3=name;

end;

output;

stop;

keep name1-name3 score1-score3;

run;

Because of the DO loop, there is no need to retain any of the variables.

If you need this information for processing another data set, you could use _TEMPORARY_ arrays instead of permanent. _TEMPORARY_ array elements do not get erased.

You could in theory replace the OUTPUT and STOP statements with a SET statement that reads another data set and processes observations based on NAME1-NAME3 and SCORE1-SCORE3. In that case, however, you would have to make sure that either these variables are retained or that you switch to _TEMPORARY_ arrays.

I'm not sure if you were hoping to get practice with hash tables here instead. But I would think that arrays are at least a good approach to the problem you described.

Good luck.

Georg_UPB · Posted 12-17-2013 02:26 PM

Thank you very much!

Arrays are also a very nice solution. Only the fact that "have" has 96 variables might cause some trouble, but otherwise this should work nicely.

Hash tables are an ideal solution for the operations I need to perform on "have". It's all working perfectly fine and very fast...

...up to the point that I don't want to populate my hash table with ALL entries, but only want to keep the nth highest values regarding certain fields...

Astounding · Posted 12-17-2013 03:03 PM

OK, the plot always thickens. In that case, don't bother saving NAME1-NAME3. Instead, save OBSNO1-OBSNO3, the observation number from HAVE. It should be a simple matter afterwards to retrieve just those three observations using SET and POINT=. I'd need further details of how you intend to use those observations to consider whether it makes sense to load them into a hash table at that point.

Haikuo · Posted 12-17-2013 04:05 PM

I concur 's sentiment and comment. From the nature of a Hash table, as long as your RAM can hold it, the bigger the table, the better benefits you get from it. Only a few highest scores in the a Hash table seems to be an overkill of using it. But without knowing the whole picture, it is too soon to call. Technical speaking it is of course doable using Hash table:

Data have;

Input Name $ Score;

Datalines;

A.B. 6
C.D. 10
E.F. 5
G.H. 3
I.J. 10
K.L. 9
M.N 3

Run;

/*Want1*/

Data _Null_;

Set have End=done;

If _n_=1 Then Do;

Declare Hash ht(Ordered:'d', multidata:'y');

ht.DefineKey('_Score');

ht.DefineData('_Name','_Score');

ht.DefineDone();

declare hiter hi('ht');

End;

if _n_ <= 3 then do; _name=name;_score=score; rc=ht.add();end;

else do;

rc=hi.last();

if score > _score then do;

_ss=_score;

_score=score;

_name=name;

rc=ht.add();

rc=hi.first();

rc=ht.remove(key:_ss);

end;

If done Then ht.Output(Dataset:'ties_count');

Run;

/*Want2*/

Data _Null_;

Set have End=done;

If _n_=1 Then Do;

Declare Hash ht(Ordered:'d', multidata:'y');

ht.DefineKey('_Score');

ht.DefineData('_Name','_Score');

ht.DefineDone();

declare hiter hi('ht');

End;

if _n_ <= 3 then do; _name=name;_score=score; rc=ht.add();end;

else do;

if ht.find(key:score) = 0 then do; _name=name;_score=score; rc=ht.add();end;

else do;

rc=hi.last();

if score > _score then do;

_ss=_score;

_score=score;

_name=name;

rc=ht.add();

rc=hi.first();

rc=ht.remove(key:_ss);

end;

If done Then ht.Output(Dataset:'ties_not_count');

Run;

Haikuo

Georg_UPB · Posted 12-18-2013 06:52 AM

Thank you very much!

That's exactly what I was looking for!

Vince28_Statcan · Posted 12-17-2013 04:07 PM

Solution using hash tables power to ease things up...

You've gone the right way using ordering, however, you misunderstood how your datarows are appended to the hash object when you don't use a dataset: statement. You would need to use a hash iterator object to retrieve first or last 3 hash object items after all of it has been loaded and sorted.

Here is a simple hash solution that still uses data step to output. You could use hash object output but you would need to create 2 or 3 hashes for simplicity or do some hardcore .remove() implementation.

Data have;
Input Name $ Score;
Datalines;
A.B. 6
C.D. 10
E.F. 5
G.H. 3
I.J. 10
K.L. 9
M.N 3
;
Run;

data want1 want2;
    if 0 then set have; /* avoid length statements */
    If _n_=1 Then Do;
        Declare Hash ht(dataset: 'have', Ordered:'d', multidata: 'Y'); /* load*/
        Declare Hiter hi('ht'); /* hash iterator object declared on hash object HT */
        ht.DefineKey('Score');
        ht.DefineData('Name','Score');
        ht.DefineDone();
    End;

     /* to achieve want1 and want2 logic */
     i=0; j=0;
     do while (j<3);
         hi.next(); /* loop through sorted data using iterator. If iterator has not been used before, then hi.next() is equivalent to hi.first() */
          if i<3 then output want1;
          output want2;
          i+1;
          if last_score NE score then j+1;
          last_score=score;
     end;
     drop i j last_score;
run;

I've put the interesting hash elements in bold the rest is similar logic that you would hard code if you used sorted data. Please note the use of multidata:'y' option as otherwise only CD or IJ could exist within the hash object as their KEY is identical.

Georg_UPB · Posted 12-18-2013 07:00 AM

Thanks for pointing out this alternative!

Considering the size of the "real" have-dataset, I'm trying to populate the hash table with as few data as possible.

Vince28_Statcan · Posted 12-18-2013 08:05 AM

I see. Well obviously, my solution causes a single pass on the data instead of two cutting I/O by ~half (well only input but still a large chunk of your total I/O). However loading a full hash object isn't always manageable.

It is possible however to manage to output each want1 and want2 in a single read of the data using 3 hash objects each with at most 4 distinct key values at any given time so reasonably little total memory space saving a giant input operation. I will try to come provide the additionnal example during my lunch break as the logic is slightly less natural at least for my hash habits to spare a small break for it. It won't necessarily be a giant breakthrough but if I think it's a worthwhile self-learning I would assume it can benefit to others as well!

Vincent

Vince28_Statcan · Posted 12-18-2013 01:39 PM

As mentionned, here's an alternative that will produce both output within a single pass on the data effectively saving you a large I/O since the reason why you didn't load the entire table in a hash was for memory issue.

data _null_;
    if 0 then set have; /* avoid length statements */
    If _n_=1 Then Do;
        Declare Hash ht1(Ordered:'d', multidata: 'Y', hashexp: 2); /* load*/
        Declare Hiter hi1('ht1'); /* hash iterator object declared on hash object HT */
        ht1.DefineKey('Score');
        ht1.DefineData('Name','Score');
        ht1.DefineDone();

declare hash ht2(ordered:'d', multidata: 'Y', hashexp: 2);
declare hiter hi2('ht2');
        ht2.DefineKey('Score');
        ht2.DefineData('Name','Score');
        ht2.DefineDone();

declare hash kt2(ordered: 'd', hashexp: 2);
declare hiter ki2('kt2');
        kt2.DefineKey('Score');
        kt2.DefineData('Score');
        kt2.DefineDone();
    End;

set have end=done;
ht1.add();
kt2.replace();
ht2.add();

if ht1.num_items>3 then do;
hi1.last();
rc=hi1.next();
ht1.remove();
end;

if kt2.num_items>3 then do;
ki2.last();
rc=ki2.next();
kt2.remove();
ht2.remove();
end;

if done then do;
ht1.output(dataset: "want1");
ht2.output(dataset: "want2");
end;
run;

On a side note, I was very disappointed to realize that there is no documented method or trick to set the hash iterator pointer to NULL after using it to retrieve the last or first item for the purpose of removing data. Anyone knows a trick that is more straight forward, especially when it comes to reading the code, other than having to use next() after a last() to clear the pointer without changing PDV values since you ought to reuse the same keys? The approach above saves from having to declare and call missing every data step loop on a full subset of variables with the same name as the set itself and it solely uses the sortation within the hash and an iterator to remove data but because there exists no way to set an iterator pointer to NULL, I had to waste PDV space and a bunch of useless iteration for RC=iterator.next() as without the RC=, it creates an error (data step still processes). Plus, it does not feel natural to have to tell a pointer on the known last element of a sequence to point to the next element so as to clear it from locking the removal of data/key elements.

Anyway end of the rant about iterator having no method to change the pointer position without affecting PDV other than next on a last or prev on a first.

Georg_UPB · Posted 12-19-2013 01:01 PM

Thank you very much for taking the time! I learnt a lot from your examples!

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away