BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Georg_UPB
Fluorite | Level 6

Hi everyone!

When going through dataset "have" shown in the example below I only want to keep the three highest scores:

     "want1": three entries with the highest scores would be nice

     "want2": entries with the three highest scores would be even nicer

As soon as I try to accomplish this my code gets pretty complicated and its not really working...

As I'm new to hash tables, I have a feeling that there is an easy solution that I'm just missing... Any help is appreciated!!

Data have;

    Input Name $ Score;

Datalines;

A.B. 6

C.D. 10

E.F. 5

G.H. 3

I.J. 10

K.L. 9

M.N 3

Run;

Data _Null_;

    Set have End=done;

    If _n_=1 Then Do;

        Declare Hash ht(Ordered:'d');

        ht.DefineKey('Score');

        ht.DefineData('Name','Score');

        ht.DefineDone();

    End;

    If ht.Num_Items<3 Then Do;

        ht.Add();

    End;

    Else Do;

        /* Need something like this:

            If ht.Last/Score < Score Then Do;

                ht.RemoveLast;

                ht.Add;

            End;

        */

    End;

    If done Then ht.Output(Dataset:'So_far');

Run;

Data want1;

    Input Name $ Score;

Datalines;

C.D. 10

I.J. 10

K.L. 9

Run;

Data want2;

    Input Name $ Score;

Datalines;

C.D. 10

I.J. 10

K.L. 9

A.B. 6

Run;

1 ACCEPTED SOLUTION

Accepted Solutions
Haikuo
Onyx | Level 15

I concur 's sentiment and comment. From the nature of a Hash table, as long as your RAM can hold it, the bigger the table, the better  benefits you get from it. Only a few highest scores in the a Hash table seems to be an overkill of using it. But without knowing the whole picture, it is too soon to call. Technical speaking it is of course doable using Hash table:

Data have;

    Input Name $ Score;

Datalines;

  1. A.B. 6
  2. C.D. 10
  3. E.F. 5
  4. G.H. 3
  5. I.J. 10
  6. K.L. 9
  7. M.N 3

Run;

/*Want1*/

Data _Null_;

    Set have End=done;

    If _n_=1 Then Do;

        Declare Hash ht(Ordered:'d', multidata:'y');

        ht.DefineKey('_Score');

        ht.DefineData('_Name','_Score');

        ht.DefineDone();

            declare hiter hi('ht');

    End;

if _n_ <= 3 then do; _name=name;_score=score; rc=ht.add();end;

else do;

  rc=hi.last();

  if score > _score then do;

    _ss=_score;

      _score=score;

      _name=name;

      rc=ht.add();

      rc=hi.first();

      rc=ht.remove(key:_ss);

   end;

end;

    If done Then ht.Output(Dataset:'ties_count');

Run;

/*Want2*/

Data _Null_;

    Set have End=done;

    If _n_=1 Then Do;

        Declare Hash ht(Ordered:'d', multidata:'y');

        ht.DefineKey('_Score');

        ht.DefineData('_Name','_Score');

        ht.DefineDone();

            declare hiter hi('ht');

    End;

if _n_ <= 3 then do; _name=name;_score=score; rc=ht.add();end;

else do;

  if ht.find(key:score) = 0 then do; _name=name;_score=score; rc=ht.add();end;

  else do;

  rc=hi.last();

  if score > _score then do;

    _ss=_score;

      _score=score;

      _name=name;

      rc=ht.add();

      rc=hi.first();

      rc=ht.remove(key:_ss);

   end;

end;

end;

    If done Then ht.Output(Dataset:'ties_not_count');

Run;

Haikuo

View solution in original post

21 REPLIES 21
AhmedAl_Attar
Rhodochrosite | Level 12

Do you really need hash for this?

I ask, because you can easily do it using Proc Rank.

proc rank data=have out=want descending ties=dense;

     var score;

   ranks scoreRank;

run;

proc sort data=want;

    where scoreRank <= 3;

    by descending score ;

run;

proc print data=want; run;

Georg_UPB
Fluorite | Level 6

Actually, dataset "have" is so large, that I'm going through it via "File" & "Input" without creating output.

So, unfortunately, "Proc Rank" is not an option...

AhmedAl_Attar
Rhodochrosite | Level 12

How about this?

Data have/view=have;

    infile "My Documents/sample_have.txt";

    Input Name $ Score;

Run;

proc rank data=have out=want(where=(scoreRank <= 3)) descending ties=dense;

     var score;

   ranks scoreRank;

run;

Georg_UPB
Fluorite | Level 6

Thank you very much for this solution!

Since I became really fond of hash tables, I'll just wait a little longer to see if there's also a nice solution that takes advantage of them before I mark your answer and close the topic.

The real dataset "have" requires some more grooming that can be easily done using hash tables... There's only the problem of extracting data with certain features (highest values etc.) which I thought could be accomplished by having the hash table sortered in descendant order.

Astounding
PROC Star

The way you have described it, it seems like you need to find the 3 highest scores, and the names associated with those scores.  I would think that arrays would be a much better bet.  For example:

data want;

   array scores {3} score1-score3;

   array names {3} $8 name1-name3;

   do until (done);

      set have end=done;

      if score > score1 then do;

         score3=score2;

         score2=score1;

         score1=score;

         name3=name2;

         name2=name1;

         name1=name;

      end;

      else if score > score2 then do;

          score3=score2;

          score2=score;

          name3=name2;

          name2=name;

      end;

      else if score > score3 then do;

          score3=score;

          name3=name;

      end;

   end;

   output;

   stop;

   keep name1-name3 score1-score3;

run;

Because of the DO loop, there is no need to retain any of the variables.

If you need this information for processing another data set, you could use _TEMPORARY_ arrays instead of permanent.  _TEMPORARY_ array elements do not get erased.

You could in theory replace the OUTPUT and STOP statements with a SET statement that reads another data set and processes observations based on NAME1-NAME3 and SCORE1-SCORE3.  In that case, however, you would have to make sure that either these variables are retained or that you switch to _TEMPORARY_ arrays.

I'm not sure if you were hoping to get practice with hash tables here instead.  But I would think that arrays are at least a good approach to the problem you described.

Good luck.

Georg_UPB
Fluorite | Level 6

Thank you very much!

Arrays are also a very nice solution. Only the fact that "have" has 96 variables might cause some trouble, but otherwise this should work nicely.

Hash tables are an ideal solution for the operations I need to perform on "have". It's all working perfectly fine and very fast...

...up to the point that I don't want to populate my hash table with ALL entries, but only want to keep the nth highest values regarding certain fields...

Astounding
PROC Star

OK, the plot always thickens.  In that case, don't bother saving NAME1-NAME3.  Instead, save OBSNO1-OBSNO3, the observation number from HAVE.  It should be a simple matter afterwards to retrieve just those three observations using SET and POINT=.  I'd need further details of how you intend to use those observations to consider whether it makes sense to load them into a hash table at that point.

Haikuo
Onyx | Level 15

I concur 's sentiment and comment. From the nature of a Hash table, as long as your RAM can hold it, the bigger the table, the better  benefits you get from it. Only a few highest scores in the a Hash table seems to be an overkill of using it. But without knowing the whole picture, it is too soon to call. Technical speaking it is of course doable using Hash table:

Data have;

    Input Name $ Score;

Datalines;

  1. A.B. 6
  2. C.D. 10
  3. E.F. 5
  4. G.H. 3
  5. I.J. 10
  6. K.L. 9
  7. M.N 3

Run;

/*Want1*/

Data _Null_;

    Set have End=done;

    If _n_=1 Then Do;

        Declare Hash ht(Ordered:'d', multidata:'y');

        ht.DefineKey('_Score');

        ht.DefineData('_Name','_Score');

        ht.DefineDone();

            declare hiter hi('ht');

    End;

if _n_ <= 3 then do; _name=name;_score=score; rc=ht.add();end;

else do;

  rc=hi.last();

  if score > _score then do;

    _ss=_score;

      _score=score;

      _name=name;

      rc=ht.add();

      rc=hi.first();

      rc=ht.remove(key:_ss);

   end;

end;

    If done Then ht.Output(Dataset:'ties_count');

Run;

/*Want2*/

Data _Null_;

    Set have End=done;

    If _n_=1 Then Do;

        Declare Hash ht(Ordered:'d', multidata:'y');

        ht.DefineKey('_Score');

        ht.DefineData('_Name','_Score');

        ht.DefineDone();

            declare hiter hi('ht');

    End;

if _n_ <= 3 then do; _name=name;_score=score; rc=ht.add();end;

else do;

  if ht.find(key:score) = 0 then do; _name=name;_score=score; rc=ht.add();end;

  else do;

  rc=hi.last();

  if score > _score then do;

    _ss=_score;

      _score=score;

      _name=name;

      rc=ht.add();

      rc=hi.first();

      rc=ht.remove(key:_ss);

   end;

end;

end;

    If done Then ht.Output(Dataset:'ties_not_count');

Run;

Haikuo

Georg_UPB
Fluorite | Level 6

Thank you very much!

That's exactly what I was looking for!

Vince28_Statcan
Quartz | Level 8

Solution using hash tables power to ease things up...

You've gone the right way using ordering, however, you misunderstood how your datarows are appended to the hash object when you don't use a dataset: statement. You would need to use a hash iterator object to retrieve first or last 3 hash object items after all of it has been loaded and sorted.

Here is a simple hash solution that still uses data step to output. You could use hash object output but you would need to create 2 or 3 hashes for simplicity or do some hardcore .remove() implementation.

Data have;
    Input Name $ Score;
Datalines;
A.B. 6
C.D. 10
E.F. 5
G.H. 3
I.J. 10
K.L. 9
M.N 3
;
Run;


data want1 want2;
    if 0 then set have; /* avoid length statements */
    If _n_=1 Then Do;
        Declare Hash ht(dataset: 'have', Ordered:'d', multidata: 'Y'); /* load*/
        Declare Hiter hi('ht'); /* hash iterator object declared on hash object HT */
        ht.DefineKey('Score');
        ht.DefineData('Name','Score');
        ht.DefineDone();
    End;


     /* to achieve want1 and want2 logic */
     i=0; j=0;
     do while (j<3);
         hi.next(); /* loop through sorted data using iterator. If iterator has not been used before, then hi.next() is equivalent to hi.first() */
          if i<3 then output want1;
          output want2;
          i+1;
          if last_score NE score then j+1;
          last_score=score;         
     end;
     drop i j last_score;
run;

I've put the interesting hash elements in bold the rest is similar logic that you would hard code if you used sorted data. Please note the use of multidata:'y' option as otherwise only CD or IJ could exist within the hash object as their KEY is identical.

Georg_UPB
Fluorite | Level 6

Thanks for pointing out this alternative!

Considering the size of the "real" have-dataset, I'm trying to populate the hash table with as few data as possible.

Vince28_Statcan
Quartz | Level 8

I see. Well obviously, my solution causes a single pass on the data instead of two cutting I/O by ~half (well only input but still a large chunk of your total I/O). However loading a full hash object isn't always manageable.

It is possible however to manage to output each want1 and want2 in a single read of the data using 3 hash objects each with at most 4 distinct key values at any given time so reasonably little total memory space saving a giant input operation. I will try to come provide the additionnal example during my lunch break as the logic is slightly less natural at least for my hash habits to spare a small break for it. It won't necessarily be a giant breakthrough but if I think it's a worthwhile self-learning I would assume it can benefit to others as well!

Vincent

Vince28_Statcan
Quartz | Level 8

As mentionned, here's an alternative that will produce both output within a single pass on the data effectively saving you a large I/O since the reason why you didn't load the entire table in a hash was for memory issue.

data _null_;
    if 0 then set have; /* avoid length statements */
    If _n_=1 Then Do;
        Declare Hash ht1(Ordered:'d', multidata: 'Y', hashexp: 2); /* load*/
        Declare Hiter hi1('ht1'); /* hash iterator object declared on hash object HT */
        ht1.DefineKey('Score');
        ht1.DefineData('Name','Score');
        ht1.DefineDone();

  declare hash ht2(ordered:'d', multidata: 'Y', hashexp: 2);
  declare hiter hi2('ht2');
        ht2.DefineKey('Score');
        ht2.DefineData('Name','Score');
        ht2.DefineDone();

  declare hash kt2(ordered: 'd', hashexp: 2);
  declare hiter ki2('kt2');
        kt2.DefineKey('Score');
        kt2.DefineData('Score');
        kt2.DefineDone();
    End;

set have end=done;
ht1.add();
kt2.replace();
ht2.add();

if ht1.num_items>3 then do;
  hi1.last();
  rc=hi1.next();
  ht1.remove();
end;

if kt2.num_items>3 then do;
  ki2.last();
  rc=ki2.next();
  kt2.remove();
  ht2.remove();
end;

if done then do;
  ht1.output(dataset: "want1");
  ht2.output(dataset: "want2");
end;
run;

On a side note, I was very disappointed to realize that there is no documented method or trick to set the hash iterator pointer to NULL after using it to retrieve the last or first item for the purpose of removing data. Anyone knows a trick that is more straight forward, especially when it comes to reading the code, other than having to use next() after a last() to clear the pointer without changing PDV values since you ought to reuse the same keys? The approach above saves from having to declare and call missing every data step loop on a full subset of variables with the same name as the set itself and it solely uses the sortation within the hash and an iterator to remove data but because there exists no way to set an iterator pointer to NULL, I had to waste PDV space and a bunch of useless iteration for RC=iterator.next() as without the RC=, it creates an error (data step still processes). Plus, it does not feel natural to have to tell a pointer on the known last element of a sequence to point to the next element so as to clear it from locking the removal of data/key elements.

Anyway end of the rant about iterator having no method to change the pointer position without affecting PDV other than next on a last or prev on a first.

Georg_UPB
Fluorite | Level 6

Thank you very much for taking the time! I learnt a lot from your examples!

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 21 replies
  • 2518 views
  • 6 likes
  • 5 in conversation