DATA Step, Macro, Functions and more

Question about Hash Tables

Reply
Contributor
Posts: 26

Question about Hash Tables

Hello,

I'm loading sorted data set into hash. I need to have data in hash exactly in the same order as in data set.

My question is: Are the observations always stored in the hash in the same order as in data set?

Because data are not sorted by the hash key, I can't use tag- ordered in DECLARE statement.    

Respected Advisor
Posts: 3,799

Re: Question about Hash Tables

I ran this and it suggest the answer to your question is no.  Assuming I understand the question. 

data class;

   set sashelp.class;

   by name;

   run;

proc print;

   title 'sorted by name';

   run;

data _null_;

   if 0 then set class;

   declare hash h(dataset:'class');

   h.definekey('sex','age','name');

   h.definedata(all:'YES');

   h.definedone();

   h.output(dataset:'class2');

   stop;

   run;

proc print data=class2;

   title 'not now';

   run;


Makes me wonder what you are trying to do.

Contributor
Posts: 26

Re: Question about Hash Tables

Posted in reply to data_null__

Thanks for response. My question was not precise. I will try ask one more time Smiley Happy

I'm loading sorted data set into hash. Data set is sorted by 'created_date' column (descending).

Hash key is on column 'phone'. Vaules in 'phone' contains duplicates.

created_date   phone

10-10-2011      607

10-10-2011      122

09-10-2011      607

08-10-2011      607

07-10-2011      122

07-10-2011      607

I want to be sure that in hash always will be phone with the latest created_date.

Because data set is large I don't want to make any additional transformations before loading it into hash.

My question was not precise
Super User
Posts: 10,041

Re: Question about Hash Tables

Just as Null said.Answer is no. Hash will not keep this order for you. you need to make a index key by yourself.

define hash ha(ordered:'a');

ha.definekey('count');

...............

do until(last);

set temp;

count+1;

ha.add();

end;

Ksharp

Contributor
Posts: 26

Re: Question about Hash Tables

But when I make an index key on new column 'count' I can't join on column 'phone'.

As I said before I have table with columns: created_date, phone. It's a lookup table.

created_date   phone

10-10-2011      607

10-10-2011      122

09-10-2011      607

08-10-2011      607

07-10-2011      122

07-10-2011      607

...                   ...

I have also transaction table with column phone.

phone

529

122

697

....

I want to join these two tables using hash.

Is it possible to make index key on two columns created_date + phone (then could use tag 'ordered' ), but join (with find() method) only on one column (phone).

Super User
Super User
Posts: 7,060

Re: Question about Hash Tables

Doesn't sound like you want to make a hash that has all of the values of created_date, just the latest date for each value of phone?

Phone  Created_date

607 10-10-2011

122 10-10-2011

...

I am not that good with hash tables, but it seems like during the load of the hash from the source dataset you only want to add items when the key is new.

Super User
Posts: 10,041

Re: Question about Hash Tables

you need some special methods for multidata.  HAS_NEXT() and FIND_NEXT()  to retrieve these obs which has duplicated key.

define hash ha(ordered:'d');

ha.definekey('phone','date');

Valued Guide
Posts: 2,177

Re: Question about Hash Tables

part of a response from OP was

I want to be sure that in hash always will be phone with the latest created_date.

Because data set is large I don't want to make any additional transformations before loading it into hash.

This suggests only the latest "created_date" is needed.

If so,, use the loading stage to discover and update only the latest

No sorting, no indexing, just unique by phone and holding only the latest date

Super Contributor
Posts: 387

Question about Hash Tables

Hi,

A few points:

See http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a002585310.htm, in particular:

'NO' | 'N'      Data is returned in an undefined order, for the ordered tag (NO is the default).

Also, understand the difference between the ADD() and REPLACE() methods.  You might want to load the hash object with do until (eof); set whatever end=eof; h.add() construct.

You could use two hash objects and a hash iterator to load a hash object, let the first hash object do your desired processing, and the second hash object write your output dataset in desired sort order.  You'd have to test if this gives you the perfornance you want.

HTH...

Scott

Super User
Super User
Posts: 7,060

Re: Question about Hash Tables

There are probably ways of doing this better, but here is one way to read your example data into a hash that has unique records and use it to lookup the date for your transaction records.

data dates;

  input date ddmmyy10. number @@;

  format date ddmmyy10.;

cards;

10-10-2011 697 10-10-2011 122 09-10-2011 607 08-10-2011 607

07-10-2011 122 07-10-2011 607

run;

data trans;

  input number @@ ;

cards;

529 122 697 607 888

run;

data lookup ;

if _n_=1 then do;

  if 0 then set dates;

  declare Hash H();

  H.DefineKey ('number');

  H.DefineData ('date');

  H.DefineDone ();   /* complete hash table definition */

  do until(eof);

    set dates end=eof;

    if h.find() then h.add();

  end;

end;

set trans;

if h.find() then date=.;

put number= date=;

run;

number=529 date=.

number=122 date=10/10/2011

number=697 date=10/10/2011

number=607 date=09/10/2011

number=888 date=.

Contributor
Posts: 26

Re: Question about Hash Tables

Hi,

Thank you all for responses.

@Tom's solution is very usefull but i could be done with less coding.

data dates;

  input date ddmmyy10. number @@;

  format date ddmmyy10.;

cards;

10-10-2011 697 10-10-2011 122 09-10-2011 607 08-10-2011 607

07-10-2011 122 07-10-2011 607

run;

data trans;

  input number @@ ;

cards;

529 122 697 607 888

run;

data lookup ;

if _n_=1 then do;

  if 0 then set dates;

  declare Hash H(dataset: "dates");

  H.DefineKey ('number');

  H.DefineData ('date');

  H.DefineDone ();   /* complete hash table definition */

end;

set trans;

if h.find() then date=.;

put number= date=;

run;

According to the documentation (link from @Scott Bass):

If the data set contains duplicate keys, by default, the first instance is stored in the hash object; subsequent instances are ignored. To store the last instancein the hash object, use the DUPLICATE argument tag. The DUPLICATE argumenttag also writes an error to the SAS log if there is a duplicate key.

As I understand, during loading hash the source table is reading sequentially. Sorting before loading hash should be enough. I hope Smiley Happy

Valued Guide
Posts: 2,177

Re: Question about Hash Tables

Lucas

for a small amount of extra code, I was intent on avoiding

. Sorting before loading hash should be enough. I hope 


peter

Super User
Super User
Posts: 7,060

Re: Question about Hash Tables

If the data is not sorted then you can just decide whether to replace the found value with the new one.

data lookup ;

if _n_=1 then do;

  if 0 then set dates;

  declare Hash H();

  H.DefineKey ('number');

  H.DefineData ('date');

  H.DefineDone ();   /* complete hash table definition */

  do until(eof);

    set dates(rename=(date=nextdate)) end=eof;

    if h.find() then do;

      date=nextdate; h.add();

    end;

    else if nextdate > date then do;

      date=nextdate; h.replace();

    end;

  end;

end;

set trans;

if h.find() then date=.;

put number= date=;

run;

Super Contributor
Posts: 387

Re: Question about Hash Tables

Hi,

According to the documentation (link from @Scott Bass):

If the data set contains duplicate keys, by default, the first instance is stored in the hash object; subsequent instances are ignored. To store the last instancein the hash object, use the DUPLICATE argument tag. The DUPLICATE argumenttag also writes an error to the SAS log if there is a duplicate key.

Actually, what I meant was, you need to understand the difference between:

  declare Hash H(dataset: "dates");

  H.DefineKey ('number');

  H.DefineData ('date');

  H.DefineDone ();   /* complete hash table definition */

And (untested):

declare Hash H();

H.DefineKey ('number');

H.DefineData ('date');

do until (eof);

  set dates end=eof;

  H.Add();

end;


H.DefineDone ();   /* complete hash table definition */

And:

declare Hash H();

H.DefineKey ('number');

H.DefineData ('date');

do until (eof);

  set dates end=eof;

  H.Replace();

end;


H.DefineDone ();   /* complete hash table definition */

Assuming a sorted order of your keys, I view Add() as "keep the first occurence" and Replace() as "keep the last occurence" of the key value.  Somewhat analogous to first. and last. processing (just an analogy, not identical!)

I didn't read the finer points of the original post, so this knowledge may or may not be relevant, but it's good to know regardless.

I have sometimes used two hash objects, the first to "de-dup", or otherwise get my desired data into a hash, and the second hash object (using different keys) to return the data in desired sort order.

Regards,

Scott

Ask a Question
Discussion stats
  • 13 replies
  • 379 views
  • 6 likes
  • 6 in conversation