Solved: Efficient hash Look-up

OLUGBOJI · Posted 03-27-2019 05:12 PM

I am validating some datasets and I wish to look up postal codes in the postal_history dataset. The postal history dataset consists of postal codes for each observation for 10 years. I am trying to see if I can validate by matching postal codes. There are millions of records in the postal_history dataset so I do not want to do a join on

(Postal eq post1 or postal eq post2 or postal eq post3... ) for example. Is there a more efficient way to do this.

Table: postal_history

ID	Post1	Post2	Post3	Post4	Post5	Postal1	Postal2	Postal3	Postal4	Postal5
1	P2e3	P2e4	P2e5	P2e6	P2e7	P2e8	P2e9	P2e0	P2e4	P2e8
2	P2e7	P2e8	P2e4	P2e4	P2f6	P2e4	P2e8	P2e8	P2e8	P2f1

Table: invalid

id	postal	Sex	dob	Address1	phone
1	P2e6	m	09/06/04	f	123456
2	P2e9	m	09/06/05	g	654321
3	P2e8	f	09/06/06	h	789012

Table: want

id	postal	Sex	dob	Address1	phone	match
1	P2e6	m	09/06/04	f	123456	match
2	P2e9	m	09/06/05	g	654321	unmatch
3	P2e8	f	09/06/06	h	789012	match

have => postal_history and invalid tables

want=>want;

novinosrin · Posted 03-27-2019 05:45 PM

data history;
input ID $	(Post1	Post2	Post3	Post4	Post5	Postal1	Postal2	Postal3	Postal4	Postal5) ($);
cards;
1	P2e3	P2e4	P2e5	P2e6	P2e7	P2e8	P2e9	P2e0	P2e4	P2e8
2	P2e7	P2e8	P2e4	P2e4	P2f6	P2e4	P2e8	P2e8	P2e8	P2f1
;
data invalid;
input (id	postal	Sex	dob	Address1	phone) ($);
cards;
1	P2e6	m	9/6/2004	f	123456
2	P2e9	m	9/6/2005	g	654321
3	P2e8	f	9/6/2006	h	789012
;
data want;
if _n_=1 then do;
if 0 then set invalid;
dcl hash H () ;
	h.definekey  ("id","post") ;
   h.definedone () ;
   length match $10 post $8;
   call missing(post);
   do until(lr);
   set history end=lr;
   array t(*) Post1--Postal5;
   do i= 1 to dim(t);
   post=t(i);
   rc=h.add();
   end;
   end;
end;
set invalid;
if h.check(key:id,key:postal)=0 then match='Match';
else  Match='No Match';
drop Post1--Postal5 post i rc;
run;

@OLUGBOJI Sounds like you want that in hash

View solution in original post

novinosrin · Posted 03-27-2019 05:16 PM

Not clear with the WANT there. Can you clarify your WANT output and the HAVEs plz?

andreas_lds · Posted 03-27-2019 05:22 PM

Sorry, not clear what you want. And please post data in usable form.

andreas_lds · Posted 03-27-2019 05:29 PM

untested code, assuming that both datasets are sorted by id and id is unique in both datasets:

data want;
  merge invalid(in=needsProcessing) postal_history;
  by id;

  length match $ 7;

  if needsProcessing;
  
  if findw(catx(' ', Post1, Post2, ...), Postal) then match ='match';
  else match = 'unmatch';

  drop Post1 Post 2 ....;
run;

OLUGBOJI · Posted 03-27-2019 05:35 PM

Thanks. id is unique. Data was not sorted initially and using Hash object would not need sorting.

novinosrin · Posted 03-27-2019 05:45 PM

data history;
input ID $	(Post1	Post2	Post3	Post4	Post5	Postal1	Postal2	Postal3	Postal4	Postal5) ($);
cards;
1	P2e3	P2e4	P2e5	P2e6	P2e7	P2e8	P2e9	P2e0	P2e4	P2e8
2	P2e7	P2e8	P2e4	P2e4	P2f6	P2e4	P2e8	P2e8	P2e8	P2f1
;
data invalid;
input (id	postal	Sex	dob	Address1	phone) ($);
cards;
1	P2e6	m	9/6/2004	f	123456
2	P2e9	m	9/6/2005	g	654321
3	P2e8	f	9/6/2006	h	789012
;
data want;
if _n_=1 then do;
if 0 then set invalid;
dcl hash H () ;
	h.definekey  ("id","post") ;
   h.definedone () ;
   length match $10 post $8;
   call missing(post);
   do until(lr);
   set history end=lr;
   array t(*) Post1--Postal5;
   do i= 1 to dim(t);
   post=t(i);
   rc=h.add();
   end;
   end;
end;
set invalid;
if h.check(key:id,key:postal)=0 then match='Match';
else  Match='No Match';
drop Post1--Postal5 post i rc;
run;

@OLUGBOJI Sounds like you want that in hash

novinosrin · Posted 03-27-2019 06:56 PM

@OLUGBOJI This is better i think


data history;
input ID $	(Post1	Post2	Post3	Post4	Post5	Postal1	Postal2	Postal3	Postal4	Postal5) ($);
cards;
1	P2e3	P2e4	P2e5	P2e6	P2e7	P2e8	P2e9	P2e0	P2e4	P2e8
2	P2e7	P2e8	P2e4	P2e4	P2f6	P2e4	P2e8	P2e8	P2e8	P2f1
;
data invalid;
input (id	postal	Sex	dob	Address1	phone) ($);
cards;
1	P2e6	m	9/6/2004	f	123456
2	P2e9	m	9/6/2005	g	654321
3	P2e8	f	9/6/2006	h	789012
;

data want;
if _n_=1 then do;
if 0 then set invalid;
dcl hash H (dataset:'invalid') ;
	h.definekey  ("id") ;
   h.definedata (all:'y') ;
   h.definedone();
   end;
set history;
array t(*) Post1--Postal5;
length match $10;
rc=h.find();
if postal in t then match='Match';
else Match='No Match';
drop rc Post1--Postal5;
run;

novinosrin · Posted 03-27-2019 05:33 PM


data history;
input ID $	(Post1	Post2	Post3	Post4	Post5	Postal1	Postal2	Postal3	Postal4	Postal5) ($);
cards;
1	P2e3	P2e4	P2e5	P2e6	P2e7	P2e8	P2e9	P2e0	P2e4	P2e8
2	P2e7	P2e8	P2e4	P2e4	P2f6	P2e4	P2e8	P2e8	P2e8	P2f1
;
data invalid;
input (id	postal	Sex	dob	Address1	phone) ($);
cards;
1	P2e6	m	9/6/2004	f	123456
2	P2e9	m	9/6/2005	g	654321
3	P2e8	f	9/6/2006	h	789012
;
data want;
merge invalid history;
by id;
array t(*) Post1--Postal5;
length match $10;
if postal in t then match='Match';
else Match='No Match';
run;

Note: ID3 is missing in history table

Efficient hash Look-up

Re: Efficient hash Look-up

Re: Efficient hash Look-up

Re: Efficient hash Look-up

Re: Efficient hash Look-up

Re: Efficient hash Look-up

Re: Efficient hash Look-up

Re: Efficient hash Look-up

Re: Efficient hash Look-up

SAS Innovate 2025: Register Now