Solved: Re: Request for Efficent look up from one to many datasets?

CharlotteCain · Posted 10-13-2015 02:39 PM

Dear All,

I have a need to look up values from one dataset to many datasets and impute a binary value of 1 or 0 based on whether a match is found or not, For example, I have a dataset1:

ID_1

a

g

h

b

c

e

d

dataset 2 dataset3

ID_2 ID_3

g c

k e

h a

b

Wanted_output_dataset

Dataset1 ID values should look up with dataset2 and if found the values should be imputed as 1 else 0.

ID_1 ID_2 ID_3

a 0 1

g 1 0

h 1 0

b 0 1

c 0 1

e 0 1

d 0 0

k 1 0

Please notice the values are values of g,k h of ID2 are found in ID_1, so being true should have value 1 and when not found 0 and the same logic applies for look up from ID_1 to ID_3 where C, E , A and B are found. Any help would be greatly appreciated. Thanks.

Also note, Id1, Id2 and Id3 are datasets and not just variables, so the join/hash solution if anyone knows?

Regards,

Charlotte

Astounding · Posted 10-13-2015 02:58 PM

Easy and commonplace is to sort and merge. After sorting:

data want;

merge dataset1 (in=in1)

dataset2 (in=in2 (rename=(id_2=id_1))

dataset3 (in=in3 rename=(id_3=id_1));

by id_1;

if in1;

id_2=in2;

id_3=in3;

run;

But size of data sets and the need for speed may push in another direction.

View solution in original post

Astounding · Posted 10-13-2015 02:58 PM

Easy and commonplace is to sort and merge. After sorting:

data want;

merge dataset1 (in=in1)

dataset2 (in=in2 (rename=(id_2=id_1))

dataset3 (in=in3 rename=(id_3=id_1));

by id_1;

if in1;

id_2=in2;

id_3=in3;

run;

But size of data sets and the need for speed may push in another direction.

CharlotteCain · Posted 10-13-2015 03:51 PM

Bob, You are a genius as always and I do like the logic with nice use of the automatic IN= variable. The dataset size is about a couple of million records and in essence your solution should work fine. Please do not doubt my courtesy if I mark your solution as correct tomorrow coz It's already too late in England and I will sincerely thank once I run your solution at work. I hope is it should handle more datasets if need be, i mean instead of the example 3, if it has 7 or 8 for example.

It's not the first time you have helped me out and so just thanks won't be enough. Thank you so much

Regards,

Charlotte

Also my sincere thanks to saskiwi, stat_sas, steeDC for your help.

SASKiwi · Posted 10-13-2015 03:04 PM

If you have a lot of lookups to do in the same DATA step then I would not recommend a hash join as there is too much coding required. I suggest you use a PROC FORMAT lookup.

Load your lookup datasets into PROC FORMAT using the CNTLIN option to create the lookup format then a single PUT function statement can do the lookup:

ID_2 = input(put(ID_1, $ID2_fmt.), 2.);

stat_sas · Posted 10-13-2015 03:19 PM

proc sql;
select ID_1,case when ID_1=ID_2 then 1 else 0 end as ID_2,
case when ID_1=ID_3 then 1 else 0 end as ID_3
from id_1 left join id_2 on ID_1=ID_2
left join id_3 on ID_1=ID_3;
quit;

Steelers_In_DC · Posted 10-13-2015 03:37 PM

Here is a solution:

data one;
input ID_1$;
cards;
a
g
h
b
c
e
d
;

data two;
input ID_2$;
cards;
g
k
h
;

data three;
input ID_3$;
cards;
c
e
a
b
;

proc sort data=one;by id_1;
proc sort data=two;by id_2;
proc sort data=three;by id_3;

data want;
merge one   (in=a)
      two   (in=b rename=(id_2=id_1))
      three (in=c rename=(id_3=id_1));
by id_1;
if a;
if a and b then BiVal = 2;
if a and c then BiVal = 3;
run;

Astounding · Posted 10-13-2015 03:56 PM

I'm still not sure about the size of the data sets, and which one(s) might already be sorted. But in any case, after sorting if necessary, switching from MERGE to SET will give you a significant improvement:

data want;

set dataset2 (in=in2 rename=(id_2=id_1))

dataset3 (in=in3 rename=(id_3=id_1))

dataset1 (in=in1);

by id_1;

if first.id_1 then do;

id_2=0;

id_3=0;

end;

if in2 then id_2=1;

else if in3 then id_3=1;

if in1;

run;

MERGE can be an absolute hog ... comparing its results with SET results might surprise you.

Astounding · Posted 10-13-2015 04:01 PM

Add to that: retain id_2 id_3;

CharlotteCain · Posted 10-13-2015 04:11 PM

Noted, The concept of interleaving with SET and BY is something I will have to familiarise. Thank you so much once again Bob, I will try both and will reach out if I am stuck and need to follow up 🙂

Enjoy your rest of the day,

Charlotte

MikeZdeb · Posted 10-13-2015 04:18 PM

Hi, another idea like the LOOKUP approach but not with formats (NOTE: there is no value of K in your posted DATASET1, I added one to the data set) ...

data x;
input id_1 :$1. @@;
datalines;
a g h b c e d k
;

data y;
input y :$1. @@;
datalines;
g k h
;

data z;
input z :$1. @@;
datalines;
c e a b
;

data want;
length iny inz $10;
do until (lastz);
set y z end=lastz;
iny = catt(iny,y);
inz = catt(inz,z);
end;
do until (lastx);
set x end=lastx;
id_2 = (findc(id_1,iny));
id_3 = (findc(id_1,inz));
output;
end;
keep id: ;
run;

data set WANT ...

Obs id_1 id_2 id_3

1      a        0       1
2      g        1       0
3      h        1       0
4      b        0       1
5      c        0       1
6      e        0       1
7      d        0       0
8      k        1       0

Astounding · Posted 10-13-2015 04:34 PM

Mike,

That's workable as long as your data sets are small enough, and as long as your matching values containing a single character. Once they contain multiple characters, you would be better off adding a delimiter that doesn't appear in the list. So instead of creating:

gkh

Create:

#g#k#h#

Adding the first and last delimiters makes searching easier later on.

MikeZdeb · Posted 10-13-2015 04:41 PM

Sure ... search for words rather than characters. The '#' is not in the list of default delimiters fof the FINDW function so a '|' is used in the CATX function to construct the lookup lists (INY, INZ). The TRIM function is used since the variable ID_1 has length of 10. The ^^ that precedes the FINDW function changes the result from position of the string within the lookup list to a 0 or 1. Var INY and INZ are given maximim length. If 32K is too short then use the INDEX method posted earlier.

NOTE: COMMENTS added on 10/14/15 after a request for more explanation of the SAS code

#1 As has been mentioned, this method is good if the two "LOOKUP A VALUE OF ID_1" data sets (Y and Z) are not large. The limit on large is a length of 32767 for either variable INY or INZ created with the CATX function

#2 If that's the case, use another method ... maybe the INDEX data set approach (though sorting might increase the speed of using the INDEX, neither data set X, Y, nor Z needs to be sorted as they do if MERGE is used to create variables ID_2 and ID_3).

#3 Also, if an index is present in all three data sets (indices are ID_1 data set X, ID_2 data set Y, ID_3 data set Z), you can use the MERGE solution with sorting the data sets.

data x;
input id_1 :$10. @@;
datalines;
aa gg hh bb cc ee dd kdlang
;

data y;
input y :$10. @@;
datalines;
g kdlang h
;

data z;
input z :$10. @@;
datalines;
cc ee aa bab
;

data want;

* specify a LENGTH for variables created with the CATX function;

* if there is no LENGTH specified and use of the CATX function;

* results in a variable with a length > 200;

* you will get a WARNING message in the SAS LOG (see below at end of post);
length iny inz $32767;

* use a loop to read observations in data sets Y and Z;

* create the new variables INY and INZ;

* after the loop, they look as follow;

* use the FINDW (find a WORD) function to see if a value of ID_1 in an observation in day set X;

* is present within variables either INY or INZ (shown above);

* the FINDW function returns a ZERO if the value is NOT FOUND;

* it returns the position of ID1 in INY or INZ if the value is found;

* the use of ^^ (not not) changes the value from the position to a 1 if any value is found;
do until (lastx);
set x end=lastx;
id_2 = ^^findw(iny,trim(id_1));
id_3 = ^^findw(inz,trim(id_1));
output;
end;
keep id: ;
run;

data set WANT ...

Obs id_1 id_2 id_3

1     aa          0       1
2     gg          0       0
3     hh          0       0
4     bb          0       0
5     cc          0       1
6     ee          0       1
7     dd          0       0
8     kdlang      1       0

NOTE: WARNING MESSAGE if too large (LENGTH > 200) a variable is created with any CAT function and no length has been specified earlier in the data step ...

WARNING: In a call to the CATX function, the buffer allocated for the result was not long enough to contain the concatenation of all the arguments

CharlotteCain · Posted 10-14-2015 12:26 PM

Good evening @Astounding, Your solution worked beautifully. I still am happy with the merge solution. Like i said I will give you my feedback today after a nice day at work, I am so happy and thankful to you. You are a star!

@MikeZdeb Thanks for the solution, I m in awe of all the responses. May i request you to please add some comments in your code if that's not a bother coz that would help me understand better.

Above all, thank you for the valuable time and the selfless help.

Cheers,

Charlotte

MikeZdeb · Posted 10-13-2015 04:34 PM

Hi, another idea ...

NOTE: comments added on 10/14/15 after a request for more explanation of the posted code

data x;
input id_1 :$1. @@;
datalines;
a g h b c e d k
;

* create an INDEX for data set Y;

* can be used to see if a given value of the INDEX variable ID_1;

* exists when the data set is read with SET Y/KEY=ID_1;

data y (index=(id_1));
input id_1 :$1. @@;
datalines;
g k h
;

* see comment for data set Y;

data z (index=(id_1));
input id_1 :$1. @@;
datalines;
c e a b
;

data want;

* read a value of ID_1 from data set X;
set x;

* use that value to find an observation in data set Y with the same valuye of ID_1;

set y key=id_1/unique;

* if no value is found, the value of _ERROR_ is change to 1, if found _ERROR_ is 0;

* the ^_ERROR_ changes 1 to 0, 0 to 1 (now 1 represents found, 0 is not found;

id_2 = ^_error_;

* reset the vlaue of _ERROR_ to 0;

* same process for lookup in data set Z;
_error_ = 0;
set z key=id_1/unique;
id_3 = ^_error_;
_error_ = 0;
run;

If data sets Y and Z already exist, without INDICES, you can add the INDEX to each with PROC DATASETS ...

data y;
input id_1 :$1. @@;
datalines;
g k h
;

data z;
input id_1 :$1. @@;
datalines;
c e a b
;

proc datasets lib=work noprint;
modify y;
index create id_1;
modify z;
index create id_1;
quit;

PROC CONTENTS for data set Y (looks the same for Z) ...

Alphabetic List of Variables and Attributes

# Variable Type Len

1 id_1 Char 1

Alphabetic List of Indexes and Attributes

# of Unique
# Index Values

1 id_1 3

ps There's a LOT to know about using an INDEX, but for this example what you see above should suffice. For more, consult the "king of the SAS index" ... Mike Raithel ...

The Basics of Using SAS Indexes

http://www2.sas.com/proceedings/sugi30/247-30.pdf

The Complete Guide to SAS Indexes

http://www.amazon.com/The-Complete-Guide-SAS-Indexes/dp/1590478495

CharlotteCain · Posted 10-14-2015 02:59 PM

@MikeZdeb Thank you so much Sir! for adding the comments. I really really appreciate. Sorry for the bother. Have a great day!

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away