Hi,
I have the following datasets:
/* Dataset 1 */
DATA table1;
LENGTH pcode $ 8
xcode $ 2
bcode $ 2;
INFILE DATALINES DLM='|' DSD;
INPUT pcode $ xcode $ bcode $;
DATALINES;
00000001|03|A
00000002|03|B
00000003|04|C1
00000004|05|D2
00000005|05|D3 /* This value is not in table2, but needs to be added */
RUN;
/* Dataset 2 */
DATA table2;
LENGTH pcode $ 8
xcode $ 2
xcode_key $ 2
bcode $ 2
bcode_key $ 2;
INFILE DATALINES DLM='|' DSD;
INPUT pcode $ xcode $ xcode_key $ bcode $ bcode_key;
DATALINES;
00000001|03|10|A|5
00000002|03|10|B|10
00000003|04|20|C1|15
00000004|05|30|D2|20
RUN;
I would like to use table1 as my hash lookup table so I can compare it to table2, but I want to keep all of the values from table1 and only those that match from table2. Is this possible?
If I were doing a PROC SQL, it would look like this:
PROC SQL;
SELECT
t1.pcode, t2.xcode, t2.bcode
FROM
table1 t1
LEFT JOIN
table2 t2 ON
t1.pcode = t2.pcode
AND t1.xcode = t2.xcode
AND t1.bcode = t2.bcode
QUIT;
I was able to implement a LEFT JOIN from table2 to table1 using the code below, but that's the opposite of what I want.
DATA want(DROP=xcode_key bcode_key);
LENGTH CODE_FOUND $ 1;
IF 0 THEN SET table1;
DECLARE HASH codes(DATASET: 'WORK.table1');
codes.DefineKey('pcode', 'xcode', 'bcode');
codes.DefineData(ALL: 'YES');
codes.DefineDone();
DO UNTIL (EOF);
SET table2 END = EOF;
IF codes.Find() = 0 THEN DO;
CODE_FOUND = 'T';
OUTPUT;
END;
ELSE DO;
CODE_FOUND = 'F';
CALL MISSING(OF pcode--bcode);
OUTPUT;
END;
END;
RUN;
I was hoping to perform the lookup and output this way because I'm trying to determine which values from table1 need to be added/removed from table2.
Here's a sample of the expected output:
pcode xcode bcode CODE_FOUND 00000001 03 A T 00000002 03 B T 00000003 04 C1 T 00000004 05 D2 T 00000005 05 D3 F
Thanks.
Interesting problem. The proc sql you posted has errors. Why do you want to use a hash-object? A merge should be ok:
data want;
merge table2(in=two) table1(in=one);
by pcode xcode bcode;
code_found = ifc(two and one, 'T', 'F');
keep pcode xcode bcode code_found;
run;
@error_prone wrote:Interesting problem. The proc sql you posted has errors. Why do you want to use a hash-object? A merge should be ok
I'm familiar with the merge way, but was wanting to use a hash object because everything I've read indicates that it is super fast and I've been wanting to learn more about hash objects.
While my post shows small data sets, I do have instances where I'm working with lookup tables that could be much larger and comparing them to huge data sets.
Also, what are the errors in the SQL code?
Thanks.
Error-message:
ERROR: Ambiguous reference, column xcode is in more than one table.
ERROR: Ambiguous reference, column pcode is in more than one table.
ERROR: Ambiguous reference, column bcode is in more than one table.
NOTE: PROC SQL set option NOEXEC and will continue to check the syntax of statements.
Could not get it to work with table1 as lookup.
/* Dataset 1 */
DATA table1;
LENGTH pcode $ 8
xcode $ 2
bcode $ 2;
INFILE DATALINES DLM='|' DSD;
INPUT pcode $ xcode $ bcode $;
DATALINES;
00000001|03|A
00000002|03|B
00000003|04|C1
00000004|05|D2
00000005|05|D3 /* This value is not in table2, but needs to be added */
;
RUN;
/* Dataset 2 */
DATA table2;
LENGTH pcode $ 8
xcode $ 2
xcode_key $ 2
bcode $ 2
bcode_key $ 2;
INFILE DATALINES DLM='|' DSD;
INPUT pcode $ xcode $ xcode_key $ bcode $ bcode_key;
DATALINES;
00000001|03|10|A|5
00000002|03|10|B|10
00000003|04|20|C1|15
00000004|05|30|D2|20
00000007|05|42|XX|42 /* should not be in want-dataset */
;
RUN;
data want;
if 0 then set table2;
if _n_ = 1 then do;
declare hash h(dataset: "table2");
h.defineKey("pcode", "xcode", "bcode");
/*h.defineData(all: "yes"); no need to define data-elements */
h.defineDone();
end;
set table1;
CODE_FOUND = ifc(h.check() = 0, "T", "F");
run;
Firstly, what are your key part and data part of the hash table. My guess is that pcode and xcode are the key part. If this is true, then read further.
A way out for using Table1 as part of hash table is to add a FLAG to the data part of the hash table with value of 1. As your Table2
matches the key in the hash table, mark the flag as zero and replace the hash item. This way, we can identify which hash item is not matched to keys of Table2 when the FLAG having value of ONE.
Hope this solution meets your requirements.
data want;
if _n_ = 1 then do;
flag = 1;
declare hash h();
h.definekey('pcode','xcode');
h.definedata('pcode','xcode','bcode','flag');
h.definedone();
declare hiter hi('h');
do until(last);
set table1 end = last;
h.add();
end;
end;
do until(eof);
set table2 end = eof;
if h.find() = 0 then do;
output;
flag = 0;
h.replace();
end;
end;
do while(hi.next() = 0);
call missing(xcode_key,bcode_key);
if flag = 1 then output;
end;
drop flag;
run;
Obs pcode xcode bcode xcode_key bcode_key
1 00000001 03 A 10 5
2 00000002 03 B 10 10
3 00000003 04 C1 20 15
4 00000004 05 D2 30 20
5 00000005 05 D3
Here is the code I wrote a couple of years ago to mock JOIN by hash table.
Hope could help you a little bit.
Use hash table to simulate JOIN is a real tough task. You need consider many scenarios .
data tab1;
input a_var1 a_value;
datalines;
1 11
1 11
2 22
3 33
9 88
run;
data tab2;
input b_var1 b_value;
datalines;
1 11
2 22
3 99
4 999
run;
data want;
if _n_ eq 1 then do;
if 0 then set tab2;
declare hash h(dataset:'tab2',multidata:'y');
declare hiter hi('h');
h.definekey('b_var1');
h.definedata('b_var1','b_value');
h.definedone();
end;
set tab1 end=last;
by a_var1;
if first.a_var1 then call missing(b_var1,b_value);
if h.find(key:a_var1)=0 then h.removedup(key:a_var1);
output;
if last.a_var1 then do;
rc=h.find(key:a_var1);
do while(rc=0);
output;
rc=h.find_next(key:a_var1);
end;
end;
if last then do;
call missing(of _all_);
do while(hi.next()=0);
output;
end;
end;
drop rc;
run;
proc print noobs;run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.