BookmarkSubscribeRSS Feed
J_CKY
Obsidian | Level 7

I was trying to do a many-to-many hash join. 

 

use sashelp.class as base table, I added extra row to get a data set with duplicated Name:

data work.class;
set sashelp.class end=eof;
if eof then do;
  Name = 'Alice'  ;
  Sex='F'  ;
  age=20  ;
  height=60  ;
  weight=159 ;
end;
run;

 

Then, I created a lookup table:

data work.lkup;
input name $ Var1 Var2;
datalines;
Alice  12  54
Jack   13  56
James  4   6
Alice  37  25
;
run;

 

I managed to created a left join by using the following code:

data work.test;
set work.class;

if _N_ =1 then do;
  if 0 then set lkup;
  declare hash h(dataset: 'lkup', hashExp:16 , multidata: 'y');
  h.defineKey('name');;
  h.defineData(all:'y');
  h.defineDone();
end;

if h.find() then do;
  call missing (of _all_);
output;
end;
if not h.find();output;
if not h.find_next();output;

run;

 

but the bolded call missing statement clear out all of the variable, both data from class and lkup. 

 

So instead of manually using the following code to clear out non-matched results

call missing (var1 var2);

is there any other way I can call missing var1 and var2 easily? (I am trying to apply the logic by using a lookup table with lots of variables.)

 

 

Thanks.

6 REPLIES 6
J_CKY
Obsidian | Level 7

I reckon I can add an array for all the variables from lkup and then call missing using the array. 

array lkup {*} var1 -- var3;
...
call missing (of lkup {*});
...


However, this is not ideal as I still have to define the first and last variables manually.  Better solutions are welcome.

 

Astounding
PROC Star

Technically, this will not work anyway since you can't mix numeric and character variables in the same array.  However, you can define two arrays, as long as you do it in the proper spot within the program:

 

array nums {*} _numeric_;

array chars {*} _character_;

 

The array statements must be placed early in the program.  Just before the initial SET statement, you can insert:

 

if 5=4 then do;.  

   set lkup;

   ** both array statements;

end;

 

That way, their definition will be limited to the variables in the lookup data set.  You will need two call missings later (one for nums{*} and one for chars{*}), and your program may still fail for a data set that contains only one type of variable.

J_CKY
Obsidian | Level 7
I also noticed that the

if not h.find_next(); output;

only return the 2nd result (if any) then, the 3rd result will be missed (if any);

what's the solution to this?
novinosrin
Tourmaline | Level 20

This should meet your req:

 

data work.test;

set work.class;

if _N_ =1 then do;

if 0 then  set lkup;

declare hash h(dataset: 'lkup', hashExp:16 , multidata: 'y');

h.defineKey('name');;

h.defineData(all:'y');

h.defineDone();

end;

rc = h.find();

if rc ne 0 then output;

if (rc = 0) then do;

output;

rc = h.find_next();

do while(rc = 0);

output;

rc = h.find_next();

end;

call missing(of v: );

end;

drop rc;

run;

 

 

mkeintz
PROC Star

The technique is to take advantage of how SAS builds the program data vector.  In the program below, the italic-bold syntax assures that variables from LKUP are between variables _pre and _post.  Then the modified call missing works as you apparently intend.

 

Note: any var that is on BOTH class and LKUP will not be set to missing, since its position in the PDV will be to the left of _pre.

 

 

data work.test (drop=_pre _post);
set work.class;

 

retain _pre .;
if _N_ =1 then do;
  if 0 then set lkup;
  declare hash h(dataset: 'lkup', hashExp:16 , multidata: 'y');
  h.defineKey('name');;
  h.defineData(all:'y');
  h.defineDone();
end;

retain _post .;

if h.find() then do;
  call missing (of _pre -- _post);
output;
end;
if not h.find();output;
if not h.find_next();output;

run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star

In addition to my suggestion of building the program data vector to make call missing(of _pre -- _post) work as intended, I suggest you modify the code below,

 

from

      if h.find() then do;
        call missing (of _pre--_post_);

        output;
      end;
      if not h.find();output;
      if not h.find_next();output;

 

to

      call missing (of _pre--_post_);

      rc=h.find();

      output;

      do rc=h.find_next() by 0 while (rc=0);

         output;

         rc=h.find_next();

      end;

         
    It reduces the number of explicit output statments, and it accomodates cases in which a key value occurs more than twice.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1280 views
  • 0 likes
  • 4 in conversation