BookmarkSubscribeRSS Feed
mariopellegrini
Pyrite | Level 9

 

Good morning everyone. I encountered an error in a conversion from a sql proc (left join) to a left join with hash table: I report the example showing that the TS_END_VALIDITA field at the end differs between the 2 join versions

 

data  OPE;
input cod_contratto:$15. NUM_REVISIONE TS_FINE_VALIDITA:datetime18.;
format TS_FINE_VALIDITA datetime18.;
datalines;
100004790 35 01Jan1999:00:00:00
100004790 35 14Sep2022:00:00:00
;
data  AA;
input cod_contratto:$15. NUM_REVISIONE;
datalines;
100004790 35
100004790 35
;
 
proc sql;
create table W1140ABS  as
select distinct OPE.COD_CONTRATTO 
               ,OPE.NUM_REVISIONE as PRG_REV_CONTR
               ,OPE.TS_FINE_VALIDITA format=datetime18. 
from AA
left join  OPE
ON AA.COD_CONTRATTO = OPE.COD_CONTRATTO
and OPE.NUM_REVISIONE = AA.NUM_REVISIONE;
quit;

data W1140ABS_hash;
if 0 then set OPE AA;
	 declare hash h_merge(dataset:"OPE");
	 rc = h_merge.DefineKey("COD_CONTRATTO", "NUM_REVISIONE");
	 rc = h_merge.DefineData("TS_FINE_VALIDITA");
	 rc = h_merge.DefineDone();
drop rc;
do while (not eof);
   set AA  end=eof;
   rc = h_merge.find();  
   output;
end;
run; 

Mario

 

7 REPLIES 7
Kurt_Bremser
Super User

The hash object allows only unique keys; While the dataset is read into the object, observations with duplicate keys are discarded.

This means that only the first obs from OPE is present in the object, and only its value makes it into the resulting dataset; the SQL first creates a cartesian join with 4 observations, two of which are discarded when DISTINCT is applied.

mariopellegrini
Pyrite | Level 9

multidata option: "Y" in hash object declaration why doesn't it resolve? are there no other solutions?

Kurt_Bremser
Super User

You still have to iterate through the hash for all matches, and then apply an equivalent of the DISTINCT (e.g. use PROC SORT with NODUPKEY, or use a second hash to store all combinations you already wrote to the output).

Tom
Super User Tom
Super User

MULTIDATA='Y' will help.  But then you have to do more work.  To implement the DISTINCT feature you had in the SQL query then make a second HASH to keep track of which observations from the non-hashed dataset have already occured.

 

Also let's not rename one of the variables in the SQL so we can use PROC COMPARE to test the results.

 

While we are at it there is no need to attach the format again to the other variable. (And to get DATETIME to properly display the year you need a width of at least 19, some type of bug that SAS is not planning to fix).

 

And your SQL query makes no sense as a LEFT join since you are only selecting variables from the RIGHT table.  So converted it to an INNER join.

data  OPE;
  input cod_contratto :$15. NUM_REVISIONE TS_FINE_VALIDITA :datetime18.;
  format TS_FINE_VALIDITA datetime19.;
datalines;
100004790 35 01Jan1999:00:00:00
100004790 35 14Sep2022:00:00:00
x 12
;
data  AA;
input cod_contratto :$15. NUM_REVISIONE;
datalines;
100004790 35
100004790 35
y 24
;
 
proc sql;
create table W1140ABS  as
  select distinct
     OPE.COD_CONTRATTO 
    ,OPE.NUM_REVISIONE /* as PRG_REV_CONTR */
    ,OPE.TS_FINE_VALIDITA /* format=datetime18. */
  from AA
  inner join  OPE
    ON AA.COD_CONTRATTO = OPE.COD_CONTRATTO
    and OPE.NUM_REVISIONE = AA.NUM_REVISIONE
;
quit;

data W1140ABS_hash;
  if 0 then set OPE AA;
  if _n_=1 then do;
    declare hash h_merge(dataset:"OPE",multidata:'Y');
    rc = h_merge.DefineKey("COD_CONTRATTO", "NUM_REVISIONE");
    rc = h_merge.DefineData("TS_FINE_VALIDITA");
    rc = h_merge.DefineDone();
    declare hash h_left();
    rc = h_left.DefineKey("COD_CONTRATTO", "NUM_REVISIONE");
    rc = h_left.DefineData("RC");
    rc = h_left.DefineDone();
  end;
  set AA  end=eof;
  if 0=h_left.add() then do;
    rc = h_merge.find(); 
    do while(not rc);
      output;
      rc = h_merge.find_next();
    end;
  end;
  drop rc ;
run; 

proc compare data=W1140ABS compare=W1140ABS_hash listall;
run;

mariopellegrini
Pyrite | Level 9

Thank you very much ... a bit complex as a code! actually I would need a left join because in the original query I also have other variables and I have to get the exact rows of the left query. Can you modify the hash you wrote to bring back the left join? Thank you!

Tom
Super User Tom
Super User

Sure.  You need to add another variable to the LEFT dataset so you can see what you are joining.

So here is a cleaned up example:

data  OPE;
  input cod_contratto :$15. NUM_REVISIONE TS_FINE_VALIDITA :datetime.;
  format TS_FINE_VALIDITA datetime19.;
datalines;
100004790 35 01Jan1999:00:00:00
100004790 35 14Sep2022:00:00:00
x 12
;
data  AA;
input cod_contratto :$15. NUM_REVISIONE extra ;
datalines;
100004790 35 1
100004790 35 2
y 24 1
;
 
proc sql;
create table W1140ABS  as
  select distinct
     aa.*
    ,OPE.TS_FINE_VALIDITA /* format=datetime19. */
  from AA
  left join  OPE
    ON AA.COD_CONTRATTO = OPE.COD_CONTRATTO
    and OPE.NUM_REVISIONE = AA.NUM_REVISIONE
;
quit;

data W1140ABS_hash;
  if 0 then set OPE AA;
  if _n_=1 then do;
    declare hash h_merge(dataset:"OPE",multidata:'Y');
    rc = h_merge.DefineKey("COD_CONTRATTO", "NUM_REVISIONE");
    rc = h_merge.DefineData("TS_FINE_VALIDITA");
    rc = h_merge.DefineDone();
    declare hash h_left();
    rc = h_left.DefineKey("COD_CONTRATTO", "NUM_REVISIONE", "extra");
    rc = h_left.DefineData("RC");
    rc = h_left.DefineDone();
  end;
  set AA  end=eof;
  call missing(of TS_FINE_VALIDITA);
  if 0=h_left.add() then do;
    rc = h_merge.find(); 
    do until(rc);
      output;
      rc = h_merge.find_next();
    end;
    put;
  end;
  drop rc ;
run; 

proc compare data=W1140ABS compare=W1140ABS_hash listall;
run;

proc print data=W1140ABS;
run;

mariopellegrini
Pyrite | Level 9

Thanks Tom. Looking closer, it was indeed an inner join. Seeing the original data, however, I still found something that doesn't add up, because the presence of additional fields creates double lines in the sql that disappear in the hash. Here is an example, I think it can be solved by introducing multidata="Y" also on the second hash:

 

data  OPE;
  input cod_contratto :$15. NUM_REVISIONE TS_FINE_VALIDITA :datetime.;
  format TS_FINE_VALIDITA datetime19.;
datalines;
100004790 35 01Jan1999:00:00:00
;
data AA;
input cod_contratto :$15. NUM_REVISIONE extra ;
datalines;
100004790 35 1
100004790 35 2
;
 
proc sql;
create table W1140ABS  as
  select distinct
     aa.cod_contratto,
	 AA.NUM_REVISIONE,
	 AA.extra,
     OPE.TS_FINE_VALIDITA
  from AA
  inner join  OPE
    ON AA.COD_CONTRATTO = OPE.COD_CONTRATTO
    and OPE.NUM_REVISIONE = AA.NUM_REVISIONE;
quit;

 
data W1140ABS_hash;
  if 0 then set OPE AA;
  if _n_=1 then do;
    declare hash h_merge(dataset:"OPE",multidata:'Y');
    rc = h_merge.DefineKey("COD_CONTRATTO", "NUM_REVISIONE");
    rc = h_merge.DefineData( "TS_FINE_VALIDITA");
    rc = h_merge.DefineDone();
    declare hash h_left();
    rc = h_left.DefineKey("COD_CONTRATTO", "NUM_REVISIONE");
    rc = h_left.DefineData("RC");
    rc = h_left.DefineDone();
  end;
  set AA  end=eof;
  if 0=h_left.add() then do;
    rc = h_merge.find(); 
    do while(not rc);
      output;
      rc = h_merge.find_next();
    end;
  end;
  drop rc ;
run; 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 1974 views
  • 1 like
  • 3 in conversation