divergence between proc sql join and hash

mariopellegrini · Posted 02-23-2023 05:31 AM

Good morning everyone. I encountered an error in a conversion from a sql proc (left join) to a left join with hash table: I report the example showing that the TS_END_VALIDITA field at the end differs between the 2 join versions

data  OPE;
input cod_contratto:$15. NUM_REVISIONE TS_FINE_VALIDITA:datetime18.;
format TS_FINE_VALIDITA datetime18.;
datalines;
100004790 35 01Jan1999:00:00:00
100004790 35 14Sep2022:00:00:00
;
data  AA;
input cod_contratto:$15. NUM_REVISIONE;
datalines;
100004790 35
100004790 35
;
 
proc sql;
create table W1140ABS  as
select distinct OPE.COD_CONTRATTO 
               ,OPE.NUM_REVISIONE as PRG_REV_CONTR
               ,OPE.TS_FINE_VALIDITA format=datetime18. 
from AA
left join  OPE
ON AA.COD_CONTRATTO = OPE.COD_CONTRATTO
and OPE.NUM_REVISIONE = AA.NUM_REVISIONE;
quit;

data W1140ABS_hash;
if 0 then set OPE AA;
	 declare hash h_merge(dataset:"OPE");
	 rc = h_merge.DefineKey("COD_CONTRATTO", "NUM_REVISIONE");
	 rc = h_merge.DefineData("TS_FINE_VALIDITA");
	 rc = h_merge.DefineDone();
drop rc;
do while (not eof);
   set AA  end=eof;
   rc = h_merge.find();  
   output;
end;
run;

Mario

Kurt_Bremser · Posted 02-23-2023 06:41 AM

The hash object allows only unique keys; While the dataset is read into the object, observations with duplicate keys are discarded.

This means that only the first obs from OPE is present in the object, and only its value makes it into the resulting dataset; the SQL first creates a cartesian join with 4 observations, two of which are discarded when DISTINCT is applied.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mariopellegrini · Posted 02-23-2023 08:51 AM

multidata option: "Y" in hash object declaration why doesn't it resolve? are there no other solutions?

Kurt_Bremser · Posted 02-23-2023 09:15 AM

You still have to iterate through the hash for all matches, and then apply an equivalent of the DISTINCT (e.g. use PROC SORT with NODUPKEY, or use a second hash to store all combinations you already wrote to the output).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Tom · Posted 02-23-2023 09:46 AM

MULTIDATA='Y' will help. But then you have to do more work. To implement the DISTINCT feature you had in the SQL query then make a second HASH to keep track of which observations from the non-hashed dataset have already occured.

Also let's not rename one of the variables in the SQL so we can use PROC COMPARE to test the results.

While we are at it there is no need to attach the format again to the other variable. (And to get DATETIME to properly display the year you need a width of at least 19, some type of bug that SAS is not planning to fix).

And your SQL query makes no sense as a LEFT join since you are only selecting variables from the RIGHT table. So converted it to an INNER join.

data  OPE;
  input cod_contratto :$15. NUM_REVISIONE TS_FINE_VALIDITA :datetime18.;
  format TS_FINE_VALIDITA datetime19.;
datalines;
100004790 35 01Jan1999:00:00:00
100004790 35 14Sep2022:00:00:00
x 12
;
data  AA;
input cod_contratto :$15. NUM_REVISIONE;
datalines;
100004790 35
100004790 35
y 24
;
 
proc sql;
create table W1140ABS  as
  select distinct
     OPE.COD_CONTRATTO 
    ,OPE.NUM_REVISIONE /* as PRG_REV_CONTR */
    ,OPE.TS_FINE_VALIDITA /* format=datetime18. */
  from AA
  inner join  OPE
    ON AA.COD_CONTRATTO = OPE.COD_CONTRATTO
    and OPE.NUM_REVISIONE = AA.NUM_REVISIONE
;
quit;

data W1140ABS_hash;
  if 0 then set OPE AA;
  if _n_=1 then do;
    declare hash h_merge(dataset:"OPE",multidata:'Y');
    rc = h_merge.DefineKey("COD_CONTRATTO", "NUM_REVISIONE");
    rc = h_merge.DefineData("TS_FINE_VALIDITA");
    rc = h_merge.DefineDone();
    declare hash h_left();
    rc = h_left.DefineKey("COD_CONTRATTO", "NUM_REVISIONE");
    rc = h_left.DefineData("RC");
    rc = h_left.DefineDone();
  end;
  set AA  end=eof;
  if 0=h_left.add() then do;
    rc = h_merge.find(); 
    do while(not rc);
      output;
      rc = h_merge.find_next();
    end;
  end;
  drop rc ;
run; 

proc compare data=W1140ABS compare=W1140ABS_hash listall;
run;

mariopellegrini · Posted 02-23-2023 10:17 AM

Thank you very much ... a bit complex as a code! actually I would need a left join because in the original query I also have other variables and I have to get the exact rows of the left query. Can you modify the hash you wrote to bring back the left join? Thank you!

Tom · Posted 02-23-2023 11:16 AM

Sure. You need to add another variable to the LEFT dataset so you can see what you are joining.

So here is a cleaned up example:

data  OPE;
  input cod_contratto :$15. NUM_REVISIONE TS_FINE_VALIDITA :datetime.;
  format TS_FINE_VALIDITA datetime19.;
datalines;
100004790 35 01Jan1999:00:00:00
100004790 35 14Sep2022:00:00:00
x 12
;
data  AA;
input cod_contratto :$15. NUM_REVISIONE extra ;
datalines;
100004790 35 1
100004790 35 2
y 24 1
;
 
proc sql;
create table W1140ABS  as
  select distinct
     aa.*
    ,OPE.TS_FINE_VALIDITA /* format=datetime19. */
  from AA
  left join  OPE
    ON AA.COD_CONTRATTO = OPE.COD_CONTRATTO
    and OPE.NUM_REVISIONE = AA.NUM_REVISIONE
;
quit;

data W1140ABS_hash;
  if 0 then set OPE AA;
  if _n_=1 then do;
    declare hash h_merge(dataset:"OPE",multidata:'Y');
    rc = h_merge.DefineKey("COD_CONTRATTO", "NUM_REVISIONE");
    rc = h_merge.DefineData("TS_FINE_VALIDITA");
    rc = h_merge.DefineDone();
    declare hash h_left();
    rc = h_left.DefineKey("COD_CONTRATTO", "NUM_REVISIONE", "extra");
    rc = h_left.DefineData("RC");
    rc = h_left.DefineDone();
  end;
  set AA  end=eof;
  call missing(of TS_FINE_VALIDITA);
  if 0=h_left.add() then do;
    rc = h_merge.find(); 
    do until(rc);
      output;
      rc = h_merge.find_next();
    end;
    put;
  end;
  drop rc ;
run; 

proc compare data=W1140ABS compare=W1140ABS_hash listall;
run;

proc print data=W1140ABS;
run;

mariopellegrini · Posted 02-24-2023 03:11 AM

Thanks Tom. Looking closer, it was indeed an inner join. Seeing the original data, however, I still found something that doesn't add up, because the presence of additional fields creates double lines in the sql that disappear in the hash. Here is an example, I think it can be solved by introducing multidata="Y" also on the second hash:

data  OPE;
  input cod_contratto :$15. NUM_REVISIONE TS_FINE_VALIDITA :datetime.;
  format TS_FINE_VALIDITA datetime19.;
datalines;
100004790 35 01Jan1999:00:00:00
;
data AA;
input cod_contratto :$15. NUM_REVISIONE extra ;
datalines;
100004790 35 1
100004790 35 2
;
 
proc sql;
create table W1140ABS  as
  select distinct
     aa.cod_contratto,
	 AA.NUM_REVISIONE,
	 AA.extra,
     OPE.TS_FINE_VALIDITA
  from AA
  inner join  OPE
    ON AA.COD_CONTRATTO = OPE.COD_CONTRATTO
    and OPE.NUM_REVISIONE = AA.NUM_REVISIONE;
quit;

 
data W1140ABS_hash;
  if 0 then set OPE AA;
  if _n_=1 then do;
    declare hash h_merge(dataset:"OPE",multidata:'Y');
    rc = h_merge.DefineKey("COD_CONTRATTO", "NUM_REVISIONE");
    rc = h_merge.DefineData( "TS_FINE_VALIDITA");
    rc = h_merge.DefineDone();
    declare hash h_left();
    rc = h_left.DefineKey("COD_CONTRATTO", "NUM_REVISIONE");
    rc = h_left.DefineData("RC");
    rc = h_left.DefineDone();
  end;
  set AA  end=eof;
  if 0=h_left.add() then do;
    rc = h_merge.find(); 
    do while(not rc);
      output;
      rc = h_merge.find_next();
    end;
  end;
  drop rc ;
run;

divergence between proc sql join and hash

Re: divergence between proc sql join and hash

Re: divergence between proc sql join and hash

Re: divergence between proc sql join and hash

Re: divergence between proc sql join and hash

Re: divergence between proc sql join and hash

Re: divergence between proc sql join and hash

Re: divergence between proc sql join and hash

Registration is open

SAS Training: Just a Click Away