Good morning everyone. I encountered an error in a conversion from a sql proc (left join) to a left join with hash table: I report the example showing that the TS_END_VALIDITA field at the end differs between the 2 join versions
data OPE;
input cod_contratto:$15. NUM_REVISIONE TS_FINE_VALIDITA:datetime18.;
format TS_FINE_VALIDITA datetime18.;
datalines;
100004790 35 01Jan1999:00:00:00
100004790 35 14Sep2022:00:00:00
;
data AA;
input cod_contratto:$15. NUM_REVISIONE;
datalines;
100004790 35
100004790 35
;
proc sql;
create table W1140ABS as
select distinct OPE.COD_CONTRATTO
,OPE.NUM_REVISIONE as PRG_REV_CONTR
,OPE.TS_FINE_VALIDITA format=datetime18.
from AA
left join OPE
ON AA.COD_CONTRATTO = OPE.COD_CONTRATTO
and OPE.NUM_REVISIONE = AA.NUM_REVISIONE;
quit;
data W1140ABS_hash;
if 0 then set OPE AA;
declare hash h_merge(dataset:"OPE");
rc = h_merge.DefineKey("COD_CONTRATTO", "NUM_REVISIONE");
rc = h_merge.DefineData("TS_FINE_VALIDITA");
rc = h_merge.DefineDone();
drop rc;
do while (not eof);
set AA end=eof;
rc = h_merge.find();
output;
end;
run;
Mario
The hash object allows only unique keys; While the dataset is read into the object, observations with duplicate keys are discarded.
This means that only the first obs from OPE is present in the object, and only its value makes it into the resulting dataset; the SQL first creates a cartesian join with 4 observations, two of which are discarded when DISTINCT is applied.
multidata option: "Y" in hash object declaration why doesn't it resolve? are there no other solutions?
You still have to iterate through the hash for all matches, and then apply an equivalent of the DISTINCT (e.g. use PROC SORT with NODUPKEY, or use a second hash to store all combinations you already wrote to the output).
MULTIDATA='Y' will help. But then you have to do more work. To implement the DISTINCT feature you had in the SQL query then make a second HASH to keep track of which observations from the non-hashed dataset have already occured.
Also let's not rename one of the variables in the SQL so we can use PROC COMPARE to test the results.
While we are at it there is no need to attach the format again to the other variable. (And to get DATETIME to properly display the year you need a width of at least 19, some type of bug that SAS is not planning to fix).
And your SQL query makes no sense as a LEFT join since you are only selecting variables from the RIGHT table. So converted it to an INNER join.
data OPE;
input cod_contratto :$15. NUM_REVISIONE TS_FINE_VALIDITA :datetime18.;
format TS_FINE_VALIDITA datetime19.;
datalines;
100004790 35 01Jan1999:00:00:00
100004790 35 14Sep2022:00:00:00
x 12
;
data AA;
input cod_contratto :$15. NUM_REVISIONE;
datalines;
100004790 35
100004790 35
y 24
;
proc sql;
create table W1140ABS as
select distinct
OPE.COD_CONTRATTO
,OPE.NUM_REVISIONE /* as PRG_REV_CONTR */
,OPE.TS_FINE_VALIDITA /* format=datetime18. */
from AA
inner join OPE
ON AA.COD_CONTRATTO = OPE.COD_CONTRATTO
and OPE.NUM_REVISIONE = AA.NUM_REVISIONE
;
quit;
data W1140ABS_hash;
if 0 then set OPE AA;
if _n_=1 then do;
declare hash h_merge(dataset:"OPE",multidata:'Y');
rc = h_merge.DefineKey("COD_CONTRATTO", "NUM_REVISIONE");
rc = h_merge.DefineData("TS_FINE_VALIDITA");
rc = h_merge.DefineDone();
declare hash h_left();
rc = h_left.DefineKey("COD_CONTRATTO", "NUM_REVISIONE");
rc = h_left.DefineData("RC");
rc = h_left.DefineDone();
end;
set AA end=eof;
if 0=h_left.add() then do;
rc = h_merge.find();
do while(not rc);
output;
rc = h_merge.find_next();
end;
end;
drop rc ;
run;
proc compare data=W1140ABS compare=W1140ABS_hash listall;
run;
Thank you very much ... a bit complex as a code! actually I would need a left join because in the original query I also have other variables and I have to get the exact rows of the left query. Can you modify the hash you wrote to bring back the left join? Thank you!
Sure. You need to add another variable to the LEFT dataset so you can see what you are joining.
So here is a cleaned up example:
data OPE;
input cod_contratto :$15. NUM_REVISIONE TS_FINE_VALIDITA :datetime.;
format TS_FINE_VALIDITA datetime19.;
datalines;
100004790 35 01Jan1999:00:00:00
100004790 35 14Sep2022:00:00:00
x 12
;
data AA;
input cod_contratto :$15. NUM_REVISIONE extra ;
datalines;
100004790 35 1
100004790 35 2
y 24 1
;
proc sql;
create table W1140ABS as
select distinct
aa.*
,OPE.TS_FINE_VALIDITA /* format=datetime19. */
from AA
left join OPE
ON AA.COD_CONTRATTO = OPE.COD_CONTRATTO
and OPE.NUM_REVISIONE = AA.NUM_REVISIONE
;
quit;
data W1140ABS_hash;
if 0 then set OPE AA;
if _n_=1 then do;
declare hash h_merge(dataset:"OPE",multidata:'Y');
rc = h_merge.DefineKey("COD_CONTRATTO", "NUM_REVISIONE");
rc = h_merge.DefineData("TS_FINE_VALIDITA");
rc = h_merge.DefineDone();
declare hash h_left();
rc = h_left.DefineKey("COD_CONTRATTO", "NUM_REVISIONE", "extra");
rc = h_left.DefineData("RC");
rc = h_left.DefineDone();
end;
set AA end=eof;
call missing(of TS_FINE_VALIDITA);
if 0=h_left.add() then do;
rc = h_merge.find();
do until(rc);
output;
rc = h_merge.find_next();
end;
put;
end;
drop rc ;
run;
proc compare data=W1140ABS compare=W1140ABS_hash listall;
run;
proc print data=W1140ABS;
run;
Thanks Tom. Looking closer, it was indeed an inner join. Seeing the original data, however, I still found something that doesn't add up, because the presence of additional fields creates double lines in the sql that disappear in the hash. Here is an example, I think it can be solved by introducing multidata="Y" also on the second hash:
data OPE;
input cod_contratto :$15. NUM_REVISIONE TS_FINE_VALIDITA :datetime.;
format TS_FINE_VALIDITA datetime19.;
datalines;
100004790 35 01Jan1999:00:00:00
;
data AA;
input cod_contratto :$15. NUM_REVISIONE extra ;
datalines;
100004790 35 1
100004790 35 2
;
proc sql;
create table W1140ABS as
select distinct
aa.cod_contratto,
AA.NUM_REVISIONE,
AA.extra,
OPE.TS_FINE_VALIDITA
from AA
inner join OPE
ON AA.COD_CONTRATTO = OPE.COD_CONTRATTO
and OPE.NUM_REVISIONE = AA.NUM_REVISIONE;
quit;
data W1140ABS_hash;
if 0 then set OPE AA;
if _n_=1 then do;
declare hash h_merge(dataset:"OPE",multidata:'Y');
rc = h_merge.DefineKey("COD_CONTRATTO", "NUM_REVISIONE");
rc = h_merge.DefineData( "TS_FINE_VALIDITA");
rc = h_merge.DefineDone();
declare hash h_left();
rc = h_left.DefineKey("COD_CONTRATTO", "NUM_REVISIONE");
rc = h_left.DefineData("RC");
rc = h_left.DefineDone();
end;
set AA end=eof;
if 0=h_left.add() then do;
rc = h_merge.find();
do while(not rc);
output;
rc = h_merge.find_next();
end;
end;
drop rc ;
run;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.