duplicate value in hash table

mariopellegrini · Posted 04-28-2023 06:34 AM

I have a problem of not converting between a join made with proc sql and the same join made with sas hash table. There is a record that is being duplicated in the sql proc, but not with the use of hashes. I give the example:

data RUOLO_DANNI;
input cod_source:$2. cod_anagrafico:$9. ts_inizio_validita:datetime20.  ts_fine_validita:datetime20.;
format ts_inizio_validita ts_fine_validita datetime20.;
datalines;
1 1044576 29JUL2019:12:14:09 29JUL2019:12:14:55
1 1044576 29JUL2019:12:14:56 01JAN5999:00:00:00
1 1044576 29JUL2019:12:14:02 01JAN5999:00:00:00
;
data T_A_SOGGETTI_DN;
input cod_source:$2. cod_anagrafico:$9. cod_anagrafico_edwh:$20. ts_inizio_validita:datetime20.  ts_fine_validita:datetime20.;
format ts_inizio_validita ts_fine_validita datetime20.;
datalines;
1 1044576 F_LLSSTR03A61D612H 29JUL2019:12:14:01 01JAN5999:00:00:00
1 1044576 F_LLSSTR03A61D612H 29JUL2019:12:14:08 29JUL2019:12:14:00
;
/*SQL*/
proc sql noprint;
create table ruolo_danni_NEW as 
select RU.*,
       SO.cod_anagrafico_EDWH,
	   SO.ts_fine_validita as tsfv
from RUOLO_DANNI RU
left join T_A_SOGGETTI_DN SO
   on        
RU.cod_anagrafico = SO.cod_anagrafico
and ru.ts_inizio_validita between so.ts_inizio_validita and so.ts_fine_validita
and compress(SO.cod_source)='1';
quit;
/*HASH*/
data ruolo_danni_HASH;
set ruolo_danni;
if _n_ = 1
then do;
  length ti tf 8;
  length cod_anagrafico_edwh $20.;
  declare hash so (multidata:'y',dataset:"t_a_soggetti_dn (where=(compress(cod_source)='1') 
                            rename=(ts_inizio_validita=ti ts_fine_validita=tf))");
  so.definekey("cod_anagrafico");
  so.definedata("cod_anagrafico_edwh","ti","tf");
  so.definedone();
end;
rc = so.find();
if rc >= 0;
if ti le ts_inizio_validita le tf then cod_anagrafico_edwh2=cod_anagrafico_edwh; else cod_anagrafico_edwh2="";
rename cod_source2=cod_source cod_anagrafico_edwh2=cod_anagrafico_edwh;
drop rc ;
run;

AhmedAl_Attar · Posted 04-28-2023 06:58 AM

Hi @mariopellegrini

Have a look at this paper Using the SAS® Hash Object with Duplicate Key Entries

It may provide with more explanation.

Hope this helps

mariopellegrini · Posted 04-28-2023 10:31 AM

I tried but I can't get the result in my specific case

Patrick · Posted 04-29-2023 09:52 PM

@mariopellegrini wrote:

I tried but I can't get the result in my specific case

That's eventually because of your use of the SQL Between Operator.

BETWEEN Operator

With your sample data above statement is important. I assume the result your SQL returns is not what you intended to get.

data master;
  input m_id m_group_id:$1. m_date:datetime20.;
  format m_date datetime20.;
  datalines;
1 a 29JUL2019:12:14:09
2 a 29JUL2019:12:14:56
3 a 29JUL2019:12:14:02
;

data lookup;
  input l_id l_group_id:$1. l_start_dt:datetime20.  l_end_dt:datetime20.;
  format l_start_dt l_end_dt datetime20.;
  datalines;
1 a 29JUL2019:12:14:01 01JAN5999:00:00:00
2 a 29JUL2019:12:14:08 29JUL2019:12:14:00
;

proc sql;
    select 
      m_id,
      l_id,
      l_start_dt,
      m_date,
      l_end_dt,
      (m_date between l_start_dt and l_end_dt)      as check_logic_1,
      (m_date >= l_start_dt and m_date <= l_end_dt) as check_logic_2
    from master m
      left join lookup l
        on        
        m_group_id = l_group_id
        and m_date between l_start_dt and l_end_dt
    order by m_id, l_id
    ;
quit;

Assuming you don't want to select row 4 a data step version could look like below:

data hashlookup_version;
  if _n_=1 then
    do;
      if _n_=1 then set lookup;
      dcl hash h1(dataset:'lookup' multidata:'y');
      h1.defineKey('l_group_id');
      h1.defineData(all:'y');
      h1.defineDone();
    end;
  call missing(of _all_);

  set master;

  do while(h1.do_over(key:strip(m_group_id))=0);
/*    if m_date >= l_start_dt and m_date <= l_end_dt then output;*/
    if l_start_dt <= m_date <= l_end_dt then output;
  end;
run;

proc print data=hashlookup_version;
  var
    m_id
    l_id
    l_start_dt
    m_date
    l_end_dt
    ;
run;

mariopellegrini · Posted 04-30-2023 03:30 AM

Thanks, you were very clear.
I think from what you're saying that the between operator of proc sql is a bit ambiguous...
so to get an equivalent result equivalence I should write this code, is it correct?

data hashlookup_version;
  if _n_=1 then
    do;
      if _n_=1 then set lookup;
      dcl hash h1(dataset:'lookup', multidata:'y');
      h1.defineKey('l_group_id');
      h1.defineData(all:'y');
      h1.defineDone();
    end;
  call missing(of _all_);

  set master;

  do while(h1.do_over(key:strip(m_group_id))=0);
    if (l_start_dt <= m_date <= l_end_dt) or
       (l_end_dt   <= m_date <= l_start_dt)

then output;
  end;
run;

Patrick · Posted 04-30-2023 05:06 AM

Pretty much, yes. But I actually believe that you shouldn't be using the between operator in your SQL as it likely doesn't return the desired result.

IF you want to mimic your current SQL where clause then the matching data step code should look like:

data ruolo_danni_NEW_hash;
  set ruolo_danni;

  if _n_ = 1 then
    do;
      length ti tf 8;
      length cod_anagrafico_edwh $20.;
      declare hash so (multidata:'y',dataset:"t_a_soggetti_dn (where=(compress(cod_source)='1') 
        rename=(ts_inizio_validita=ti ts_fine_validita=tf))");
      so.definekey("cod_anagrafico");
      so.definedata("cod_anagrafico_edwh","ti","tf");
      so.definedone();
    end;
  
  do while(so.do_over()=0);
    if 
      ( 
        ti<=ts_inizio_validita<= tf 
        or 
        tf<=ts_inizio_validita<= ti 
      ) 
      and compress(cod_source)='1'
      then
      output;
  end;

run;

mariopellegrini · Posted 04-30-2023 01:13 PM

Patrick, I've tried the code with other data and the results no longer match the sql. This conversion of sql to hash is becoming a puzzle...

data RUOLO_DANNI;
input cod_source:$2. cod_anagrafico:$9. ts_inizio_validita:datetime20.  ts_fine_validita:datetime20.;
format ts_inizio_validita ts_fine_validita datetime20.;
datalines;
1 1044576 18JUN2019:17:25:34 18JUN2019:17:25:40
1 1044576 18JUN2019:17:25:34 18JUN2019:17:25:40
1 1044576 18JUN2019:17:25:41 18JUN2019:17:38:53
;
data T_A_SOGGETTI_DN;
input cod_source:$2. cod_anagrafico:$9. cod_anagrafico_edwh:$20. ts_inizio_validita:datetime20.  ts_fine_validita:datetime20.;
format ts_inizio_validita ts_fine_validita datetime20.;
datalines;
1 1044576 F_LLSSTR03A61D612H 08APR2019:12:54:26 17JUN2019:09:49:53
1 1044576 F_LLSSTR03A61D612H 17JUN2019:09:49:54 18JUN2019:17:25:32
;
/*SQL*/
proc sql noprint;
create table ruolo_danni_SQL as 
select RU.*,
       SO.cod_anagrafico_EDWH,
	   SO.ts_fine_validita as tsfv
from RUOLO_DANNI RU
left join T_A_SOGGETTI_DN SO
   on        
RU.cod_anagrafico = SO.cod_anagrafico
and ru.ts_inizio_validita between so.ts_inizio_validita and so.ts_fine_validita
and compress(SO.cod_source)='1';
quit;
/*HASH*/
data ruolo_danni_hash;
  set ruolo_danni;

  if _n_ = 1 then
    do;
      length ti tf 8;
      length cod_anagrafico_edwh $20.;
      declare hash so (multidata:'y',dataset:"t_a_soggetti_dn (where=(compress(cod_source)='1') 
        rename=(ts_inizio_validita=ti ts_fine_validita=tf))");
      so.definekey("cod_anagrafico");
      so.definedata("cod_anagrafico_edwh","ti","tf");
      so.definedone();
    end;
  do while(so.do_over()=0);
    if 
      ( 
        ti<=ts_inizio_validita<= tf 
        or 
        tf<=ts_inizio_validita<= ti 
      ) 
      and compress(cod_source)='1'
      then
      output;
  end;
run;

Patrick · Posted 04-30-2023 10:00 PM

@mariopellegrini

Ah, yes... the data step code provided so far mimics an inner join. You would have to amend the logic for a left join.

mariopellegrini · Posted 05-01-2023 03:57 AM

Okay, thanks Patrick. Now by applying the code to numerous data, the number of final rows corresponds, but I encounter 2 problems:

1) the date filter doesn't seem to work, as shown by the put I inserted, from which it can be seen in the log that it doesn't print, therefore it doesn't enter the "if".

data RUOLO_DANNI;
input cod_source:$1. cod_anagrafico:$9. ts_inizio_validita:datetime20.  ts_fine_validita:datetime20.;
format ts_inizio_validita ts_fine_validita datetime20.;
datalines;
1 1044576 28JUN2019:10:57:45 28JUN2019:10:57:51
1 1044576 28JUN2019:10:57:45 28JUN2019:10:57:51
1 1044576 28JUN2019:10:57:52 28JUN2019:10:58:36
;
data T_A_SOGGETTI_DN;
input cod_source:$2. cod_anagrafico:$9. cod_anagrafico_edwh:$20. ts_inizio_validita:datetime20.  ts_fine_validita:datetime20.;
format ts_inizio_validita ts_fine_validita datetime20.;
datalines;
1 1044576 F_LLSSTR03A61D612H 12APR2019:03:32:12 29MAY2019:13:22:11
1 1044576 F_LLSSTR03A61D612H 29MAY2019:13:22:12 29MAY2019:13:22:45
;
/*SQL*/
proc sql noprint;
create table ruolo_danni_SQL as 
select RU.*,
       SO.cod_anagrafico_EDWH
from RUOLO_DANNI RU
left join T_A_SOGGETTI_DN SO
   on        
RU.cod_anagrafico = SO.cod_anagrafico
and ru.ts_inizio_validita between so.ts_inizio_validita and so.ts_fine_validita
and compress(SO.cod_source)='1';
quit;
/*HASH*/
data ruolo_danni_hash;
  set ruolo_danni;
  
  if _n_ = 1 then
    do;
      length ti tf 8;
      length cod_anagrafico_edwh $20.;
	  format ti tf datetime20.;
      declare hash so (multidata:'y',dataset:"t_a_soggetti_dn (where=(compress(cod_source)='1') 
        rename=(ts_inizio_validita=ti ts_fine_validita=tf))");
      so.definekey("cod_anagrafico");
      so.definedata("cod_anagrafico_edwh","ti","tf");
      so.definedone();
    end;
  do while(so.do_over()=0);
    if 
      ( 
        ti<=ts_inizio_validita<= tf 
        or 
        tf<=ts_inizio_validita<= ti 
      ) 
      and compress(cod_source)='1'
        then do; put cod_anagrafico=; _true_flg='1';output; end;
  end;
  if _true_flg ne '1' then output;
  drop _true_flg ;
run;

2) the processing time compared to proc sql has increased rather than decreased (2:08.60 versus 1:56.14), it would make the conversion of this step to a hash useless:

NOTE: PROCEDURE SQL ha utilizzato (tempo totale di elaborazione):
real time 1:56.14
user cpu time 1:41.01
system cpu time 23.65 seconds
memory 9414131.15k
OS Memory 9432964.00k
Timestamp 01/05/2023 08:14:10 m.
Step Count 10 Switch Count 85

----------------------------------------------------------------------------

NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column).
NOTE: Variable ti is uninitialized.
NOTE: Variable tf is uninitialized.
NOTE: Variable cod_anagrafico_edwh is uninitialized.
NOTE: There were 6927902 observations read from the data set EDWH_ODS.T_A_SOGGETTI_DN.
WHERE COMPRESS(cod_source)='1';
NOTE: There were 66878514 observations read from the data set STG_VT.RUOLO_DANNI.
NOTE: The data set WORK.RUOLO_DANNI_HASH2 has 66878637 observations and 11 variables.
NOTE: DATA statement ha utilizzato (tempo totale di elaborazione):
real time 2:08.60
user cpu time 1:07.26
system cpu time 13.15 seconds
memory 1252530.56k
OS Memory 1275076.00k
Timestamp 01/05/2023 08:16:18 m.
Step Count 11 Switch Count 208

Patrick · Posted 05-01-2023 05:03 AM

If the data is already pre-sorted by your key then using a data step hash might not help with run-times and it also won't make your code easier to read and maintain.

And even if you would gain a bit of a performance gain I wouldn't go for mimicking a SQL left join via a data step hash unless the performance gain is really critical.

I consider for real implementation code that's easy to understand and maintain as more important than performance. I only go for more complicated better performant code over simplicity if it's really necessary.

mariopellegrini · Posted 05-01-2023 08:22 AM

OK, I agree. However, out of curiosity, I would like to understand if it is possible to arrive at an exact conversion and solve question n.1 concerning the filter on "ts_beginning_validity". I don't understand why it doesn't work

duplicate value in hash table

Re: duplicate value in hash table

Re: duplicate value in hash table

Re: duplicate value in hash table

Re: duplicate value in hash table

Re: duplicate value in hash table

Re: duplicate value in hash table

Re: duplicate value in hash table

Re: duplicate value in hash table

Re: duplicate value in hash table

Re: duplicate value in hash table

Registration is open

SAS Training: Just a Click Away