Solved: Re: Left join in HASH with missing values to be replaced with latest a...

Myurathan · Posted 07-27-2020 10:51 AM

Dear Experts,

I will have to join the following two tables.

Table one (1 billion rows)

Data one;
input snapshotdate sourcekey sourcesystemid value1;
datalines;
20200101 112 5 788
20200102 112 5 789
20200103 112 5 800
20200201 112 5 786
20200202 112 5 777
20200203 112 5 834
20200301 112 5 789
20200302 112 5 771
20200303 112 5 832
20200101 222 6 788
20200102 222 6 789
20200103 222 6 800
20200201 222 6 786
20200202 222 6 777
20200203 222 6 834
20200301 222 6 789
20200302 222 6 771
20200303 222 6 832
;
run;

Table Two (32 million rows)

Data two;
input period sourcekey sourcesystemid npl;
datalines;
202001 112 5 999
202002 112 5 988
202001 222 6 555
202002 222 6 556
;
run;

I would like to get the joined table as below;

snapshotdate	sourcekey	sourcesystemid	value1	NPL
20200101	112	5	788	999
20200102	112	5	789	999
20200103	112	5	800	999
20200201	112	5	786	988
20200202	112	5	777	988
20200203	112	5	834	988
20200301	112	5	789	988
20200302	112	5	771	988
20200303	112	5	832	988
20200101	222	6	788	555
20200102	222	6	789	555
20200103	222	6	800	555
20200201	222	6	786	556
20200202	222	6	777	556
20200203	222	6	834	556
20200301	222	6	789	556
20200302	222	6	771	556
20200303	222	6	832	556

When there is a missing (year and month), it has to be filled with the latest available value (bolded),

The code I have currently: (not working to replace the missing values)

Proc SQL;
Create Table want as
Select 
a.*,
b.npl
from one as a
left join two as b
on a.sourcekey =b.sourcekey and a.sourcesystemid = b.sourcesystemid and input(substr(put(a.snapshotdate,8.),1,6),6.) = b.period
order by a.sourcekey,a.snapshotdate
;
Quit;

Since it involves large tables, I prefer this to be coded using HASH. I would like to use table two as HASH object.

Thanks in advance.

Regards,

Myu

mkeintz · Posted 07-27-2020 12:19 PM

Editted note: just saw the info that data are sorted "by sourcekey, sourcesystemid, and snapshotperiod/period". So the program is modified.

If the data are sorted by sourcekey/snapshotdate (table1) and sourcekey/period (table 2), then you don't need hash. You can use a "last-observation-carried-forward" technique in which you (1) assign a dummy date to the period in table2 (i.e. the first date of the month specified by period, and (2) use a SET/BY combination, complemented by a conditional SET statement. No hash needed:

Data one;
input snapshotdate yymmdd8. sourcekey sourcesystemid value1;
format snapshotdate date9. ;
datalines;
20200101 112 5 788
20200102 112 5 789
20200103 112 5 800
20200201 112 5 786
20200202 112 5 777
20200203 112 5 834
20200301 112 5 789
20200302 112 5 771
20200303 112 5 832
20200101 222 6 788
20200102 222 6 789
20200103 222 6 800
20200201 222 6 786
20200202 222 6 777
20200203 222 6 834
20200301 222 6 789
20200302 222 6 771
20200303 222 6 832
;
run;

data two;
input period sourcekey sourcesystemid npl;
datalines;
202001 112 5 999
202002 112 5 988
202001 222 6 555
202002 222 6 556
;
run;

data need / view=need;
  set two;
  month_begin_date = mdy(mod(period,100),1,floor(period/100));
  format month_begin_date date9. ;
run;

data want (drop=month_begin_date);
  set need (keep=sourcekey sourcesystem month_begin_date rename=(month_begin_date=snapshotdate) in=in2)
      one (in=in1) ;
  by sourcesystem sourcekey snapshotdate ;
  if in2 then set need; /*Get the other variables, automatically retained */
  if in1;
  output;
  if last.sourcekey then call missing(of _all_);
run;

Notes:

DATA NEED is a data set VIEW, not a data set FILE. So it never gets written to disk. Instead it is activated (sort of like instantiating a hash object) only when it is referred to in a later step. So basically it's piping the data from table TWO to the data want step.
The first set/by reads only the relevant sort variable from need. And only when the incoming observation gets a need (i.e. table two) record, does it "set" the other varibles in need. The advantage here is that those other variables (sourcesystemid npl), since they are only read by this conditional set, are automatically retained until the next time this conditional set is executed - so missing periods in table 2 just inherit the most recent table 2 values.
I.e this is "last observation carried forward".
The explicit OUTPUT statement followed by the "if last.sourcekey then call missig(of _all_)" is a way to prevent ~~sourcesystemid and~~ npl from being carried forward from one sourcekey to the next. Now if your table 2 ALWAYS has a monthly record matching or preceding the first corresponding sourcekey in table 1, you wouldn't need those two statements.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

yabwon · Posted 07-27-2020 11:45 AM

Are those tables sorted by sourcekey and snapshotdate/period ?

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Myurathan · Posted 07-27-2020 11:52 AM

Yes, both tables are sorted by sourcekey, sourcesystemid, and snapshotperiod/period

yabwon · Posted 07-27-2020 12:03 PM

If they are sorted I would consider using MERGE statement.

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Myurathan · Posted 07-27-2020 12:10 PM

@yabwon, I am not much used to merge statement. Could you please help me with the code. Thanks in advance.

yabwon · Posted 07-27-2020 12:19 PM

try this:

Data one;
input snapshotdate sourcekey sourcesystemid value1;
datalines;
20200101 112 5 788
20200102 112 5 789
20200103 112 5 800
20200201 112 5 786
20200202 112 5 777
20200203 112 5 834
20200301 112 5 789
20200302 112 5 771
20200303 112 5 832
20200101 222 6 788
20200102 222 6 789
20200103 222 6 800
20200201 222 6 786
20200202 222 6 777
20200203 222 6 834
20200301 222 6 789
20200302 222 6 771
20200303 222 6 832
;
run;

Data two;
input period sourcekey sourcesystemid npl;
datalines;
202001 112 5 999
202002 112 5 988
202001 222 6 555
202002 222 6 556
;
run;

data oneView / view = oneView;
  set one;
  period = floor(snapshotdate/100);
run;

data want;
  merge oneView two;
  by sourcesystemid sourcekey period;
  _lag_npl_ = lag(npl);
  if missing(npl) then npl = _lag_npl_;
  drop _lag_npl_;
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

mkeintz · Posted 07-27-2020 11:50 AM

Do the series in table 1 ALWAYS come in three's (2001,2002,2003)?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Myurathan · Posted 07-27-2020 11:53 AM

@mkeintz, It will have monthly information for many years.

mkeintz · Posted 07-27-2020 12:19 PM

Editted note: just saw the info that data are sorted "by sourcekey, sourcesystemid, and snapshotperiod/period". So the program is modified.

If the data are sorted by sourcekey/snapshotdate (table1) and sourcekey/period (table 2), then you don't need hash. You can use a "last-observation-carried-forward" technique in which you (1) assign a dummy date to the period in table2 (i.e. the first date of the month specified by period, and (2) use a SET/BY combination, complemented by a conditional SET statement. No hash needed:

Data one;
input snapshotdate yymmdd8. sourcekey sourcesystemid value1;
format snapshotdate date9. ;
datalines;
20200101 112 5 788
20200102 112 5 789
20200103 112 5 800
20200201 112 5 786
20200202 112 5 777
20200203 112 5 834
20200301 112 5 789
20200302 112 5 771
20200303 112 5 832
20200101 222 6 788
20200102 222 6 789
20200103 222 6 800
20200201 222 6 786
20200202 222 6 777
20200203 222 6 834
20200301 222 6 789
20200302 222 6 771
20200303 222 6 832
;
run;

data two;
input period sourcekey sourcesystemid npl;
datalines;
202001 112 5 999
202002 112 5 988
202001 222 6 555
202002 222 6 556
;
run;

data need / view=need;
  set two;
  month_begin_date = mdy(mod(period,100),1,floor(period/100));
  format month_begin_date date9. ;
run;

data want (drop=month_begin_date);
  set need (keep=sourcekey sourcesystem month_begin_date rename=(month_begin_date=snapshotdate) in=in2)
      one (in=in1) ;
  by sourcesystem sourcekey snapshotdate ;
  if in2 then set need; /*Get the other variables, automatically retained */
  if in1;
  output;
  if last.sourcekey then call missing(of _all_);
run;

Notes:

DATA NEED is a data set VIEW, not a data set FILE. So it never gets written to disk. Instead it is activated (sort of like instantiating a hash object) only when it is referred to in a later step. So basically it's piping the data from table TWO to the data want step.
The first set/by reads only the relevant sort variable from need. And only when the incoming observation gets a need (i.e. table two) record, does it "set" the other varibles in need. The advantage here is that those other variables (sourcesystemid npl), since they are only read by this conditional set, are automatically retained until the next time this conditional set is executed - so missing periods in table 2 just inherit the most recent table 2 values.
I.e this is "last observation carried forward".
The explicit OUTPUT statement followed by the "if last.sourcekey then call missig(of _all_)" is a way to prevent ~~sourcesystemid and~~ npl from being carried forward from one sourcekey to the next. Now if your table 2 ALWAYS has a monthly record matching or preceding the first corresponding sourcekey in table 1, you wouldn't need those two statements.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

PeterClemmensen · Posted 07-27-2020 02:42 PM

A hash approach

data want(drop=period rc);
   if _N_ = 1 then do;
      dcl hash h(dataset : "two");
      h.definekey("sourcekey", "sourcesystemid", "period");
      h.definedata("npl");
      h.definedone();
   end;

   set one;
   if 0 then set two;

   rc = h.find(key : sourcekey, key : sourcesystemid, key : int(snapshotdate/100));
run;

The DATA to DATA Step Macro
Blog: SASnrd

Left join in HASH with missing values to be replaced with latest available value

Re: Left join in HASH with missing values to be replaced with latest available value

Re: Left join in HASH with missing values to be replaced with latest available value

Re: Left join in HASH with missing values to be replaced with latest available value

Re: Left join in HASH with missing values to be replaced with latest available value

Re: Left join in HASH with missing values to be replaced with latest available value

Re: Left join in HASH with missing values to be replaced with latest available value

Re: Left join in HASH with missing values to be replaced with latest available value

Re: Left join in HASH with missing values to be replaced with latest available value

Re: Left join in HASH with missing values to be replaced with latest available value

Re: Left join in HASH with missing values to be replaced with latest available value

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away