Dear Experts,
I will have to join the following two tables.
Table one (1 billion rows)
Data one;
input snapshotdate sourcekey sourcesystemid value1;
datalines;
20200101 112 5 788
20200102 112 5 789
20200103 112 5 800
20200201 112 5 786
20200202 112 5 777
20200203 112 5 834
20200301 112 5 789
20200302 112 5 771
20200303 112 5 832
20200101 222 6 788
20200102 222 6 789
20200103 222 6 800
20200201 222 6 786
20200202 222 6 777
20200203 222 6 834
20200301 222 6 789
20200302 222 6 771
20200303 222 6 832
;
run;
Table Two (32 million rows)
Data two;
input period sourcekey sourcesystemid npl;
datalines;
202001 112 5 999
202002 112 5 988
202001 222 6 555
202002 222 6 556
;
run;
I would like to get the joined table as below;
snapshotdate | sourcekey | sourcesystemid | value1 | NPL |
20200101 | 112 | 5 | 788 | 999 |
20200102 | 112 | 5 | 789 | 999 |
20200103 | 112 | 5 | 800 | 999 |
20200201 | 112 | 5 | 786 | 988 |
20200202 | 112 | 5 | 777 | 988 |
20200203 | 112 | 5 | 834 | 988 |
20200301 | 112 | 5 | 789 | 988 |
20200302 | 112 | 5 | 771 | 988 |
20200303 | 112 | 5 | 832 | 988 |
20200101 | 222 | 6 | 788 | 555 |
20200102 | 222 | 6 | 789 | 555 |
20200103 | 222 | 6 | 800 | 555 |
20200201 | 222 | 6 | 786 | 556 |
20200202 | 222 | 6 | 777 | 556 |
20200203 | 222 | 6 | 834 | 556 |
20200301 | 222 | 6 | 789 | 556 |
20200302 | 222 | 6 | 771 | 556 |
20200303 | 222 | 6 | 832 | 556 |
When there is a missing (year and month), it has to be filled with the latest available value (bolded),
The code I have currently: (not working to replace the missing values)
Proc SQL;
Create Table want as
Select
a.*,
b.npl
from one as a
left join two as b
on a.sourcekey =b.sourcekey and a.sourcesystemid = b.sourcesystemid and input(substr(put(a.snapshotdate,8.),1,6),6.) = b.period
order by a.sourcekey,a.snapshotdate
;
Quit;
Since it involves large tables, I prefer this to be coded using HASH. I would like to use table two as HASH object.
Thanks in advance.
Regards,
Myu
Editted note: just saw the info that data are sorted "by sourcekey, sourcesystemid, and snapshotperiod/period". So the program is modified.
If the data are sorted by sourcekey/snapshotdate (table1) and sourcekey/period (table 2), then you don't need hash. You can use a "last-observation-carried-forward" technique in which you (1) assign a dummy date to the period in table2 (i.e. the first date of the month specified by period, and (2) use a SET/BY combination, complemented by a conditional SET statement. No hash needed:
Data one;
input snapshotdate yymmdd8. sourcekey sourcesystemid value1;
format snapshotdate date9. ;
datalines;
20200101 112 5 788
20200102 112 5 789
20200103 112 5 800
20200201 112 5 786
20200202 112 5 777
20200203 112 5 834
20200301 112 5 789
20200302 112 5 771
20200303 112 5 832
20200101 222 6 788
20200102 222 6 789
20200103 222 6 800
20200201 222 6 786
20200202 222 6 777
20200203 222 6 834
20200301 222 6 789
20200302 222 6 771
20200303 222 6 832
;
run;
data two;
input period sourcekey sourcesystemid npl;
datalines;
202001 112 5 999
202002 112 5 988
202001 222 6 555
202002 222 6 556
;
run;
data need / view=need;
set two;
month_begin_date = mdy(mod(period,100),1,floor(period/100));
format month_begin_date date9. ;
run;
data want (drop=month_begin_date);
set need (keep=sourcekey sourcesystem month_begin_date rename=(month_begin_date=snapshotdate) in=in2)
one (in=in1) ;
by sourcesystem sourcekey snapshotdate ;
if in2 then set need; /*Get the other variables, automatically retained */
if in1;
output;
if last.sourcekey then call missing(of _all_);
run;
Notes:
If they are sorted I would consider using MERGE statement.
Bart
try this:
Data one;
input snapshotdate sourcekey sourcesystemid value1;
datalines;
20200101 112 5 788
20200102 112 5 789
20200103 112 5 800
20200201 112 5 786
20200202 112 5 777
20200203 112 5 834
20200301 112 5 789
20200302 112 5 771
20200303 112 5 832
20200101 222 6 788
20200102 222 6 789
20200103 222 6 800
20200201 222 6 786
20200202 222 6 777
20200203 222 6 834
20200301 222 6 789
20200302 222 6 771
20200303 222 6 832
;
run;
Data two;
input period sourcekey sourcesystemid npl;
datalines;
202001 112 5 999
202002 112 5 988
202001 222 6 555
202002 222 6 556
;
run;
data oneView / view = oneView;
set one;
period = floor(snapshotdate/100);
run;
data want;
merge oneView two;
by sourcesystemid sourcekey period;
_lag_npl_ = lag(npl);
if missing(npl) then npl = _lag_npl_;
drop _lag_npl_;
run;
Bart
Do the series in table 1 ALWAYS come in three's (2001,2002,2003)?
Editted note: just saw the info that data are sorted "by sourcekey, sourcesystemid, and snapshotperiod/period". So the program is modified.
If the data are sorted by sourcekey/snapshotdate (table1) and sourcekey/period (table 2), then you don't need hash. You can use a "last-observation-carried-forward" technique in which you (1) assign a dummy date to the period in table2 (i.e. the first date of the month specified by period, and (2) use a SET/BY combination, complemented by a conditional SET statement. No hash needed:
Data one;
input snapshotdate yymmdd8. sourcekey sourcesystemid value1;
format snapshotdate date9. ;
datalines;
20200101 112 5 788
20200102 112 5 789
20200103 112 5 800
20200201 112 5 786
20200202 112 5 777
20200203 112 5 834
20200301 112 5 789
20200302 112 5 771
20200303 112 5 832
20200101 222 6 788
20200102 222 6 789
20200103 222 6 800
20200201 222 6 786
20200202 222 6 777
20200203 222 6 834
20200301 222 6 789
20200302 222 6 771
20200303 222 6 832
;
run;
data two;
input period sourcekey sourcesystemid npl;
datalines;
202001 112 5 999
202002 112 5 988
202001 222 6 555
202002 222 6 556
;
run;
data need / view=need;
set two;
month_begin_date = mdy(mod(period,100),1,floor(period/100));
format month_begin_date date9. ;
run;
data want (drop=month_begin_date);
set need (keep=sourcekey sourcesystem month_begin_date rename=(month_begin_date=snapshotdate) in=in2)
one (in=in1) ;
by sourcesystem sourcekey snapshotdate ;
if in2 then set need; /*Get the other variables, automatically retained */
if in1;
output;
if last.sourcekey then call missing(of _all_);
run;
Notes:
A hash approach
data want(drop=period rc);
if _N_ = 1 then do;
dcl hash h(dataset : "two");
h.definekey("sourcekey", "sourcesystemid", "period");
h.definedata("npl");
h.definedone();
end;
set one;
if 0 then set two;
rc = h.find(key : sourcekey, key : sourcesystemid, key : int(snapshotdate/100));
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.