Dear Experts,
I will have to join the following two tables.
Table one (1 billion rows)
Data one;
input snapshotdate sourcekey sourcesystemid value1;
datalines;
20200101 112 5 788
20200102 112 5 789
20200103 112 5 800
20200201 112 5 786
20200202 112 5 777
20200203 112 5 834
20200301 112 5 789
20200302 112 5 771
20200303 112 5 832
20200101 222 6 788
20200102 222 6 789
20200103 222 6 800
20200201 222 6 786
20200202 222 6 777
20200203 222 6 834
20200301 222 6 789
20200302 222 6 771
20200303 222 6 832
;
run;
Table Two (32 million rows)
Data two;
input period sourcekey sourcesystemid npl;
datalines;
202001 112 5 999
202002 112 5 988
202001 222 6 555
202002 222 6 556
;
run;
I would like to get the joined table as below;
snapshotdate | sourcekey | sourcesystemid | value1 | NPL |
20200101 | 112 | 5 | 788 | 999 |
20200102 | 112 | 5 | 789 | 999 |
20200103 | 112 | 5 | 800 | 999 |
20200201 | 112 | 5 | 786 | 988 |
20200202 | 112 | 5 | 777 | 988 |
20200203 | 112 | 5 | 834 | 988 |
20200301 | 112 | 5 | 789 | 988 |
20200302 | 112 | 5 | 771 | 988 |
20200303 | 112 | 5 | 832 | 988 |
20200101 | 222 | 6 | 788 | 555 |
20200102 | 222 | 6 | 789 | 555 |
20200103 | 222 | 6 | 800 | 555 |
20200201 | 222 | 6 | 786 | 556 |
20200202 | 222 | 6 | 777 | 556 |
20200203 | 222 | 6 | 834 | 556 |
20200301 | 222 | 6 | 789 | 556 |
20200302 | 222 | 6 | 771 | 556 |
20200303 | 222 | 6 | 832 | 556 |
When there is a missing (year and month), it has to be filled with the latest available value (bolded),
The code I have currently: (not working to replace the missing values)
Proc SQL;
Create Table want as
Select
a.*,
b.npl
from one as a
left join two as b
on a.sourcekey =b.sourcekey and a.sourcesystemid = b.sourcesystemid and input(substr(put(a.snapshotdate,8.),1,6),6.) = b.period
order by a.sourcekey,a.snapshotdate
;
Quit;
Since it involves large tables, I prefer this to be coded using HASH. I would like to use table two as HASH object.
Thanks in advance.
Regards,
Myu
Editted note: just saw the info that data are sorted "by sourcekey, sourcesystemid, and snapshotperiod/period". So the program is modified.
If the data are sorted by sourcekey/snapshotdate (table1) and sourcekey/period (table 2), then you don't need hash. You can use a "last-observation-carried-forward" technique in which you (1) assign a dummy date to the period in table2 (i.e. the first date of the month specified by period, and (2) use a SET/BY combination, complemented by a conditional SET statement. No hash needed:
Data one;
input snapshotdate yymmdd8. sourcekey sourcesystemid value1;
format snapshotdate date9. ;
datalines;
20200101 112 5 788
20200102 112 5 789
20200103 112 5 800
20200201 112 5 786
20200202 112 5 777
20200203 112 5 834
20200301 112 5 789
20200302 112 5 771
20200303 112 5 832
20200101 222 6 788
20200102 222 6 789
20200103 222 6 800
20200201 222 6 786
20200202 222 6 777
20200203 222 6 834
20200301 222 6 789
20200302 222 6 771
20200303 222 6 832
;
run;
data two;
input period sourcekey sourcesystemid npl;
datalines;
202001 112 5 999
202002 112 5 988
202001 222 6 555
202002 222 6 556
;
run;
data need / view=need;
set two;
month_begin_date = mdy(mod(period,100),1,floor(period/100));
format month_begin_date date9. ;
run;
data want (drop=month_begin_date);
set need (keep=sourcekey sourcesystem month_begin_date rename=(month_begin_date=snapshotdate) in=in2)
one (in=in1) ;
by sourcesystem sourcekey snapshotdate ;
if in2 then set need; /*Get the other variables, automatically retained */
if in1;
output;
if last.sourcekey then call missing(of _all_);
run;
Notes:
If they are sorted I would consider using MERGE statement.
Bart
try this:
Data one;
input snapshotdate sourcekey sourcesystemid value1;
datalines;
20200101 112 5 788
20200102 112 5 789
20200103 112 5 800
20200201 112 5 786
20200202 112 5 777
20200203 112 5 834
20200301 112 5 789
20200302 112 5 771
20200303 112 5 832
20200101 222 6 788
20200102 222 6 789
20200103 222 6 800
20200201 222 6 786
20200202 222 6 777
20200203 222 6 834
20200301 222 6 789
20200302 222 6 771
20200303 222 6 832
;
run;
Data two;
input period sourcekey sourcesystemid npl;
datalines;
202001 112 5 999
202002 112 5 988
202001 222 6 555
202002 222 6 556
;
run;
data oneView / view = oneView;
set one;
period = floor(snapshotdate/100);
run;
data want;
merge oneView two;
by sourcesystemid sourcekey period;
_lag_npl_ = lag(npl);
if missing(npl) then npl = _lag_npl_;
drop _lag_npl_;
run;
Bart
Do the series in table 1 ALWAYS come in three's (2001,2002,2003)?
Editted note: just saw the info that data are sorted "by sourcekey, sourcesystemid, and snapshotperiod/period". So the program is modified.
If the data are sorted by sourcekey/snapshotdate (table1) and sourcekey/period (table 2), then you don't need hash. You can use a "last-observation-carried-forward" technique in which you (1) assign a dummy date to the period in table2 (i.e. the first date of the month specified by period, and (2) use a SET/BY combination, complemented by a conditional SET statement. No hash needed:
Data one;
input snapshotdate yymmdd8. sourcekey sourcesystemid value1;
format snapshotdate date9. ;
datalines;
20200101 112 5 788
20200102 112 5 789
20200103 112 5 800
20200201 112 5 786
20200202 112 5 777
20200203 112 5 834
20200301 112 5 789
20200302 112 5 771
20200303 112 5 832
20200101 222 6 788
20200102 222 6 789
20200103 222 6 800
20200201 222 6 786
20200202 222 6 777
20200203 222 6 834
20200301 222 6 789
20200302 222 6 771
20200303 222 6 832
;
run;
data two;
input period sourcekey sourcesystemid npl;
datalines;
202001 112 5 999
202002 112 5 988
202001 222 6 555
202002 222 6 556
;
run;
data need / view=need;
set two;
month_begin_date = mdy(mod(period,100),1,floor(period/100));
format month_begin_date date9. ;
run;
data want (drop=month_begin_date);
set need (keep=sourcekey sourcesystem month_begin_date rename=(month_begin_date=snapshotdate) in=in2)
one (in=in1) ;
by sourcesystem sourcekey snapshotdate ;
if in2 then set need; /*Get the other variables, automatically retained */
if in1;
output;
if last.sourcekey then call missing(of _all_);
run;
Notes:
A hash approach
data want(drop=period rc);
if _N_ = 1 then do;
dcl hash h(dataset : "two");
h.definekey("sourcekey", "sourcesystemid", "period");
h.definedata("npl");
h.definedone();
end;
set one;
if 0 then set two;
rc = h.find(key : sourcekey, key : sourcesystemid, key : int(snapshotdate/100));
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.