BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Myurathan
Quartz | Level 8

Dear Experts,

 

I will have to join the following two tables.

Table one (1 billion rows)

Data one;
input snapshotdate sourcekey sourcesystemid value1;
datalines;
20200101 112 5 788
20200102 112 5 789
20200103 112 5 800
20200201 112 5 786
20200202 112 5 777
20200203 112 5 834
20200301 112 5 789
20200302 112 5 771
20200303 112 5 832
20200101 222 6 788
20200102 222 6 789
20200103 222 6 800
20200201 222 6 786
20200202 222 6 777
20200203 222 6 834
20200301 222 6 789
20200302 222 6 771
20200303 222 6 832
;
run;

Table Two (32 million rows)

Data two;
input period sourcekey sourcesystemid npl;
datalines;
202001 112 5 999
202002 112 5 988
202001 222 6 555
202002 222 6 556
;
run;

I would like to get the joined table as below;

snapshotdatesourcekeysourcesystemidvalue1NPL
202001011125788999
202001021125789999
202001031125800999
202002011125786988
202002021125777988
202002031125834988
202003011125789988
202003021125771988
202003031125832988
202001012226788555
202001022226789555
202001032226800555
202002012226786556
202002022226777556
202002032226834556
202003012226789556
202003022226771556
202003032226832556

 

When there is a missing (year and month), it has to be filled with the latest available value (bolded), 

The code I have currently: (not working to replace the missing values)

Proc SQL;
Create Table want as
Select 
a.*,
b.npl
from one as a
left join two as b
on a.sourcekey =b.sourcekey and a.sourcesystemid = b.sourcesystemid and input(substr(put(a.snapshotdate,8.),1,6),6.) = b.period
order by a.sourcekey,a.snapshotdate
;
Quit;

Since it involves large tables, I prefer this to be coded using HASH. I would like to use table two  as HASH object. 

 

Thanks in advance. 

 

Regards,

Myu

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

Editted note: just saw the info that data are sorted "by sourcekey, sourcesystemid, and snapshotperiod/period".  So the program is modified.

 

If the data are sorted by sourcekey/snapshotdate (table1) and sourcekey/period (table 2), then you don't need hash.  You can use a "last-observation-carried-forward" technique in which you (1) assign a dummy date to the period in table2 (i.e.  the first date of the month specified by period, and (2) use a SET/BY combination, complemented by a conditional SET statement.  No hash needed:

 

Data one;
input snapshotdate yymmdd8. sourcekey sourcesystemid value1;
format snapshotdate date9. ;
datalines;
20200101 112 5 788
20200102 112 5 789
20200103 112 5 800
20200201 112 5 786
20200202 112 5 777
20200203 112 5 834
20200301 112 5 789
20200302 112 5 771
20200303 112 5 832
20200101 222 6 788
20200102 222 6 789
20200103 222 6 800
20200201 222 6 786
20200202 222 6 777
20200203 222 6 834
20200301 222 6 789
20200302 222 6 771
20200303 222 6 832
;
run;

data two;
input period sourcekey sourcesystemid npl;
datalines;
202001 112 5 999
202002 112 5 988
202001 222 6 555
202002 222 6 556
;
run;

data need / view=need;
  set two;
  month_begin_date = mdy(mod(period,100),1,floor(period/100));
  format month_begin_date date9. ;
run;

data want (drop=month_begin_date);
  set need (keep=sourcekey sourcesystem month_begin_date rename=(month_begin_date=snapshotdate) in=in2)
      one (in=in1) ;
  by sourcesystem sourcekey snapshotdate ;
  if in2 then set need; /*Get the other variables, automatically retained */
  if in1;
  output;
  if last.sourcekey then call missing(of _all_);
run;

Notes:

 

  1. DATA NEED is a data set VIEW, not a data set FILE.  So it never gets written to disk.  Instead it is activated (sort of like instantiating a hash object) only when it is referred to in a later step.  So basically it's piping the data from table TWO to  the data want step.
  2. The first set/by reads only the relevant sort variable from need.  And only when the incoming observation gets a need (i.e. table two) record, does it "set" the other varibles in need.  The advantage here is that those other variables (sourcesystemid npl), since they are only read by this conditional set, are automatically retained until the next time this conditional set is executed - so missing periods in table 2 just inherit the most recent table 2 values.
         I.e this is "last observation carried forward".
  3. The explicit OUTPUT statement followed by the "if last.sourcekey then call missig(of _all_)" is a way to prevent sourcesystemid and npl from being carried forward from one sourcekey to the next.  Now if your table 2 ALWAYS has a monthly record matching or preceding the first corresponding sourcekey in table 1, you wouldn't need those two statements.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

9 REPLIES 9
yabwon
Onyx | Level 15
Are those tables sorted by sourcekey and snapshotdate/period ?

Bart
_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Myurathan
Quartz | Level 8
Yes, both tables are sorted by sourcekey, sourcesystemid, and snapshotperiod/period
yabwon
Onyx | Level 15

If they are sorted I would consider using MERGE statement.

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Myurathan
Quartz | Level 8
@yabwon, I am not much used to merge statement. Could you please help me with the code. Thanks in advance.
yabwon
Onyx | Level 15

try this:

Data one;
input snapshotdate sourcekey sourcesystemid value1;
datalines;
20200101 112 5 788
20200102 112 5 789
20200103 112 5 800
20200201 112 5 786
20200202 112 5 777
20200203 112 5 834
20200301 112 5 789
20200302 112 5 771
20200303 112 5 832
20200101 222 6 788
20200102 222 6 789
20200103 222 6 800
20200201 222 6 786
20200202 222 6 777
20200203 222 6 834
20200301 222 6 789
20200302 222 6 771
20200303 222 6 832
;
run;

Data two;
input period sourcekey sourcesystemid npl;
datalines;
202001 112 5 999
202002 112 5 988
202001 222 6 555
202002 222 6 556
;
run;

data oneView / view = oneView;
  set one;
  period = floor(snapshotdate/100);
run;

data want;
  merge oneView two;
  by sourcesystemid sourcekey period;
  _lag_npl_ = lag(npl);
  if missing(npl) then npl = _lag_npl_;
  drop _lag_npl_;
run;

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



mkeintz
PROC Star

Do the series in table 1 ALWAYS come in three's  (2001,2002,2003)?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Myurathan
Quartz | Level 8
@mkeintz, It will have monthly information for many years.
mkeintz
PROC Star

Editted note: just saw the info that data are sorted "by sourcekey, sourcesystemid, and snapshotperiod/period".  So the program is modified.

 

If the data are sorted by sourcekey/snapshotdate (table1) and sourcekey/period (table 2), then you don't need hash.  You can use a "last-observation-carried-forward" technique in which you (1) assign a dummy date to the period in table2 (i.e.  the first date of the month specified by period, and (2) use a SET/BY combination, complemented by a conditional SET statement.  No hash needed:

 

Data one;
input snapshotdate yymmdd8. sourcekey sourcesystemid value1;
format snapshotdate date9. ;
datalines;
20200101 112 5 788
20200102 112 5 789
20200103 112 5 800
20200201 112 5 786
20200202 112 5 777
20200203 112 5 834
20200301 112 5 789
20200302 112 5 771
20200303 112 5 832
20200101 222 6 788
20200102 222 6 789
20200103 222 6 800
20200201 222 6 786
20200202 222 6 777
20200203 222 6 834
20200301 222 6 789
20200302 222 6 771
20200303 222 6 832
;
run;

data two;
input period sourcekey sourcesystemid npl;
datalines;
202001 112 5 999
202002 112 5 988
202001 222 6 555
202002 222 6 556
;
run;

data need / view=need;
  set two;
  month_begin_date = mdy(mod(period,100),1,floor(period/100));
  format month_begin_date date9. ;
run;

data want (drop=month_begin_date);
  set need (keep=sourcekey sourcesystem month_begin_date rename=(month_begin_date=snapshotdate) in=in2)
      one (in=in1) ;
  by sourcesystem sourcekey snapshotdate ;
  if in2 then set need; /*Get the other variables, automatically retained */
  if in1;
  output;
  if last.sourcekey then call missing(of _all_);
run;

Notes:

 

  1. DATA NEED is a data set VIEW, not a data set FILE.  So it never gets written to disk.  Instead it is activated (sort of like instantiating a hash object) only when it is referred to in a later step.  So basically it's piping the data from table TWO to  the data want step.
  2. The first set/by reads only the relevant sort variable from need.  And only when the incoming observation gets a need (i.e. table two) record, does it "set" the other varibles in need.  The advantage here is that those other variables (sourcesystemid npl), since they are only read by this conditional set, are automatically retained until the next time this conditional set is executed - so missing periods in table 2 just inherit the most recent table 2 values.
         I.e this is "last observation carried forward".
  3. The explicit OUTPUT statement followed by the "if last.sourcekey then call missig(of _all_)" is a way to prevent sourcesystemid and npl from being carried forward from one sourcekey to the next.  Now if your table 2 ALWAYS has a monthly record matching or preceding the first corresponding sourcekey in table 1, you wouldn't need those two statements.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
PeterClemmensen
Tourmaline | Level 20

A hash approach

 

data want(drop=period rc);
   if _N_ = 1 then do;
      dcl hash h(dataset : "two");
      h.definekey("sourcekey", "sourcesystemid", "period");
      h.definedata("npl");
      h.definedone();
   end;

   set one;
   if 0 then set two;

   rc = h.find(key : sourcekey, key : sourcesystemid, key : int(snapshotdate/100));
run;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 1508 views
  • 4 likes
  • 4 in conversation