Hi,
I am facing problem in reading each observation from input table and compare its values against the previous record or next record based on certain conditions in the logic. Appreciate if I can get any help from this group.
My current data looks like:
***************************************
data one;
infile datalines;
input CURR_RK ORIG_RK CUST_SK VALID_DTTM: anydtdte10. ;
format VALID_DTTM date9.;
datalines;
41 2673 2342321 1-Sep-12
41 41 12559928 1-Jan-13
41 4615581 14452649 1-Feb-13
41 4615581 67316104 1-Aug-14
41 41 72971041 1-Oct-14
41 4615581 74341618 1-Nov-14
41 8828883 63829111 1-Jul-15
9217 9217 21128775 1-Jan-13
9217 10124544 15142897 1-Jan-13
9217 9217 30874351 1-Mar-13
9217 9217 71276102 1-Sep-14
9217 10124544 74885982 1-Nov-14
346372 346372 6478392 1-Sep-14
7801669 6368875 2591876 1-Dec-12
7801669 7801669 3829995 1-Jan-13
7801669 7801669 97082679 1-Aug-15
7801669 6368875 106930088 1-Oct-15
7801669 7801669 107207912 1-Nov-15
;
run;
***************************************
The logic required is something like below:
For each curr_rk by group, I need to check as below:
>>
Scenario#1:
if first.curr_rk and curr_rk = orig_rk then assign
new_sk=CUST_SK and new_dt = VALID_DTTM
Scenario#2
if first.curr_rk and curr_rk ^= orig_rk then
go down below to the next matching record (i.e. with curr_rk = orig_rk)
Bring it's CUST_SK & VALID_DTTM to replace the values of the current record.
Scenario#3:
if not first.curr_rk and curr_rk = orig_rk then assign
new_sk=CUST_SK and new_dt = VALID_DTTM
Scenario#4:
if not first.curr_rk and curr_rk ^= orig_rk then
go up above to the nearest matching record (i.e. with curr_rk = orig_rk)
Bring it's CUST_SK & VALID_DTTM to replace the values of the current record.
In this process, I have sorted the input table and used "retain" statement to fetch desired values. But challenge is my input table sometimes has "non-matching" record occuring first and matching-record is followed later on. In such cases, retain statement is not working. Hence I thought of creating two date range variables (MAX_DT_BFOR_VALID_DTTM & MIN_DT_AFTR_VALID_DTTM) based on each record's VALID_DTTM compared against its corresponding pre-match and post-match records.
Please refer the attached excel sheet with desired output results shown.
I am ok if there is a better approach with out using these variables. Also I need to create the output in Oracle.
Hope I am able to explain the issue well. Please let me know if any more information is required.
Thanks a lot in advance.
OK. Maybe I understand what you mean. You only want PRE date which is less than or equal with the current obs , Do not want any POST date ? data one; infile datalines expandtabs truncover; input CURR_RK ORIG_RK CUST_SK VALID_DTTM: anydtdte10. ; format VALID_DTTM date9.; datalines; 41 2673 2342321 1-Sep-12 41 41 12559928 1-Jan-13 41 4615581 14452649 1-Feb-13 41 4615581 67316104 1-Aug-14 41 41 72971041 1-Oct-14 41 4615581 74341618 1-Nov-14 41 8828883 63829111 1-Jul-15 9217 9217 21128775 1-Jan-13 9217 10124544 15142897 1-Jan-13 9217 9217 30874351 1-Mar-13 9217 9217 71276102 1-Sep-14 9217 10124544 74885982 1-Nov-14 346372 346372 6478392 1-Sep-14 7801669 6368875 2591876 1-Dec-12 7801669 7801669 3829995 1-Jan-13 7801669 7801669 97082679 1-Aug-15 7801669 6368875 106930088 1-Nov-15 7801669 7801669 107207912 1-Nov-15 ; run; data have; do total=1 by 1 until(last.curr_rk); set one; by curr_rk; end; do n=1 by 1 until(last.curr_rk); set one; by curr_rk; output; end; run; data want; if _n_=1 then do; if 0 then set have(rename=(CUST_SK=new_sk VALID_DTTM=new_dt)); declare hash h2(); h2.definekey('CURR_RK','n'); h2.definedata('new_sk','new_dt'); h2.definedone(); end; do until(last.curr_rk); set have; by curr_rk; if CURR_RK=ORIG_RK then do; new_sk=CUST_SK; new_dt=VALID_DTTM;h2.add(); end; end; do until(last.curr_rk); set have; by curr_rk; if CURR_RK=ORIG_RK then do;want_sk=CUST_SK;want_dt=VALID_DTTM; end; else do; do k=n-1 to 1 by -1; call missing(new_sk,new_dt); rc=h2.find(key:CURR_RK,key:k); if rc=0 then do; want_sk=new_sk; want_dt=new_dt; leave; end; end; do k=n+1 to total; call missing(new_sk,new_dt); rc=h2.find(key:CURR_RK,key:k); if rc=0 then do; if new_dt=VALID_DTTM or first.curr_rk then do; want_sk=new_sk; want_dt=new_dt; end; leave; end; end; end; output; end; h2.clear(); format want_dt date9.; drop rc new_sk new_dt n k total; run;
If you need to get data from 'later' records, how about reversing the order of the dataset, so you can use retain then?
Hi,
Thanks for the quick response.
I need both. Some times I need previous record values and some times I need later record values depending on the conditions.
Do you mean, first fetch from the previous records into a dataset and then sort it again in reverse order to fetch values from later records? Will it work?
Yes.
If you do not already have a variable that defines your order, then you can create one in the first pass by using _N_.
Then sort descending by that variable, do your reverse pass, and then sort again (without descending) to get the original order.
If you use out= in the proc sort statement to create new datasets, you can add a drop= dataset option to get rid of the "order" variable.
Here's an example to illustrate what I mean:
data have;
input value condition;
cards;
3 0
4 1
5 2
6 0
7 1
8 0
;
run;
data firstpass (drop=keepval);
set have;
retain keepval;
order = _n_;
if condition = 0 then keepval = value;
if condition = 1 then value = keepval;
run;
proc sort data=firstpass;
by descending order;
run;
data secondpass (drop=keepval);
set firstpass;
retain keepval;
if condition = 0 then keepval = value;
if condition = 2 then value = keepval;
run;
proc sort
data=secondpass
out=want (drop=order)
;
by order;
run;
proc print;
run;
In the end, you get this output:
Obs value condition 1 3 0 2 3 1 3 6 2 4 6 0 5 6 1 6 8 0
PS if you only need data from the immediately preceding observation in both directions, you can also use the lag() function.
ITs a bit oif a confusing post here, basically you need the current record, the previous record which matches, and the next record that matches, merge these three together, then perform your logic check. My question is how do you know the order of the data, does a "previous" record mean one that matches the criteria and has max(date) < current record? If so then something like:
proc sql; create table WANT as select A.*, B.VALID_DTTM as PREV_DTTM, C.VALID_DTTM as NEXT_DTTM from ONE A left join (select * from ONE group by CURR_RK having CURR_RK=A.CURR_RK and VALID_DTTM < A.VALID_DTTM and VALID_DTTM=max(VALID_DTTM)) B on A.CURR_RK=B.CURR_RK left join (select * from ONE group by CURR_RK having CURR_RK=A.CURR_RK and VALID_DTTM > A.VALID_DTTM and VALID_DTTM=min(VALID_DTTM)) C on A.CURR_RK=C.CURR_RK; quit; data want; set want; by curr_rk; if first.curr_rk and ... run;
Hi RW9,
Yes, you got the question right. I want to merge those 3 records (pre-match, current, post-match) and create two new date variables
MAX_DT_BFOR_VALID_DTTM & MIN_DT_AFTR_VALID_DTTM as shown below.
I will then use these two date variables to do the further logic.
Desired Output for me:
CURR_RK | ORIG_RK | CUST_SK | VALID_DTTM | FLAG | MAX_DT_BFOR_VALID_DTTM | MIN_DT_AFTR_VALID_DTTM | EXPECTED FINAL VALID_DTTM | EXPECTED FINAL CUST_SK |
41 | 2673 | 2342321 | 1-Sep-12 | FALSE | - | 1-Jan-13 | 1-Jan-13 | 12559928 |
41 | 41 | 12559928 | 1-Jan-13 | TRUE | 1-Jan-13 | 1-Oct-14 | 1-Jan-13 | 12559928 |
41 | 4615581 | 14452649 | 1-Feb-13 | FALSE | 1-Jan-13 | 1-Oct-14 | 1-Jan-13 | 12559928 |
41 | 4615581 | 67316104 | 1-Aug-14 | FALSE | 1-Jan-13 | 1-Oct-14 | 1-Jan-13 | 12559928 |
41 | 41 | 72971041 | 1-Oct-14 | TRUE | 1-Oct-14 | - | 1-Oct-14 | 72971041 |
41 | 4615581 | 74341618 | 1-Nov-14 | FALSE | 1-Oct-14 | - | 1-Oct-14 | 72971041 |
41 | 8828883 | 63829111 | 1-Jul-15 | FALSE | 1-Oct-14 | - | 1-Oct-14 | 72971041 |
9217 | 9217 | 21128775 | 1-Jan-13 | TRUE | 1-Jan-13 | 1-Mar-13 | 1-Jan-13 | 21128775 |
9217 | 10124544 | 15142897 | 1-Jan-13 | FALSE | 1-Jan-13 | 1-Mar-13 | 1-Jan-13 | 21128775 |
9217 | 9217 | 30874351 | 1-Mar-13 | TRUE | 1-Mar-13 | 1-Sep-14 | 1-Mar-13 | 30874351 |
9217 | 9217 | 71276102 | 1-Sep-14 | TRUE | 1-Sep-14 | 1-Nov-14 | 1-Sep-14 | 71276102 |
9217 | 10124544 | 74885982 | 1-Nov-14 | FALSE | 1-Sep-14 | - | 1-Sep-14 | 71276102 |
346372 | 346372 | 6478392 | 1-Sep-14 | TRUE | 1-Sep-14 | 1-Sep-14 | 1-Sep-14 | 6478392 |
7801669 | 6368875 | 2591876 | 1-Dec-12 | FALSE | - | 1-Jan-13 | 1-Jan-13 | 3829995 |
7801669 | 7801669 | 3829995 | 1-Jan-13 | TRUE | 1-Jan-13 | 1-Aug-15 | 1-Jan-13 | 3829995 |
7801669 | 7801669 | 97082679 | 1-Aug-15 | TRUE | 1-Aug-15 | 1-Nov-15 | 1-Aug-15 | 97082679 |
7801669 | 6368875 | 106930088 | 1-Nov-15 | FALSE | 1-Nov-15 | 1-Nov-15 | 1-Nov-15 | 107207912 |
7801669 | 7801669 | 107207912 | 1-Nov-15 | TRUE | 1-Nov-15 | - | 1-Nov-15 | 107207912 |
Thanks for the logic provided. I tried it in SAS but it's throwing CORRELATION errors.
proc sql;
create table WANT as
select A.*,
B.VALID_DTTM as PREV_DTTM,
C.VALID_DTTM as NEXT_DTTM
from ONE A
left join (select * from ONE group by CURR_RK having CURR_RK=A.CURR_RK and VALID_DTTM < A.VALID_DTTM and VALID_DTTM=max(VALID_DTTM)) B
on A.CURR_RK=B.CURR_RK
left join (select * from ONE group by CURR_RK having CURR_RK=A.CURR_RK and VALID_DTTM > A.VALID_DTTM and VALID_DTTM=min(VALID_DTTM)) C
on A.CURR_RK=C.CURR_RK;
quit;
SQL error:
ERROR: Correlated reference to column CURR_RK is not contained within a subquery.
ERROR: Correlated reference to column VALID_DTTM is not contained within a subquery.
ERROR: Correlated reference to column CURR_RK is not contained within a subquery.
ERROR: Correlated reference to column VALID_DTTM is not contained within a subquery.
The query seems to be correct but I am not sure what is causing the issue.
I need to investigate and see if it gives me the desired new date variables.
Thanks once again.
Just re-jig a bit:
proc sql; create table WANT as select A.*, (select distinct VALID_DTTM from ONE group by CURR_RK having CURR_RK=A.CURR_RK and VALID_DTTM < A.VALID_DTTM and VALID_DTTM=max(VALID_DTTM)) as PREV_DTTM, (select distinct VALID_DTTM from ONE group by CURR_RK having CURR_RK=A.CURR_RK and VALID_DTTM > A.VALID_DTTM and VALID_DTTM=min(VALID_DTTM)) as NEXT_DTTM from ONE A; quit;
No. Post data here. No one would like to download file from website. data one; infile datalines; input CURR_RK ORIG_RK CUST_SK VALID_DTTM: anydtdte10. ; format VALID_DTTM date9.; datalines; 41 2673 2342321 1-Sep-12 41 41 12559928 1-Jan-13 41 4615581 14452649 1-Feb-13 41 4615581 67316104 1-Aug-14 41 41 72971041 1-Oct-14 41 4615581 74341618 1-Nov-14 41 8828883 63829111 1-Jul-15 9217 9217 21128775 1-Jan-13 9217 10124544 15142897 1-Jan-13 9217 9217 30874351 1-Mar-13 9217 9217 71276102 1-Sep-14 9217 10124544 74885982 1-Nov-14 346372 346372 6478392 1-Sep-14 7801669 6368875 2591876 1-Dec-12 7801669 7801669 3829995 1-Jan-13 7801669 7801669 97082679 1-Aug-15 7801669 6368875 106930088 1-Oct-15 7801669 7801669 107207912 1-Nov-15 ; run; data one; set one; by curr_rk; if first.curr_rk then n=0; n+1; run; data want; if _n_=1 then do; if 0 then set one(rename=(CUST_SK=new_sk VALID_DTTM=new_dt)); declare hash h1(dataset:'one(rename=(CUST_SK=new_sk VALID_DTTM=new_dt) where=(CURR_RK=ORIG_RK))'); h1.definekey('CURR_RK','ORIG_RK'); h1.definedata('new_sk','new_dt'); h1.definedone(); if 0 then set one(keep=ORIG_RK rename=(ORIG_RK=_ORIG_RK)); declare hash h2(dataset:'one(rename=(ORIG_RK=_ORIG_RK CUST_SK=new_sk VALID_DTTM=new_dt))'); h2.definekey('CURR_RK','n'); h2.definedata('new_sk','new_dt','_ORIG_RK'); h2.definedone(); end; set one; by curr_rk; if first.curr_rk then do; if curr_rk = orig_rk then do;new_sk=CUST_SK; new_dt=VALID_DTTM;end; else do;call missing(new_sk,new_dt);rc=h1.find(key:CURR_RK,key:CURR_RK);end; end; else do; if curr_rk = orig_rk then do;new_sk=CUST_SK; new_dt=VALID_DTTM;end; else do; do k=n-1 to 1 by -1; call missing(new_sk,new_dt); rc=h2.find(key:CURR_RK,key:k); if CURR_RK=_ORIG_RK then leave; end; end; end; drop rc k n _ORIG_RK; run;
Hi Ksharp,
Thank you very much for the solution logic provided. I am able to create new_sk & new_dttm with this logic but we found a small here is a small challenge to it. For e.g. if you observe last 3 records in below table, the 3rd record which is a matching record should get as is new_sk & new_dttm. Where as since 4th record is a non-match record, we need to verify its VALID_DTTM against pre-match record and also with post-match record values. In this case, the 5th record VALID_DTTM matches to the 4th record VALID_DTTM and hence we need to get its CUST_SK, VALID_DTTM and replace the values in the 4th record. Else it should pick the values from 3rd record (pre-matching record) irrespective of its VALID_DTTM values.
Because of this complexity, we want to create two seperate date variables based on pre-match VALID_DTTM value and post-match VALID_DTTM value. Then take the min of them as FINAL_DTTM. Accordingly replace the CUST_SK values for each record.
They also want to use these dates in further jobs to implement some logic.
CURR_RK | ORIG_RK | CUST_SK | VALID_DTTM | VALID_END_DTTM | FLAG | MAX_DT_BFOR_VALID_DTTM | MIN_DT_AFTR_VALID_DTTM | EXPECTED FINAL VALID_DTTM | EXPECTED FINAL CUST_SK |
7801669 | 6368875 | 2591876 | 1-Dec-12 | 31-Oct-15 | FALSE | - | 1-Jan-13 | 1-Jan-13 | 3829995 |
7801669 | 7801669 | 3829995 | 1-Jan-13 | 31-Jul-15 | TRUE | 1-Jan-13 | 1-Aug-15 | 1-Jan-13 | 3829995 |
7801669 | 7801669 | 97082679 | 1-Aug-15 | 31-Oct-15 | TRUE | 1-Aug-15 | 1-Nov-15 | 1-Aug-15 | 97082679 |
7801669 | 6368875 | 106930088 | 1-Nov-15 | 31-Dec-47 | FALSE | 1-Nov-15 | 1-Nov-15 | 1-Nov-15 | 107207912 |
7801669 | 7801669 | 107207912 | 1-Nov-15 | 31-Dec-47 | TRUE | 1-Nov-15 | - | 1-Nov-15 | 107207912 |
Hope I am able to explain well. Please help if you have any alternative solution for this.
Thank you in advance.
So you want the nearest date no matter it PRE date or POST date? But if PRE and POST have the same different days like below , which one you gonna to pick ? 7801669 7801669 97082679 1-Nov-15 7801669 6368875 106930088 1-Nov-15 7801669 7801669 107207912 1-Nov-15 Assuming both either you could accept : data one; infile datalines expandtabs truncover; input CURR_RK ORIG_RK CUST_SK VALID_DTTM: anydtdte10. ; format VALID_DTTM date9.; datalines; 41 2673 2342321 1-Sep-12 41 41 12559928 1-Jan-13 41 4615581 14452649 1-Feb-13 41 4615581 67316104 1-Aug-14 41 41 72971041 1-Oct-14 41 4615581 74341618 1-Nov-14 41 8828883 63829111 1-Jul-15 9217 9217 21128775 1-Jan-13 9217 10124544 15142897 1-Jan-13 9217 9217 30874351 1-Mar-13 9217 9217 71276102 1-Sep-14 9217 10124544 74885982 1-Nov-14 346372 346372 6478392 1-Sep-14 7801669 6368875 2591876 1-Dec-12 7801669 7801669 3829995 1-Jan-13 7801669 7801669 97082679 1-Aug-15 7801669 6368875 106930088 1-Nov-15 7801669 7801669 107207912 1-Nov-15 ; run; proc sql; create table key as select *,max(VALID_DTTM) as max_date,min(VALID_DTTM) as min_date from one group by curr_rk order by curr_rk,valid_dttm; quit; data want; if _n_=1 then do; if 0 then set key(rename=(CUST_SK=new_sk VALID_DTTM=new_dt)); declare hash h1(dataset:'one(rename=(CUST_SK=new_sk VALID_DTTM=new_dt) where=(CURR_RK=ORIG_RK))'); h1.definekey('CURR_RK','new_dt'); h1.definedata('new_sk','new_dt'); h1.definedone(); end; set key; dif=99999; do i=min_date to max_date; call missing(new_sk,new_dt); rc=h1.find(key:CURR_RK,key:i); if rc=0 then do; abs=abs(VALID_DTTM-new_dt); if abs
Hi,
In such case, I should pick the 1st record values and place it into the 2nd record.
7801669 7801669 97082679 1-Nov-15
7801669 6368875 106930088 1-Nov-15
7801669 7801669 107207912 1-Nov-15
New data would be:
7801669 7801669 97082679 1-Nov-15
7801669 6368875 97082679 1-Nov-15
7801669 7801669 107207912 1-Nov-15
Anyway, the previous program you have sent is working fine now as I sorted the data with matching flag now.
The new problem I am facing is the HASH object execution failure due to insufficient memory in Work.
Can I push this entire code onto Oracle database processing? Will Hash programs work on Oracle using SQL pass through?
Hi Ksharp,
I tried to execute the sas code you have given (using Hash object) through DI Studio on our UAT environment but it is failing:
ERROR: Hash object added 85983216 items when memory failure occured
FATAL : Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.
ERROR: The SAS system stopped processing this step because of insufficient memory.
My Input table is having 112937794 observations and hence HASH object process is causing the failure.
I verified my "xmrlmem" option and it is currently set to 6442450944.
I am not getting any other option handle this issue. Do you have any suggestions?
OK. Use DOW skill. But you still don't answer my question yet ? data one; infile datalines expandtabs truncover; input CURR_RK ORIG_RK CUST_SK VALID_DTTM: anydtdte10. ; format VALID_DTTM date9.; datalines; 41 2673 2342321 1-Sep-12 41 41 12559928 1-Jan-13 41 4615581 14452649 1-Feb-13 41 4615581 67316104 1-Aug-14 41 41 72971041 1-Oct-14 41 4615581 74341618 1-Nov-14 41 8828883 63829111 1-Jul-15 9217 9217 21128775 1-Jan-13 9217 10124544 15142897 1-Jan-13 9217 9217 30874351 1-Mar-13 9217 9217 71276102 1-Sep-14 9217 10124544 74885982 1-Nov-14 346372 346372 6478392 1-Sep-14 7801669 6368875 2591876 1-Dec-12 7801669 7801669 3829995 1-Jan-13 7801669 7801669 97082679 1-Aug-15 7801669 6368875 106930088 1-Nov-15 7801669 7801669 107207912 1-Nov-15 ; run; proc sql; create table key as select *,max(VALID_DTTM) as max_date,min(VALID_DTTM) as min_date from one group by curr_rk order by curr_rk,valid_dttm; quit; data want; if _n_=1 then do; if 0 then set key(rename=(CUST_SK=new_sk VALID_DTTM=new_dt)); declare hash h1(hashexp:20); h1.definekey('CURR_RK','new_dt'); h1.definedata('new_sk','new_dt'); h1.definedone(); end; do until(last.curr_rk); set key; by curr_rk; if CURR_RK=ORIG_RK then do; new_sk=CUST_SK; new_dt=VALID_DTTM; rc=h1.add(); end; end; do until(last.curr_rk); set key; by curr_rk; dif=99999; do i=min_date to max_date; call missing(new_sk,new_dt); rc=h1.find(key:CURR_RK,key:i); if rc=0 then do; abs=abs(VALID_DTTM-new_dt); if abs
My code has been truncated and I can't delete it . Try this one . data one; infile datalines expandtabs truncover; input CURR_RK ORIG_RK CUST_SK VALID_DTTM: anydtdte10. ; format VALID_DTTM date9.; datalines; 41 2673 2342321 1-Sep-12 41 41 12559928 1-Jan-13 41 4615581 14452649 1-Feb-13 41 4615581 67316104 1-Aug-14 41 41 72971041 1-Oct-14 41 4615581 74341618 1-Nov-14 41 8828883 63829111 1-Jul-15 9217 9217 21128775 1-Jan-13 9217 10124544 15142897 1-Jan-13 9217 9217 30874351 1-Mar-13 9217 9217 71276102 1-Sep-14 9217 10124544 74885982 1-Nov-14 346372 346372 6478392 1-Sep-14 7801669 6368875 2591876 1-Dec-12 7801669 7801669 3829995 1-Jan-13 7801669 7801669 97082679 1-Aug-15 7801669 6368875 106930088 1-Nov-15 7801669 7801669 107207912 1-Nov-15 ; run; proc sql; create table key as select *,max(VALID_DTTM) as max_date,min(VALID_DTTM) as min_date from one group by curr_rk order by curr_rk,valid_dttm; quit; data want; if _n_=1 then do; if 0 then set key(rename=(CUST_SK=new_sk VALID_DTTM=new_dt)); declare hash h1(hashexp:20); h1.definekey('CURR_RK','new_dt'); h1.definedata('new_sk','new_dt'); h1.definedone(); end; do until(last.curr_rk); set key; by curr_rk; if CURR_RK=ORIG_RK then do; new_sk=CUST_SK; new_dt=VALID_DTTM; rc=h1.add(); end; end; do until(last.curr_rk); set key; by curr_rk; dif=99999; do i=min_date to max_date; call missing(new_sk,new_dt); rc=h1.find(key:CURR_RK,key:i); if rc=0 then do; abs=abs(VALID_DTTM-new_dt); if abs lt dif then do; dif=abs;want_sk=new_sk;want_dt=new_dt; end; end; end; output; end; h1.clear(); format want_dt date9.; drop rc dif i min_date max_date abs new_sk new_dt; run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.