hello,
I've been struggling to solve the following. I have large dataset ordered by acct ID, from date, to date (sorted descending), and amount (added the last two dummy columns as shown below).
**Editing table for clarity.
The first record will always contain a negative amount, with a to date [eff_to_dt0] of 31dec9999.
Trying to obtain only, by customer ID, each (i) consecutive record(s) where (ii) the amount is negative.
The consecutive record has a TO DATE column, which would always be 1 day prior, to preceding record's FROM DATE column.
Data have:
ID | eff_from_dt0 | eff_to_dt0 | crnt_bal_amt0 | GROUP_ID | ROW |
000000090 | 22-Jun-08 | 31-Dec-99 | -3.62 | 8 | 1731 |
000000079 | 22-Jun-08 | 31-Dec-99 | -1926.52 | 9 | 1732 |
000000002 | 4-Oct-18 | 31-Dec-99 | -17.11 | 10 | 1733 |
000000002 | 4-Sep-18 | 3-Oct-18 | -18.57 | 10 | 1734 |
000000002 | 4-Aug-18 | 3-Sep-18 | -20.03 | 10 | 1735 |
000000002 | 4-Jul-18 | 3-Aug-18 | -21.49 | 10 | 1736 |
000000002 | 28-Dec-17 | 3-Jul-18 | -22.95 | 10 | 1737 |
000000002 | 27-Dec-17 | 27-Dec-17 | -56.19 | 10 | 1738 |
000000002 | 18-Nov-17 | 26-Dec-17 | -51.19 | 10 | 1739 |
000000002 | 24-Apr-17 | 17-Nov-17 | -49.38 | 10 | 1740 |
000000002 | 27-Mar-17 | 23-Apr-17 | -79.38 | 10 | 1741 |
000000002 | 18-Mar-17 | 26-Mar-17 | 101.5 | 10 | 1742 |
000000002 | 6-Mar-17 | 17-Mar-17 | 99.38 | 10 | 1743 |
000000002 | 9-Feb-17 | 5-Mar-17 | 79.38 | 10 | 1744 |
000000002 | 30-Jan-17 | 8-Feb-17 | 47.7 | 10 | 1745 |
000000069 | 20-Jun-09 | 31-Dec-99 | -6819.53 | 11 | 1888 |
000000069 | 22-Jun-08 | 19-Jun-09 | 6164.97 | 11 | 1889 |
000000013 | 12-Nov-17 | 31-Dec-99 | -23.96 | 12 | 1890 |
000000013 | 12-Oct-17 | 11-Nov-17 | -16.62 | 12 | 1891 |
000000013 | 28-Sep-17 | 11-Oct-17 | -20 | 12 | 1892 |
000000013 | 12-Sep-17 | 27-Sep-17 | 460.74 | 12 | 1893 |
000000013 | 29-Aug-17 | 11-Sep-17 | 449.34 | 12 | 1894 |
000000013 | 17-Aug-17 | 28-Aug-17 | 949.34 | 12 | 1895 |
Data want:
ID | eff_from_dt0 | eff_to_dt0 | crnt_bal_amt0 | GROUP_ID | ROW |
000000090 | 22-Jun-08 | 31-Dec-99 | -3.62 | 8 | 1731 |
000000079 | 22-Jun-08 | 31-Dec-99 | -1926.52 | 9 | 1732 |
000000002 | 4-Oct-18 | 31-Dec-99 | -17.11 | 10 | 1733 |
000000002 | 4-Sep-18 | 3-Oct-18 | -18.57 | 10 | 1734 |
000000002 | 4-Aug-18 | 3-Sep-18 | -20.03 | 10 | 1735 |
000000002 | 4-Jul-18 | 3-Aug-18 | -21.49 | 10 | 1736 |
000000002 | 28-Dec-17 | 3-Jul-18 | -22.95 | 10 | 1737 |
000000002 | 27-Dec-17 | 27-Dec-17 | -56.19 | 10 | 1738 |
000000002 | 18-Nov-17 | 26-Dec-17 | -51.19 | 10 | 1739 |
000000002 | 24-Apr-17 | 17-Nov-17 | -49.38 | 10 | 1740 |
000000002 | 27-Mar-17 | 23-Apr-17 | -79.38 | 10 | 1741 |
000000069 | 20-Jun-09 | 31-Dec-99 | -6819.53 | 11 | 1888 |
000000013 | 12-Nov-17 | 31-Dec-99 | -23.96 | 12 | 1890 |
000000013 | 12-Oct-17 | 11-Nov-17 | -16.62 | 12 | 1891 |
000000013 | 28-Sep-17 | 11-Oct-17 | -20 | 12 | 1892 |
*** As suggested, i'm adding a large sample in the attached csv file, already sorted. VAR Group_ID marks each new distinct ID, which with first record always have a negative balance, and the TO DATE of 31dec9999 (ie, current). Thanks
We tried using LAG but as indicated elsewhere it led to missing records. Any help is appreciated. Thanks,
Hi novinosrin, thanks for tip. I put an asterisk next to the desired record:
Id amount toDate;
*001 -1 9999
*001 -2 9998
001 3 9997
001 -3 9996
*002 -3 9999
*002 -3 9998
*003 -2 9997
003 2 9996
*004 -2 9999
*004 -1 9998
*004 -3 9997
004 3 9996
004 -3 9995
*005 -3 9999
*005 -1 9998
*005 -2 9997
005 0 9996
005 -2 9995
data have;
input Id amount toDate;
datalines;
001 -1 9999
001 -2 9998
001 3 9997
001 -3 9996
002 -3 9999
002 -3 9998
003 -2 9997
003 2 9996
004 -2 9999
004 -1 9998
004 -3 9997
004 3 9996
004 -3 9995
005 -3 9999
005 -1 9998
005 -2 9997
005 0 9996
005 -2 9995
run;
data want;
set have;
by id;
if first.id and amount<0 or amount<0 and lag(amount)<0 and id=lag(id) ;
run;
Hi,
This is similar to our own attempt using LAG, where testing using small sample worked, but when applying to larger sample it didn’t.
Example. I’ve recoded your code here for my real data:
data temp;
set a.all (obs=60000);
by SRC_SYS_ACCT_ID /*descending eff_to_dt0*/;
if first.SRC_SYS_ACCT_ID and crnt_bal_amt0<0 or crnt_bal_amt0<0 and lag(crnt_bal_amt0)<0
and SRC_SYS_ACCT_ID=lag(SRC_SYS_ACCT_ID);
run;
data tt1;set temp (where=(SRC_SYS_ACCT_ID='000000xxxxxx'));run;
data tt2;set a.all (where=(SRC_SYS_ACCT_ID='000000xxxxxx'));run;
Below is a visual of the output:
Here is real data, where only the first record should be outputted. Note the eff from dt0 precedes the eff to dt0 in the next record.
Table TT2 |
| ||||||||||||
Here is output using ur suggestion.There were in total 6 records generated, and note the break in the eff from dt0 , to the eff to dt0.
Table TT1 |
|
Any thoughts or suggestion? Apologies if my issue description was not clear, if you have a tip, pls let me know.
Thank you
Hmm the logic i wrote didn't have anything to do with the dates at all , and your real data has effective from date and effective to date.
So what concerns me is that does those dates have to be given serious consideration to the logic. Can you verify the business requirement, and make a better and representative sample(a good mock data) so I can play with.
Your real data rings the bell of some transaction data that I was part of transaction banking analytics couple of years ago which makes me think other variables may have be taken into account in the logic. Well, at this point, let me think through if something I can simulate and test with some reasoning and in the meantime, I am sure somebody else will perhaps give us the much needed solution.
Also, if you could get more details, that can only help. (IMHO, may be this is all one needs and my dumb brain isn't grasping lol) let's see
Lol luckily we have many people who answer here . 🙂 Anyways, something should work out and I will be stay focused until you close this thread
data have; input Id amount toDate; datalines; 001 -1 9999 001 -2 9998 001 3 9997 001 -3 9996 002 -3 9999 002 -3 9998 003 -2 9997 003 2 9996 004 -2 9999 004 -1 9998 004 -3 9997 004 3 9996 004 -3 9995 005 -3 9999 005 -1 9998 005 -2 9997 005 0 9996 005 -2 9995 ; run; PROC SORT DATA=have out=have2; BY id descending todate; RUN; data want; set have2; by id; tmp_=lag(amount); if first.id then tmp=amount; else tmp=tmp_; if amount < 0 and tmp < 0 then output; drop tmp_ tmp; run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.