Re: For given ID, keep consecutive records matching two conditions

brulard · Posted 11-08-2018 01:23 PM

hello,

I've been struggling to solve the following. I have large dataset ordered by acct ID, from date, to date (sorted descending), and amount (added the last two dummy columns as shown below).

**Editing table for clarity.

The first record will always contain a negative amount, with a to date [eff_to_dt0] of 31dec9999.

Trying to obtain only, by customer ID, each (i) consecutive record(s) where (ii) the amount is negative.

The consecutive record has a TO DATE column, which would always be 1 day prior, to preceding record's FROM DATE column.

Data have:

ID	eff_from_dt0	eff_to_dt0	crnt_bal_amt0	GROUP_ID	ROW
000000090	22-Jun-08	31-Dec-99	-3.62	8	1731
000000079	22-Jun-08	31-Dec-99	-1926.52	9	1732
000000002	4-Oct-18	31-Dec-99	-17.11	10	1733
000000002	4-Sep-18	3-Oct-18	-18.57	10	1734
000000002	4-Aug-18	3-Sep-18	-20.03	10	1735
000000002	4-Jul-18	3-Aug-18	-21.49	10	1736
000000002	28-Dec-17	3-Jul-18	-22.95	10	1737
000000002	27-Dec-17	27-Dec-17	-56.19	10	1738
000000002	18-Nov-17	26-Dec-17	-51.19	10	1739
000000002	24-Apr-17	17-Nov-17	-49.38	10	1740
000000002	27-Mar-17	23-Apr-17	-79.38	10	1741
000000002	18-Mar-17	26-Mar-17	101.5	10	1742
000000002	6-Mar-17	17-Mar-17	99.38	10	1743
000000002	9-Feb-17	5-Mar-17	79.38	10	1744
000000002	30-Jan-17	8-Feb-17	47.7	10	1745
000000069	20-Jun-09	31-Dec-99	-6819.53	11	1888
000000069	22-Jun-08	19-Jun-09	6164.97	11	1889
000000013	12-Nov-17	31-Dec-99	-23.96	12	1890
000000013	12-Oct-17	11-Nov-17	-16.62	12	1891
000000013	28-Sep-17	11-Oct-17	-20	12	1892
000000013	12-Sep-17	27-Sep-17	460.74	12	1893
000000013	29-Aug-17	11-Sep-17	449.34	12	1894
000000013	17-Aug-17	28-Aug-17	949.34	12	1895

Data want:

ID	eff_from_dt0	eff_to_dt0	crnt_bal_amt0	GROUP_ID	ROW
000000090	22-Jun-08	31-Dec-99	-3.62	8	1731
000000079	22-Jun-08	31-Dec-99	-1926.52	9	1732
000000002	4-Oct-18	31-Dec-99	-17.11	10	1733
000000002	4-Sep-18	3-Oct-18	-18.57	10	1734
000000002	4-Aug-18	3-Sep-18	-20.03	10	1735
000000002	4-Jul-18	3-Aug-18	-21.49	10	1736
000000002	28-Dec-17	3-Jul-18	-22.95	10	1737
000000002	27-Dec-17	27-Dec-17	-56.19	10	1738
000000002	18-Nov-17	26-Dec-17	-51.19	10	1739
000000002	24-Apr-17	17-Nov-17	-49.38	10	1740
000000002	27-Mar-17	23-Apr-17	-79.38	10	1741
000000069	20-Jun-09	31-Dec-99	-6819.53	11	1888
000000013	12-Nov-17	31-Dec-99	-23.96	12	1890
000000013	12-Oct-17	11-Nov-17	-16.62	12	1891
000000013	28-Sep-17	11-Oct-17	-20	12	1892

*** As suggested, i'm adding a large sample in the attached csv file, already sorted. VAR Group_ID marks each new distinct ID, which with first record always have a negative balance, and the TO DATE of 31dec9999 (ie, current). Thanks

We tried using LAG but as indicated elsewhere it led to missing records. Any help is appreciated. Thanks,

brulard · Posted 11-08-2018 01:48 PM

Hi novinosrin, thanks for tip. I put an asterisk next to the desired record:

Id amount toDate;

*001 -1 9999
*001 -2 9998
001 3 9997
001 -3 9996
*002 -3 9999
*002 -3 9998
*003 -2 9997
003 2 9996
*004 -2 9999
*004 -1 9998
*004 -3 9997
004 3 9996
004 -3 9995
*005 -3 9999
*005 -1 9998
*005 -2 9997
005 0 9996
005 -2 9995

novinosrin · Posted 11-08-2018 01:58 PM

data have;
input Id amount toDate;
datalines;
001 -1 9999
001 -2 9998
001 3 9997
001 -3 9996
002 -3 9999
002 -3 9998
003 -2 9997
003 2 9996
004 -2 9999
004 -1 9998
004 -3 9997
004 3 9996
004 -3 9995
005 -3 9999
005 -1 9998
005 -2 9997
005 0 9996
005 -2 9995
run;
data want;
set have;
by id;
if first.id and amount<0 or  amount<0 and lag(amount)<0 and id=lag(id) ;
run;

brulard · Posted 11-08-2018 02:02 PM

Thank you sir... this looks promising, will just test it out on my expanded dataset...

novinosrin · Posted 11-08-2018 02:02 PM

@brulard Thank you for clarifying

Ok Test and let me know sir/mam

brulard · Posted 11-08-2018 02:27 PM

Hi,

This is similar to our own attempt using LAG, where testing using small sample worked, but when applying to larger sample it didn’t.

Example. I’ve recoded your code here for my real data:

data temp;

set a.all (obs=60000);

by SRC_SYS_ACCT_ID /*descending eff_to_dt0*/;

if first.SRC_SYS_ACCT_ID and crnt_bal_amt0<0 or crnt_bal_amt0<0 and lag(crnt_bal_amt0)<0

and SRC_SYS_ACCT_ID=lag(SRC_SYS_ACCT_ID);

run;

data tt1;set temp (where=(SRC_SYS_ACCT_ID='000000xxxxxx'));run;

data tt2;set a.all (where=(SRC_SYS_ACCT_ID='000000xxxxxx'));run;

Below is a visual of the output:

Here is real data, where only the first record should be outputted.

Note the eff from dt0 precedes the eff to dt0 in the next record.

Table TT2

SRC_SYS_ACCT_ID	eff_from_dt0	eff_to_dt0	crnt_bal_amt0
000000xxxxxx	21mar2018	31dec9999	-0.110
000000xxxxxx	06MAR2018	20MAR2018	128.890

Here is output using ur suggestion.There were in total 6 records generated, and note the break in the eff from dt0 , to the eff to dt0.

Table TT1

SRC_SYS_ACCT_ID	eff_from_dt0	eff_to_dt0	crnt_bal_amt0
000000xxxxxx	21mar2018	31dec9999	-0.110
000000xxxxxx	29JAN2013	07FEB2013	-120.910

Any thoughts or suggestion? Apologies if my issue description was not clear, if you have a tip, pls let me know.

Thank you

novinosrin · Posted 11-08-2018 02:38 PM

Hmm the logic i wrote didn't have anything to do with the dates at all , and your real data has effective from date and effective to date.

So what concerns me is that does those dates have to be given serious consideration to the logic. Can you verify the business requirement, and make a better and representative sample(a good mock data) so I can play with.

Your real data rings the bell of some transaction data that I was part of transaction banking analytics couple of years ago which makes me think other variables may have be taken into account in the logic. Well, at this point, let me think through if something I can simulate and test with some reasoning and in the meantime, I am sure somebody else will perhaps give us the much needed solution.

Also, if you could get more details, that can only help. (IMHO, may be this is all one needs and my dumb brain isn't grasping lol) let's see

brulard · Posted 11-08-2018 02:44 PM

No no lol, you are very smart and I appreciate the posts and effort you put into helping folks such as myself that have less experience learn (and be patient in the process)...
Let me see tomorrow if can put a larger sample data set together... Also perhaps using LAG approach, may not be the ticket home.

novinosrin · Posted 11-08-2018 02:46 PM

Lol luckily we have many people who answer here . 🙂 Anyways, something should work out and I will be stay focused until you close this thread

r_behata · Posted 11-08-2018 02:37 PM

data have;
input Id amount toDate;
datalines;
001 -1 9999
001 -2 9998
001 3 9997
001 -3 9996
002 -3 9999
002 -3 9998
003 -2 9997
003 2 9996
004 -2 9999
004 -1 9998
004 -3 9997
004 3 9996
004 -3 9995
005 -3 9999
005 -1 9998
005 -2 9997
005 0 9996
005 -2 9995
;
run;

PROC SORT DATA=have out=have2; 
BY id descending todate; 
RUN;


data want;
	set have2;
	by id;
	
	tmp_=lag(amount);
	if first.id then	tmp=amount;
		else tmp=tmp_;

	if amount < 0 and tmp < 0 then output;

	drop tmp_ tmp;
run;

brulard · Posted 11-08-2018 02:54 PM

Hi r_behata, and thanks for the suggested code...
When applying to a larger dataset, it doesn't properly output, similar to my answer to novinosrin.
I will edit my POST to add an extra column tomorrow, and attempt to provide a larger dataset.

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away