Please see this question for context:
The answer given works perfectly, but file size is an issue: with large files it blows out considerably. I'm wondering if the approach below might be used instead, as it is considerably lighter on resources. I just can't work out how to get the date periods from it. I have a set of dates per client and I need to exclude the periods they have paid for, and need to output the exact dates. Sample data and code to date are below. Many thanks in advance!
data phone ;
infile datalines delimiter=',';
input clientid $ sharerid $ phone $ startdt enddt ;
informat startdt enddt date9. ;
format startdt enddt date9. ;
datalines;
client1,sharer1,555-6532,21Nov2011,10Dec2012
client1,sharer1,444-5655,29Nov2010,14Feb2011
client1,sharer1,333-1234,20May1993,17Aug1993
client1,sharer1,333-1234,08Sep1993,08Sep1993
client2,sharer2,666-6548,10Jul2001,12Nov2001
client2,sharer2,666-6548,10Apr2002,06Aug2002
client2,sharer2,111-5658,02Nov1992,12Aug1993
client2,sharer2,222-6589,10Jan2000,31Jan2000
client2,sharer2,777-8755,31Jan2000,03Feb2000
client2,sharer2,777-8755,25Jun2009,14Sep2009
client2,sharer2,321-6544,18Dec2003,08Apr2004
client2,sharer2,778-6589,07Jun2001,10Jul2001
client2,sharer2,999-9988,31Dec1993,26Mar1994
client2,sharer2,999-9988,28Mar1994,28Mar1994
client2,sharer2,888-7845,12Aug1993,23Aug1993
client2,sharer2,789-9876,10Aug1994,05Sep1994
client2,sharer2,789-9876,22Jun1995,10Jul1995
client2,sharer2,951-6235,08Apr2004,10Aug2004
client2,sharer2,753-1245,25Jan2007,18Jul2007
client2,sharer2,656-8989,12Nov1998,26Feb1999
client2,sharer2,656-8989,10Dec1999,10Jan2000
client2,sharer2,141-1414,23Aug2000,26Mar2001
client2,sharer2,141-1414,07Jun2001,10Jul2001
client2,sharer2,363-3636,19Jun1998,12Nov1998
client2,sharer2,852-8525,18Jun2009,02Jun2010
client2,sharer2,852-8525,20Oct2010,16May2011
client2,sharer2,852-8525,31May2012,10Dec2012
client2,sharer2,565-5656,05Sep1994,01Nov1994
client2,sharer2,565-5656,14Nov1994,14Nov1994
client2,sharer2,221-2212,01jan2012,30jan2012
;
data paid_periods;
infile datalines delimiter=',';
input clientid $ sharerid $ paystart payend ;
informat paystart payend date9. ;
format paystart payend date9. ;
datalines;
client2,sharer2,31Aug1992,23Aug1993
client2,sharer2,25Dec1993,26Mar1994
client2,sharer2,10Aug1994,01Nov1994
client2,sharer2,15Mar1995,20Mar1995
client2,sharer2,19Jun1998,18Feb1999
client2,sharer2,10Dec1999,31Jan2000
client2,sharer2,20Jun2001,12Nov2001
client2,sharer2,10Apr2002,27Jul2002
client2,sharer2,18Dec2003,06Aug2004
client2,sharer2,11Dec2006,17Jul2007
client2,sharer2,18Jun2009,01Jun2010
client2,sharer2,20Oct2010,13May2011
client2,sharer2,14jan2012,21jan2012
;
proc sql;
create table ph as
select distinct clientid,sharerid,phone
from phone
;
create table pays as
select distinct rs.*,ph.phone
from paid_periods rs, ph
where rs.clientid=ph.clientid and rs.sharerid=ph.sharerid
;
quit;
data phprd(keep=phone clientid sharerid phone_flag date) ;
set phone ;
phone_flag=1;
date=startdt;
output;
phone_flag=0;
date=enddt;
output;
run;
proc sort data=phprd;
by phone clientid sharerid date;
run;
data payprd(keep = phone clientid sharerid pay_flag date) ;
set pays;
pay_flag=1;
date=paystart;
output;
pay_flag=0;
date=payend;
output;
run;
proc sort data=payprd;
by phone clientid sharerid date;
run;
data timeline ;
set payprd phprd;
format date ddmmyy10.; /* clientid and sharerid are character, so a numeric format like 12. cannot apply to them */
by phone clientid sharerid date;
run;
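Not SAS, but the interval logic the timeline step above is reaching for can be sketched outside SAS to check the expected output. Here is an illustrative Python version (the function name and data layout are my own, not from the thread) that subtracts the paid periods from each phone period and emits the exact unpaid date ranges:

```python
from datetime import date, timedelta

def subtract_periods(phone_periods, paid_periods):
    """For each phone period, return the sub-periods not covered by any paid
    period. Periods are inclusive (start, end) date tuples, mirroring the
    startdt/enddt and paystart/payend columns above."""
    results = []
    for start, end in phone_periods:
        cursor = start  # the earliest date of the phone period not yet accounted for
        for pstart, pend in sorted(paid_periods):
            if pend < cursor or pstart > end:
                continue  # this paid period does not overlap what is left
            if pstart > cursor:
                # Unpaid gap before this paid period begins.
                results.append((cursor, pstart - timedelta(days=1)))
            cursor = max(cursor, pend + timedelta(days=1))
            if cursor > end:
                break
        if cursor <= end:
            results.append((cursor, end))  # unpaid tail after the last paid period
    return results

# Phone 221-2212 runs 01Jan2012-30Jan2012; the paid period 14Jan2012-21Jan2012
# carves it into two unpaid pieces: 01Jan-13Jan2012 and 22Jan-30Jan2012.
gaps = subtract_periods([(date(2012, 1, 1), date(2012, 1, 30))],
                        [(date(2012, 1, 14), date(2012, 1, 21))])
```

The same sweep could be written as a SAS data step over the sorted timeline dataset, carrying a "currently paid" flag by phone/client and outputting a row each time the flag drops while the phone period is still open.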
cheers
Steve
Does my original jackhammer approach blow out your system's memory as well? It shouldn't: if it works for one record, it should work for any number of records.
How many records do you have in each of the two files?
It varies but can be over 4,000, plus the periods can go right back to the 1990s, which causes huge file bloat.
Hiya, never thought of that and yes you're quite correct. Used the Jack Hammer approach on my live data and it worked a charm. Thank you very much again.
cheers
Steve
That must be why they invented JackHammers! Not applicable for all tasks but, sometimes, definitely the right tool for the job.
Steve,
A couple of things to consider as well ...
Memory is cheap, at least where computers are concerned. Buy more?
The memory requirements can be reduced by breaking up the data into batches. Subsets would be based on ranges of clientid, for example, but make sure the same client IDs appear in both data sets.
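To illustrate the batching idea (a hypothetical helper, not SAS), the distinct client IDs can be split into contiguous ranges so each pass only loads the matching subset of both data sets:

```python
def client_batches(client_ids, batch_size):
    """Split the distinct client IDs into fixed-size batches. Sorting first
    keeps each batch a contiguous range, so the same client IDs land in the
    same pass for both the phone and paid_periods data."""
    ids = sorted(set(client_ids))
    for i in range(0, len(ids), batch_size):
        yield ids[i:i + batch_size]

batches = list(client_batches(["client2", "client1", "client3", "client1"], 2))
```

In SAS terms, each batch would become a WHERE clause (e.g. on a clientid range) applied identically to both input data sets before running the heavy step.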