BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
SteveNZ
Obsidian | Level 7

Hi SASers,

I have a seemingly simple problem that I cant seem to solve. I have a set of dates per clients and I need to exclude periods they have have paid for.

I need to output the dates left over once the paid periods are taken out. I've created some test data below that highlights the permutations I've found in the live data, any help is very much appreciated as it's driving me mad!

data phone ;

infile datalines delimiter=',';

input clientid $ sharerid $ phone $ startdt enddt ;

informat startdt enddt date9. ;

format startdt enddt date9. ;

datalines;

client1,client2,555-6532,01jan2012,03jan2012

client1,client2,555-6532,09jan2012,12jan2012

client1,client2,555-6532,21jan2012,30jan2012

client1,client2,555-6532,05feb2012,14feb2012

client1,client2,555-6532,20feb2012,23feb2012

client1,client2,555-6532,26feb2012,29feb2012

;

data paid_periods ;

infile datalines delimiter=',';

input clientid $ sharerid $ paystart payend ;

informat paystart payend date9. ;

format paystart payend date9. ;

datalines;

client1,client2,06jan2012,15jan2012

client1,client2,18jan2012,24jan2012

client1,client2,27jan2012,02feb2012

client1,client2,08feb2012,11feb2012

client1,client2,17feb2012,02mar2012

;

The results for the above should be:

01Jan2012 - 03Jan2012

18Jan2012 - 20Jan2012

31Jan2012 - 02Feb2012

05Feb2012 - 07Feb2012

12Feb2012 - 14Feb2012

Many thanks in advance

Steve

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

Art,

Here's the rest of the format approach.  It picks up right after creating the format.  If I were going for clarity rather than speed, I would go about it this way:

data individual_unpaid_days;

   set phone;

   do day=startdt to enddt;

        if put(clientid || sharerid || put(day, yymmddn8.), $paid.) ne 'Paid' then output;

   end;

   drop startdt enddt;

run;

data assign_group;

   set individual_unpaid_days;

   by clientid sharerid phone;

   prior_day = lag(day);

   if first.phone or prior_day + 1 ne day then groupnum + 1;

   drop prior_day;

run;

data want;

   set assign_group;

   by clientid sharerid phone groupnum;

   if first.groupnum then start_of_range = day;

   retain start_of_range;

   if last.groupnum then do;

      end_of_range=day;

      output;

   end;

   drop day;

run;

Sorry, but all of this is untested code.  If I were going for speed rather than clarity, I might replace all three steps with:

data want;

   set phone;

   do day=startdt to enddt;

        result = put(clientid || sharerid || put(day, yymmddn8.), $paid.);

        if result ne 'Paid' then do;

           if start_of_range=. then start_of_range=day;

           end_of_range=day;

       end;

       else do;

           if start_of_range > . then output;

           start_of_range=.;

       end;

   end;

   if start_of_range > . then output;

   drop result day;

run; 

This actually allows for any number of unpaid groups of days appearing between startdt and enddt and outputs a separate observation for each group ... assuming it works!

View solution in original post

20 REPLIES 20
art297
Opal | Level 21

Steve,

I've probably misunderstood what you are looking for and, if I really do understand the problem, am probably offering a sledge hammer approach where something much simpler is likely available.

I say I may have misunderstood the problem, as my code doesn't quite provide your stated desired results.  My code only shows date ranges that are in the "phone" dataset that aren't also in the "paid_periods" dataset.

If that is really what you are looking for then try the following:

data phone ;

  infile datalines delimiter=',';

  input clientid $ sharerid $ phone $ startdt enddt ;

  informat startdt enddt date9. ;

  format startdt enddt date9. ;

  datalines;

client1,client2,555-6532,01jan2012,03jan2012

client1,client2,555-6532,09jan2012,12jan2012

client1,client2,555-6532,21jan2012,30jan2012

client1,client2,555-6532,05feb2012,14feb2012

client1,client2,555-6532,20feb2012,23feb2012

client1,client2,555-6532,26feb2012,29feb2012

;

 

data paid_periods;

  infile datalines delimiter=',';

  input clientid $ sharerid $ paystart payend ;

  informat paystart payend date9. ;

  format paystart payend date9. ;

  datalines;

client1,client2,06jan2012,15jan2012

client1,client2,18jan2012,24jan2012

client1,client2,27jan2012,02feb2012

client1,client2,08feb2012,11feb2012

client1,client2,17feb2012,02mar2012

;

data want (keep=clientid sharerid phone date_range);

  set phone;

  by clientid sharerid;

  array dates(40000);

  retain dates: lowest highest;

  if first.sharerid then do;

    call missing(of dates(*));

    call missing(lowest);

    call missing(highest);

    call missing(first);

    call missing(last);

  end;

  lowest=min(lowest,startdt);

  highest=max(highest,enddt);

  do i=startdt to enddt;

    dates(i)=i;

  end;

  if last.sharerid then do;

    do until (last.sharerid2);

      set paid_periods (rename=(clientid=clientid2 sharerid=sharerid2));

      by clientid2 sharerid2;

      do i=paystart to payend;

        if not missing(dates(i)) then call missing(dates(i));

      end;

      if last.sharerid2 then do;

        do i=lowest to highest;

          if missing(first) and not missing(dates(i)) then first=dates(i);

          if not missing(first) and missing(dates(i+1)) then do;

            last=dates(i);

            date_range=catx("-",put(first,date9.),put(last,date9.));

            output;

            call missing(first);

            call missing(last);

          end;

        end;

      end;

    end;

  end;

run;

SteveNZ
Obsidian | Level 7

Hi Arthur,

Thank you very much for this, much appreciated. You are quite correct in your code as I managed to completely stuff up the results I wanted. This is exactly what I need, really appreciate the help.

cheers

Steve

art297
Opal | Level 21

Then, if you continue to use the code, as written it will only handle dates between the range of 01jan1960 and 07jul2069.  Will the suffice for your data?

SteveNZ
Obsidian | Level 7

Hmm actually the code isn't giving me the answers I need though no fault of your code. In my efforts to simplify the test data and not cloud the issue I didn't mention that the output dates need to be by phone number ie each clientid shareid can have a number of phone numbers and the dates need to be by number.

Complicating it maybe is that they can have a cellphone and a landline and the dates of these may overlap.

Running your code over my live data gives the dates perfectly but it clumps all phone numbers together as one. Is there a way to modify it so it's by phone?

Apologies for not being clearer.

cheers

Steve

art297
Opal | Level 21

Not sure what you are asking.  In the code I proposed everything is controlled by the variables that are in the two by statements and the subsequent first. and last. variables that are referenced.

The code was written to analyze the various clientid sharerid combinations. Phone isn't included in the paid periods data, so are you saying that you want to reuse that information for each clientid sharerid phone combination?


SteveNZ
Obsidian | Level 7

Yes pretty much. For each clientid/sharerid/phone number period in the phone data set I need to exclude any payment periods from the paid_periods data set. Unfortunately the payments dont include the phone number being paid for so I need it done by phone number.

When I run your code it appears to get the periods spot on but for each clientid/sharerid I get the same phone number being listed in the output.

cheers

Steve

art297
Opal | Level 21

Then, if file size isn't a problem, here is a sledge hammer/mechanical sledge hammer approach.  It is the same as what I had proposed earlier, but expands the paid_period file to account for all phone numbers.  Let us know if that solves the problem:

data phone ;

  infile datalines delimiter=',';

  input clientid $ sharerid $ phone $ startdt enddt ;

  informat startdt enddt date9. ;

  format startdt enddt date9. ;

  datalines;

client1,client2,555-6532,01jan2012,03jan2012

client1,client2,555-6532,09jan2012,12jan2012

client1,client2,555-6532,21jan2012,30jan2012

client1,client2,555-6532,05feb2012,14feb2012

client1,client2,555-6532,20feb2012,23feb2012

client1,client2,555-6532,26feb2012,29feb2012

client1,client2,999-6532,01jan2012,03jan2012

client1,client2,999-6532,09jan2012,12jan2012

client1,client2,999-6532,21jan2012,30jan2012

client1,client2,999-6532,05feb2012,14feb2012

client1,client2,999-6532,20feb2012,23feb2012

client1,client2,999-6532,26feb2012,29feb2012

;

proc sort data=phone out=unique_phone (keep=clientid sharerid phone) nodupkey;

  by clientid sharerid phone;

run;

data paid_periods;

  infile datalines delimiter=',';

  input clientid $ sharerid $ paystart payend ;

  informat paystart payend date9. ;

  format paystart payend date9. ;

  datalines;

client1,client2,06jan2012,15jan2012

client1,client2,18jan2012,24jan2012

client1,client2,27jan2012,02feb2012

client1,client2,08feb2012,11feb2012

client1,client2,17feb2012,02mar2012

;

proc sql;

  create table paid_periods_expanded

    as select a.*,b.phone as phone2

      from paid_periods a,unique_phone b

         having a.clientid=b.clientid and

                a.sharerid=b.sharerid

        order by clientid,sharerid,phone,paystart

  ;

quit;

data want (keep=clientid sharerid phone date_range);

  set phone;

  by clientid sharerid phone;

  array dates(40000);

  retain dates: lowest highest;

  if first.phone then do;

    call missing(of dates(*));

    call missing(lowest);

    call missing(highest);

    call missing(first);

    call missing(last);

  end;

  lowest=min(lowest,startdt);

  highest=max(highest,enddt);

  do i=startdt to enddt;

    dates(i)=i;

  end;

  if last.phone then do;

    do until (last.phone2);

      set paid_periods_expanded

          (rename=(clientid=clientid2 sharerid=sharerid2));

      by clientid2 sharerid2 phone2;

      do i=paystart to payend;

        if not missing(dates(i)) then call missing(dates(i));

      end;

      if last.phone2 then do;

        do i=lowest to highest;

          if missing(first) and not missing(dates(i)) then first=dates(i);

          if not missing(first) and missing(dates(i+1)) then do;

            last=dates(i);

            date_range=catx("-",put(first,date9.),put(last,date9.));

            output;

            call missing(first);

            call missing(last);

          end;

        end;

      end;

    end;

  end;

run;

SteveNZ
Obsidian | Level 7

Hi Arthur,

Not working as required on live data sorry, I've created some new test data which is based on live stuff:

data phone ;

  infile datalines delimiter=',';

  input clientid $ sharerid $ phone $ startdt enddt ;

  informat startdt enddt date9. ;

  format startdt enddt date9. ;

  datalines;

client1,sharer1,555-6532,21Nov2011,10Dec2012

client1,sharer1,444-5655,29Nov2010,14Feb2011

client1,sharer1,333-1234,20May1993,17Aug1993

client1,sharer1,333-1234,08Sep1993,08Sep1993

client2,sharer2,666-6548,10Jul2001,12Nov2001

client2,sharer2,666-6548,10Apr2002,06Aug2002

client2,sharer2,111-5658,02Nov1992,12Aug1993

client2,sharer2,222-6589,10Jan2000,31Jan2000

client2,sharer2,777-8755,31Jan2000,03Feb2000

client2,sharer2,777-8755,25Jun2009,14Sep2009

client2,sharer2,321-6544,18Dec2003,08Apr2004

client2,sharer2,778-6589,07Jun2001,10Jul2001

client2,sharer2,999-9988,31Dec1993,26Mar1994

client2,sharer2,999-9988,28Mar1994,28Mar1994

client2,sharer2,888-7845,12Aug1993,23Aug1993

client2,sharer2,789-9876,10Aug1994,05Sep1994

client2,sharer2,789-9876,22Jun1995,10Jul1995

client2,sharer2,951-6235,08Apr2004,10Aug2004

client2,sharer2,753-1245,25Jan2007,18Jul2007

client2,sharer2,656-8989,12Nov1998,26Feb1999

client2,sharer2,656-8989,10Dec1999,10Jan2000

client2,sharer2,141-1414,23Aug2000,26Mar2001

client2,sharer2,141-1414,07Jun2001,10Jul2001

client2,sharer2,363-3636,19Jun1998,12Nov1998

client2,sharer2,852-8525,18Jun2009,02Jun2010

client2,sharer2,852-8525,20Oct2010,16May2011

client2,sharer2,852-8525,31May2012,10Dec2012

client2,sharer2,565-5656,05Sep1994,01Nov1994

client2,sharer2,565-5656,14Nov1994,14Nov1994

;

data paid_periods;

  infile datalines delimiter=',';

  input clientid $ sharerid $ paystart payend ;

  informat paystart payend date9. ;

  format paystart payend date9. ;

  datalines;

client2,sharer2,31Aug1992,23Aug1993

client2,sharer2,25Dec1993,26Mar1994

client2,sharer2,10Aug1994,01Nov1994

client2,sharer2,15Mar1995,20Mar1995

client2,sharer2,19Jun1998,18Feb1999

client2,sharer2,10Dec1999,31Jan2000

client2,sharer2,20Jun2001,12Nov2001

client2,sharer2,10Apr2002,27Jul2002

client2,sharer2,18Dec2003,06Aug2004

client2,sharer2,11Dec2006,17Jul2007

client2,sharer2,18Jun2009,01Jun2010

client2,sharer2,20Oct2010,13May2011

;

All clinet1/sharer1 data should appear as they have no paid periods. There are also anomalies in the client2/sharer2 data as well.

Any ideas?

art297
Opal | Level 21

Had to bring out the JackHammer.  The following might get around the various anomolies.

data phone ;

  infile datalines delimiter=',';

  input clientid $ sharerid $ phone $ startdt enddt ;

  informat startdt enddt date9. ;

  format startdt enddt date9. ;

  datalines;

client1,sharer1,555-6532,21Nov2011,10Dec2012

client1,sharer1,444-5655,29Nov2010,14Feb2011

client1,sharer1,333-1234,20May1993,17Aug1993

client1,sharer1,333-1234,08Sep1993,08Sep1993

client2,sharer2,666-6548,10Jul2001,12Nov2001

client2,sharer2,666-6548,10Apr2002,06Aug2002

client2,sharer2,111-5658,02Nov1992,12Aug1993

client2,sharer2,222-6589,10Jan2000,31Jan2000

client2,sharer2,777-8755,31Jan2000,03Feb2000

client2,sharer2,777-8755,25Jun2009,14Sep2009

client2,sharer2,321-6544,18Dec2003,08Apr2004

client2,sharer2,778-6589,07Jun2001,10Jul2001

client2,sharer2,999-9988,31Dec1993,26Mar1994

client2,sharer2,999-9988,28Mar1994,28Mar1994

client2,sharer2,888-7845,12Aug1993,23Aug1993

client2,sharer2,789-9876,10Aug1994,05Sep1994

client2,sharer2,789-9876,22Jun1995,10Jul1995

client2,sharer2,951-6235,08Apr2004,10Aug2004

client2,sharer2,753-1245,25Jan2007,18Jul2007

client2,sharer2,656-8989,12Nov1998,26Feb1999

client2,sharer2,656-8989,10Dec1999,10Jan2000

client2,sharer2,141-1414,23Aug2000,26Mar2001

client2,sharer2,141-1414,07Jun2001,10Jul2001

client2,sharer2,363-3636,19Jun1998,12Nov1998

client2,sharer2,852-8525,18Jun2009,02Jun2010

client2,sharer2,852-8525,20Oct2010,16May2011

client2,sharer2,852-8525,31May2012,10Dec2012

client2,sharer2,565-5656,05Sep1994,01Nov1994

client2,sharer2,565-5656,14Nov1994,14Nov1994

;

data paid_periods;

  infile datalines delimiter=',';

  input clientid $ sharerid $ paystart payend ;

  informat paystart payend date9. ;

  format paystart payend date9. ;

  datalines;

client2,sharer2,31Aug1992,23Aug1993

client2,sharer2,25Dec1993,26Mar1994

client2,sharer2,10Aug1994,01Nov1994

client2,sharer2,15Mar1995,20Mar1995

client2,sharer2,19Jun1998,18Feb1999

client2,sharer2,10Dec1999,31Jan2000

client2,sharer2,20Jun2001,12Nov2001

client2,sharer2,10Apr2002,27Jul2002

client2,sharer2,18Dec2003,06Aug2004

client2,sharer2,11Dec2006,17Jul2007

client2,sharer2,18Jun2009,01Jun2010

client2,sharer2,20Oct2010,13May2011

;

proc sort data=phone out=unique_phone (keep=clientid sharerid phone) nodupkey;

  by clientid sharerid phone;

run;

proc sort data=phone ;

  by clientid sharerid phone;

run;

proc sql;

  create table paid_periods_expanded

    as select a.paystart,a.payend,b.phone as phone2,

           b.clientid,b.sharerid

      from unique_phone b left join paid_periods a

         on a.clientid=b.clientid and

                a.sharerid=b.sharerid

        order by clientid,sharerid,phone,paystart

  ;

quit;

data paid_periods_expanded;

  set paid_periods_expanded;

  if missing(paystart) then do;

    paystart='02jan1960'd;

    payend='02jan1960'd;

  end;

run;

data want (keep=clientid sharerid phone date_range);

  set phone;

  by clientid sharerid phone;

  array dates(40000);

  retain dates: lowest highest;

  if first.phone then do;

    call missing(of dates(*));

    call missing(lowest);

    call missing(highest);

    call missing(first);

    call missing(last);

  end;

  lowest=min(lowest,startdt);

  highest=max(highest,enddt);

  do i=startdt to enddt;

    dates(i)=i;

  end;

  if last.phone then do;

    do until (last.phone2);

      set paid_periods_expanded

          (rename=(clientid=clientid2 sharerid=sharerid2));

      by clientid2 sharerid2 phone2;

      do i=paystart to payend;

        if not missing(dates(i)) then call missing(dates(i));

      end;

      if last.phone2 then do;

        do i=lowest to highest;

          if missing(first) and not missing(dates(i)) then first=dates(i);

          if not missing(first) and missing(dates(i+1)) then do;

            last=dates(i);

            date_range=catx("-",put(first,date9.),put(last,date9.));

            output;

            call missing(first);

            call missing(last);

          end;

        end;

      end;

    end;

  end;

run;

Haikuo
Onyx | Level 15

Since Art has mentioned 'JackHammer', I had to bring up Hash(). Please note that you can't use it if your second table can't be fitted into your RAM.

data phone ;

  infile datalines delimiter=',';

  input clientid $ sharerid $ phone $ startdt enddt ;

  informat startdt enddt date9. ;

  format startdt enddt date9. ;

  datalines;

client1,sharer1,555-6532,21Nov2011,10Dec2012

client1,sharer1,444-5655,29Nov2010,14Feb2011

client1,sharer1,333-1234,20May1993,17Aug1993

client1,sharer1,333-1234,08Sep1993,08Sep1993

client2,sharer2,666-6548,10Jul2001,12Nov2001

client2,sharer2,666-6548,10Apr2002,06Aug2002

client2,sharer2,111-5658,02Nov1992,12Aug1993

client2,sharer2,222-6589,10Jan2000,31Jan2000

client2,sharer2,777-8755,31Jan2000,03Feb2000

client2,sharer2,777-8755,25Jun2009,14Sep2009

client2,sharer2,321-6544,18Dec2003,08Apr2004

client2,sharer2,778-6589,07Jun2001,10Jul2001

client2,sharer2,999-9988,31Dec1993,26Mar1994

client2,sharer2,999-9988,28Mar1994,28Mar1994

client2,sharer2,888-7845,12Aug1993,23Aug1993

client2,sharer2,789-9876,10Aug1994,05Sep1994

client2,sharer2,789-9876,22Jun1995,10Jul1995

client2,sharer2,951-6235,08Apr2004,10Aug2004

client2,sharer2,753-1245,25Jan2007,18Jul2007

client2,sharer2,656-8989,12Nov1998,26Feb1999

client2,sharer2,656-8989,10Dec1999,10Jan2000

client2,sharer2,141-1414,23Aug2000,26Mar2001

client2,sharer2,141-1414,07Jun2001,10Jul2001

client2,sharer2,363-3636,19Jun1998,12Nov1998

client2,sharer2,852-8525,18Jun2009,02Jun2010

client2,sharer2,852-8525,20Oct2010,16May2011

client2,sharer2,852-8525,31May2012,10Dec2012

client2,sharer2,565-5656,05Sep1994,01Nov1994

client2,sharer2,565-5656,14Nov1994,14Nov1994

;

data paid_periods;

  infile datalines delimiter=',';

  input clientid $ sharerid $ paystart payend ;

  informat paystart payend date9. ;

  format paystart payend date9. ;

  datalines;

client2,sharer2,31Aug1992,23Aug1993

client2,sharer2,25Dec1993,26Mar1994

client2,sharer2,10Aug1994,01Nov1994

client2,sharer2,15Mar1995,20Mar1995

client2,sharer2,19Jun1998,18Feb1999

client2,sharer2,10Dec1999,31Jan2000

client2,sharer2,20Jun2001,12Nov2001

client2,sharer2,10Apr2002,27Jul2002

client2,sharer2,18Dec2003,06Aug2004

client2,sharer2,11Dec2006,17Jul2007

client2,sharer2,18Jun2009,01Jun2010

client2,sharer2,20Oct2010,13May2011

;

data want_1;

   if _n_=1 then do;

      if 0 then set paid_periods;

        declare hash h(dataset:'paid_periods',multidata:'y');

h.definekey('clientid', 'sharerid');

h.definedata('paystart', 'payend');

h.definedone();

      end;

      set phone;

_rc=h.find();

        do _rc=0 by 0 while (_rc=0);

if paystart<= startdt <= payend then startdt=payend+1;

         if  paystart<= enddt <= payend then enddt=paystart-1;

             if startdt > enddt then leave;

         _rc=h.find_next();

        end;

        if enddt>= startdt then output;

        drop paystart payend _rc;

run;

Haikuo

Astounding
PROC Star

SteveNZ,

A couple of suggestions ...

First, it's difficult to believe that you have found all the possible permutations.  It seems reasonable that a paid period might fall in the middle of a worked period, and a single work period could generate two date ranges that are unpaid.  That being said, it might be safer to generate a separate observation for each day that is unpaid, and to aggregate those single days into ranges afterwards.

Second, this type of problem can be handled fairly cleanly by creating a format.  The format would translate the paid periods into the word "Paid".  For example, assuming the data set PAID_PERIODS exists:

data paid_periods2;

   set paid_periods;

   retain fmtname '$paid' label 'Paid';

   start = clientid || sharerid || put(paystart, yymmddn8.);

   end = clientid || sharerid || put(payend, yymmddn8.);

   keep fmtname start end label;

run;

proc format cntlin=paid_periods2;

run;

This type of format could easily output individual days that haven't been paid, along these lines:

do day = startdt to enddt;

   if put(clientid || sharerid || put(day, yymmddn8.), $paid.) ne 'Paid' then output;

end;

This approach generates the list of individual days, because of my first suggestion above.  The plan would be to aggregate the individual days into ranges later.  If you are 100% certain that you will never need two ranges instead or one, then a more complex loop could be constructed.  But first, let's see if we have found the right set of rules and assumptions.

Good luck.

art297
Opal | Level 21

: JackHammer?  Not at all!  This, IMHO, is an excellent use of a hash and I liked the way that you approached it!

I tried a number of variations of the input data and, in each case, both of our suggested sets of code produced the same results.

I modified your code, just a bit, to produce the output that the OP had originally asked for:

data want;

  if _n_=1 then do;

    if 0 then set paid_periods;

    declare hash h(dataset:'paid_periods',multidata:'y');

    h.definekey('clientid', 'sharerid');

    h.definedata('paystart', 'payend');

    h.definedone();

  end;

  set phone;

  _rc=h.find();

  do _rc=0 by 0 while (_rc=0);

    if paystart<= startdt <= payend then startdt=payend+1;

    if  paystart<= enddt <= payend then enddt=paystart-1;

    if startdt > enddt then leave;

    _rc=h.find_next();

  end;

  if enddt>= startdt then do;

    date_range=catx("-",put(startdt,date9.),put(enddt,date9.));

    output;

  end;

  drop paystart payend _rc startdt enddt;

run;

: I, for one, would be interested in seeing the full format approach that produces the same result.

: Haikuo's code is much better than mine.  I think you ought to unmark mine as being correct and, instead, give the credit to Haikuo.

Astounding
PROC Star

Art,

Here's the rest of the format approach.  It picks up right after creating the format.  If I were going for clarity rather than speed, I would go about it this way:

data individual_unpaid_days;

   set phone;

   do day=startdt to enddt;

        if put(clientid || sharerid || put(day, yymmddn8.), $paid.) ne 'Paid' then output;

   end;

   drop startdt enddt;

run;

data assign_group;

   set individual_unpaid_days;

   by clientid sharerid phone;

   prior_day = lag(day);

   if first.phone or prior_day + 1 ne day then groupnum + 1;

   drop prior_day;

run;

data want;

   set assign_group;

   by clientid sharerid phone groupnum;

   if first.groupnum then start_of_range = day;

   retain start_of_range;

   if last.groupnum then do;

      end_of_range=day;

      output;

   end;

   drop day;

run;

Sorry, but all of this is untested code.  If I were going for speed rather than clarity, I might replace all three steps with:

data want;

   set phone;

   do day=startdt to enddt;

        result = put(clientid || sharerid || put(day, yymmddn8.), $paid.);

        if result ne 'Paid' then do;

           if start_of_range=. then start_of_range=day;

           end_of_range=day;

       end;

       else do;

           if start_of_range > . then output;

           start_of_range=.;

       end;

   end;

   if start_of_range > . then output;

   drop result day;

run; 

This actually allows for any number of unpaid groups of days appearing between startdt and enddt and outputs a separate observation for each group ... assuming it works!

art297
Opal | Level 21

: Nice!  Needed one extra sort routine, and I added the line to create the date_range variable, but your code ends up with the same results as Haikuo's and my code.

data phone ;

  infile datalines delimiter=',';

  input clientid $ sharerid $ phone $ startdt enddt ;

  informat startdt enddt date9. ;

  format startdt enddt date9. ;

  datalines;

client1,sharer1,555-6532,21Nov2011,10Dec2012

client1,sharer1,444-5655,29Nov2010,14Feb2011

client1,sharer1,333-1234,20May1993,17Aug1993

client1,sharer1,333-1234,08Sep1993,08Sep1993

client2,sharer2,666-6548,10Jul2001,12Nov2001

client2,sharer2,666-6548,10Apr2002,06Aug2002

client2,sharer2,111-5658,02Nov1992,12Aug1993

client2,sharer2,222-6589,10Jan2000,31Jan2000

client2,sharer2,777-8755,31Jan2000,03Feb2000

client2,sharer2,777-8755,25Jun2009,14Sep2009

client2,sharer2,321-6544,18Dec2003,08Apr2004

client2,sharer2,778-6589,07Jun2001,10Jul2001

client2,sharer2,999-9988,31Dec1993,26Mar1994

client2,sharer2,999-9988,28Mar1994,28Mar1994

client2,sharer2,888-7845,12Aug1993,23Aug1993

client2,sharer2,789-9876,10Aug1994,05Sep1994

client2,sharer2,789-9876,22Jun1995,10Jul1995

client2,sharer2,951-6235,08Apr2004,10Aug2004

client2,sharer2,753-1245,25Jan2007,18Jul2007

client2,sharer2,656-8989,12Nov1998,26Feb1999

client2,sharer2,656-8989,10Dec1999,10Jan2000

client2,sharer2,141-1414,23Aug2000,26Mar2001

client2,sharer2,141-1414,07Jun2001,10Jul2001

client2,sharer2,363-3636,19Jun1998,12Nov1998

client2,sharer2,852-8525,18Jun2009,02Jun2010

client2,sharer2,852-8525,20Oct2010,16May2011

client2,sharer2,852-8525,31May2012,10Dec2012

client2,sharer2,565-5656,05Sep1994,01Nov1994

client2,sharer2,565-5656,14Nov1994,14Nov1994

;

data paid_periods;

  infile datalines delimiter=',';

  input clientid $ sharerid $ paystart payend ;

  informat paystart payend date9. ;

  format paystart payend date9. ;

  datalines;

client2,sharer2,31Aug1992,23Aug1993

client2,sharer2,25Dec1993,26Mar1994

client2,sharer2,10Aug1994,01Nov1994

client2,sharer2,15Mar1995,20Mar1995

client2,sharer2,19Jun1998,18Feb1999

client2,sharer2,10Dec1999,31Jan2000

client2,sharer2,20Jun2001,12Nov2001

client2,sharer2,10Apr2002,27Jul2002

client2,sharer2,18Dec2003,06Aug2004

client2,sharer2,11Dec2006,17Jul2007

client2,sharer2,18Jun2009,01Jun2010

client2,sharer2,20Oct2010,13May2011

;

data paid_periods2;

   set paid_periods;

   retain fmtname '$paid' label 'Paid';

   start = clientid || sharerid || put(paystart, yymmddn8.);

   end = clientid || sharerid || put(payend, yymmddn8.);

   keep fmtname start end label;

run;

proc format cntlin=paid_periods2;

run;

data individual_unpaid_days;

   set phone;

   do day=startdt to enddt;

        if put(clientid || sharerid || put(day, yymmddn8.), $paid.) ne 'Paid' then output;

   end;

   drop startdt enddt;

run;

proc sort data=individual_unpaid_days;

  by clientid sharerid phone;

run;

data assign_group;

   set individual_unpaid_days;

   by clientid sharerid phone;

   prior_day = lag(day);

   if first.phone or prior_day + 1 ne day then groupnum + 1;

   drop prior_day;

run;

data want_2;

   set assign_group;

   by clientid sharerid phone groupnum;

   if first.groupnum then start_of_range = day;

   retain start_of_range;

   if last.groupnum then do;

      end_of_range=day;

      date_range=catx("-",put(start_of_range,date9.),put(end_of_range,date9.));

      output;

   end;

   drop day groupnum start_: end_:;

run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to choose a machine learning algorithm

Use this tutorial as a handy guide to weigh the pros and cons of these commonly used machine learning algorithms.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 20 replies
  • 1453 views
  • 6 likes
  • 4 in conversation