Re: checking the dates with 0 rates

cathy_sas · Posted 01-22-2016 01:24 PM

Dear all,

I have a two dataset, in data one, for each id per reason taken stdt and endt. and on the other two dataset it has trgt date and rate

data one;

input id reas stdt endt;

datalines;

100 xxxx 15NOV2013 28NOV2015

100 yyyy 26FEB2014 05MAR2014

;

run;

data two;

input id tgtdt rate

100 30OCT2013 5.000

100 01NOV2013 5.000

100 04NOV2013 5.000

100 06NOV2013 5.000

100 08NOV2013 5.000

100 11NOV2013 5.000

100 13NOV2013 5.000

100 15NOV2013 5.000

100 18NOV2013 0.000

100 20NOV2013 0.000

100 22NOV2013 0.000

100 25NOV2013 0.000

100 27NOV2013 0.000

100 29NOV2013 5.000

100 10FEB2014 5.000

100 12FEB2014 5.000

100 14FEB2014 5.000

100 17FEB2014 5.000

100 19FEB2014 5.000

100 21FEB2014 5.000

100 24FEB2014 5.000

100 26FEB2014 5.000

100 28FEB2014 0.000

100 03MAR2014 0.000

100 05MAR2014 0.000

100 10MAR2014 2.500

100 14MAR2014 2.500

100 17MAR2014 2.500

100 19MAR2014 2.500

need to derive two dates, strtgtdt and endtgtdt . These date are after the stdt per each indvidaul reas and rate eq 0.

In the above example :

for reas

1)XXXX : stdt is 15NOV2013 check the data against the two dataset tgtdt , 18NOV2013 is the after the stdt, and aval is 0 so new variable strtgtdt is 18NOV2013 and endtdt(with Consecutive 0) upto 27NOV2013.pick the next date with aval not eq 0, so ENDTGTDT is 29NOV2013

want:

id reas stdt endt strtgtdt endtgtdt

100 xxxx 15NOV2013 28NOV2015 18NOV2013 29NOV2013

100 yyyy 26FEB2014 05MAR2014 28FEB2014 10MAR2014

Thanks

Cathy

FreelanceReinh · Posted 01-22-2016 02:46 PM

Hi Cathy,

Try this:

data date0;
do until(last.rate);
  set two;
  by id rate notsorted;
  if first.rate & rate=0 then strtgtdt=tgtdt;
end;
endtgtdt=tgtdt;
if strtgtdt>. then output;
drop tgtdt rate;
format strtgtdt endtgtdt date9.;
run;

proc sql;
create table want as
select x.*, strtgtdt, endtgtdt
from one x left join
(select b.*, reas
 from one a, date0 b
 where a.id=b.id & .<stdt<strtgtdt
 group by b.id, reas
 having strtgtdt=min(strtgtdt)) y
on x.id=y.id & x.reas=y.reas;
quit;

proc print data=want width=min;
run;

I'll give explanations in a separate post.

Edit: corrected the BY statement in the data step.

FreelanceReinh · Posted 01-22-2016 03:30 PM

For each ID, dataset DATE0 contains one observation per block of consecutive observations in dataset TWO with RATE=0. Variables are ID, STRTGTDT (=first TGTDT in the block) and ENDTGTDT (=last TGTDT in the block). It is assumed that dataset TWO is sorted by ID. If it is also sorted by TGTDT within each ID (as is the case in your example data), the dates STRTGTDT and ENDTGTDT are not only the "first" and "last" in the respective block in terms of their position in the dataset, but also chronologically (i.e. earliest and latest).

Now let's consider the PROC SQL step. The inline view with the alias y looks at groups of observations combined from datasets ONE and DATE0 with common ID and REAS and with the condition that STRTGTDT is a date after STDT. (In particular, cases with STDT=STRTGTDT would not be selected, but this could be changed easily if required.) Within these groups it selects those observations with the minimum STRTGTDT of the group. In your example data this minimum condition uniquely determines one observation in the group. If this was not the case in your real data, we might need to insert a "DISTINCT" into the code.

Finally, the primary SELECT statement in the step adds the date pair(s) of STRTGTDT and ENDTGTDT resulting from the inline view to the data from the matching observation of dataset ONE, where "matching" means having the same ID and REAS.

So, in case that there is no block in dataset TWO with RATE=0 and TGTDT after the STDT of a particular ID and REAS from dataset ONE, both STRTGTDT and ENDTGTDT will be missing. But there is no such case in your example data.

cathy_sas · Posted 01-22-2016 03:39 PM

Thank you So much for nice explanation!!! I modified the original post, where the endtgtdt criteria is updated!!!

where we need to consider the earliest date after the 0 rate

Thanks

Cathy

FreelanceReinh · Posted 01-22-2016 04:19 PM

And here is my updated code: Please replace the original data step by these two data steps.

data date1;
set two;
by id rate notsorted;
if first.rate;
run;

data date0;
retain strtgtdt;
set date1;
by id;
if rate=0 then strtgtdt=tgtdt;
else if strtgtdt>. then endtgtdt=tgtdt;
if strtgtdt>. & (endtgtdt>. | last.id) then do;
  output;
  strtgtdt=.;
end;
drop tgtdt rate;
format strtgtdt endtgtdt date9.;
run;

FreelanceReinh · Posted 01-22-2016 04:38 PM

Here is the updated explanation:

Dataset DATE1 simply contains the first observation from each block of consecutive observations in dataset TWO with identical ID and RATE values.

The second datastep goes through dataset DATE1. When it hits an obs. with RATE=0 it stores the TGTDT date in variable STRTGTDT. If, after STRTGTDT has been set, an obs. with RATE ne 0 is reached, the TGTDT value of this observation is written to variable ENDTGTDT. If the pair (STRTGTDT, ENDTGTDT) is complete or if the last obs. of the ID has been reached, the ID and the two date variables are written to dataset DATE0 and STRTGTDT is reset to missing. (Please note that ENDTGTDT in DATE0 can have a missing value if the last observation of the ID in dataset TWO has RATE=0.)

PGStats · Posted 01-22-2016 03:48 PM

I had a very similar proposal. I hope @cathy_sas can adapt your code to her updated requirements.

PG

cathy_sas · Posted 01-22-2016 03:34 PM

Thank you !!! it worked, but i updated the orginal post, endtgdt criteria is update. i need to pick next record of the rate recontinued value( i mean not a zero)

Thanks

Cathy

Steelers_In_DC · Posted 01-22-2016 04:04 PM

Here is my updated code:

data two2;
set two;
by id rate notsorted;
if first.rate;
if rate ne 0 then count+1;
if rate = 0 then count=1;
lagdate=lag(tgtdt);
   strtgtdt=lagdate;
   endtgtdt=tgtdt;
format strtgtdt endtgtdt date9.;
if count=2;
run;

proc sql;
create table want as
select a.id,reas,stdt,endt,strtgtdt,endtgtdt
from one a inner join
     two2 b on
stdt < strtgtdt;

proc sort data=want nodupkey;by id reas stdt endt;

Steelers_In_DC · Posted 01-22-2016 03:13 PM

Here is a solution:

data two2;
set two;
by id rate notsorted;
if first.rate or last.rate;
if rate = 0;
count+1;
if count=3 then count=1;
lagdate=lag(tgtdt);
if count = 2 then do;
   strtgtdt=lagdate;
   endtgtdt=tgtdt;
end;
if count=2;
format strtgtdt endtgtdt date9.;
run;

proc sql;
create table want as
select a.id,reas,stdt,endt,strtgtdt,endtgtdt
from one a inner join
     two2 b on
stdt < strtgtdt < endtgtdt <= endt;

proc sort data=want nodupkey;by id reas stdt endt;

FreelanceReinh · Posted 01-22-2016 06:26 PM

@cathy_sas: So, now you've got two solutions. They yield identical results with the test data you provided. However, they do not yield identical results on all possible input data similar to datasets ONE and TWO one could think of, because they make different assumptions. Hence, they may or may not yield identical results with your real data.

Therefore, I would recommend that you test the suggested solutions with more input datasets and compare the results to your expectations.

Think of special, possibly "degenerate" cases that may occur (or in fact do occur) in your real data.
Avoid oversimplified test data (e.g. would it be realistic that the RATE is constant from the very first until the last obs. before it suddenly drops to 0?)
Also, be aware of potential numeric representation issues if the RATEs are not integers or have been calculated from non-integer values. (This means: It could happen that consecutive RATEs look as if they were identical in PROC PRINT output, even in BEST32. format, but in fact they are slightly different.)

In short, try to make your test data as realistic as possible.

cathy_sas · Posted 01-22-2016 07:43 PM

Thank you for your brief explanation!!! it is really a good learning curve!!!! Yes i agreed with you my real data is more complicated than the sample data which i have it here. but definately your sample code is starting point... i am working and need to see more in to the data, any issues causing to get wrong values!!!

I have a small question , if i want to create a 2 variable prevrate and prvdt stdt < tgtdt , assign tgtdt & correcpoding rate

d reas stdt endt strtgtdt endtgtdt prvdt prevrate

100 xxxx 15NOV2013 28NOV2015 18NOV2013 29NOV2013 15NOV2013 5.000

100 yyyy 26FEB2014 05MAR2014 28FEB2014 10MAR2014 26FEB2014 5.000

Thanks

cathy

FreelanceReinh · Posted 01-23-2016 09:47 AM

Hi Cathy,

I have prepared code that incorporates your "small question."

The extended specifications were a good opportunity for me to redesign the data step that prepares the second input dataset for the pivotal PROC SQL step. It is now more flexible in view of possible future requirements.

I am not quite sure if I interpreted your new sample result data and the pertinent description "create a 2 variable prevrate and prvdt stdt < tgtdt , assign tgtdt & correcpoding rate" correctly. This is because I don't see where the inequality "stdt < tgtdt" fits in there. In the example we have stdt=prvdt in both observations, but this might be coincidence.

Please elaborate on this point if you think that the new solution suggested below does not account for it.

Here's the new code:

/* Aggregate information from each block of obs. (per ID) with constant RATE */

data blocks;
do until(last.rate);
  set two;
  by id rate notsorted;
  if first.rate then date1=tgtdt;
end;
datelast=tgtdt;
drop tgtdt;
format date: date9.;
run;

/* Focus on blocks with RATE=0 and add information from previous and next block */

data blocks0(drop=obsno rate);
set blocks;
by id;
obsno+1;
if rate=0 then do;
  pnt=obsno-1;
  if ~first.id then set blocks(rename=(date1    = prevdate1
                                       datelast = prevdatelast
                                       rate     = prevrate)) point=pnt;
  else call missing(of prev:);
  pnt=obsno+1;
  if ~last.id  then set blocks(rename=(date1    = nextdate1
                                       datelast = nextdatelast
                                       rate     = nextrate)) point=pnt;
  else call missing(of next:);
  output;
end;
drop datelast prevdate1 nextdatelast nextrate; /* not needed now, but perhaps with future specs */
run;

/* Create and show the desired output dataset */

proc sql;
create table want as
select x.*, strtgtdt, endtgtdt, prvdt, prevrate format=8.3
from one x left join
(select b.*, reas
 from one a, blocks0(rename=(date1=strtgtdt nextdate1=endtgtdt prevdatelast=prvdt)) b
 where a.id=b.id & .<stdt<strtgtdt
 group by b.id, reas
 having strtgtdt=min(strtgtdt)) y
on x.id=y.id & x.reas=y.reas;
quit;

proc print data=want width=min;
run;

Explanations will follow soon.

FreelanceReinh · Posted 01-23-2016 10:03 AM

Explanation of revised code and datasets

Instead of the previous datasets DATE1 and DATE0, we have now more versatile datasets BLOCKS and BLOCKS0.

For each ID, dataset BLOCKS contains one observation per block of consecutive observations in dataset TWO with the same RATE (be it 0 or not). Variables are ID, RATE, DATE1 (=first TGTDT in the block) and DATELAST (=last TGTDT in the block). So, it's similar to the first version of the old dataset DATE0, but, unlike DATE0, it is not restricted to blocks with RATE=0. The reason is that we need information from blocks with RATE not equal to zero.

In contrast, dataset BLOCKS0 is restricted to blocks with RATE=0 (hence its name) and has again one obs. per block. It is created by the second data step, which reads the observations from dataset BLOCKS with RATE=0 and the previous and the following obs. of each of these observations. The look-back and look-ahead techniques used are the same: second SET statement with POINT= option.

As a result, BLOCKS0 contains the ID and DATE1 ("first date") values from each block (in dataset TWO) with RATE=0 and, in addition,

the values RATE and DATELAST (see description of BLOCKS above) from the preceding block (i.e. with RATE ne 0)
the value DATE1 from the following block (again, necessarily with RATE ne 0).

I think, the names of these three additional variables are self-explanatory: PREVRATE, PREVDATELAST and NEXTDATE1. Please note that the second data step creates even more variables: PREVDATE1, NEXTDATELAST and NEXTRATE. But these three are dropped at the end, as is DATELAST from the "zero rate observation", because they are not needed in your current requirements. Obviously, we could make these pieces of information (first date of previous block, last date and rate of next block, last date of current block) available immediately, if needed in future specifications.

Of course, I've made sure that "previous" and "following" blocks are only regarded within the same ID. This means, if the very first block of an ID had RATE=0, the PREVxxxx variables would have missing values. Similarly, the NEXTxxxx variables of a "zero rate" block would have missing values, if this was the last block of the respective ID.

Only minor changes were required for the final PROC SQL step: It now retrieves data from dataset BLOCKS0 rather than DATE0. The RENAME= option used provides flexibility with regard to the names of the new variables. Three variables from BLOCKS0 are renamed to match your specifications. The fourth new variable, PREVRATE, can keep its original name. The formatting of the latter with three decimals (using format 8.3) is, of course, optional and was just suggested by your sample data.

Again, please test the solution with more sample (or real) data.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away