Dear all,
I have a two dataset, in data one, for each id per reason taken stdt and endt. and on the other two dataset it has trgt date and rate
data one;
input id reas stdt endt;
datalines;
100 xxxx 15NOV2013 28NOV2015
100 yyyy 26FEB2014 05MAR2014
;
run;
data two;
input id tgtdt rate
100 30OCT2013 5.000
100 01NOV2013 5.000
100 04NOV2013 5.000
100 06NOV2013 5.000
100 08NOV2013 5.000
100 11NOV2013 5.000
100 13NOV2013 5.000
100 15NOV2013 5.000
100 18NOV2013 0.000
100 20NOV2013 0.000
100 22NOV2013 0.000
100 25NOV2013 0.000
100 27NOV2013 0.000
100 29NOV2013 5.000
100 10FEB2014 5.000
100 12FEB2014 5.000
100 14FEB2014 5.000
100 17FEB2014 5.000
100 19FEB2014 5.000
100 21FEB2014 5.000
100 24FEB2014 5.000
100 26FEB2014 5.000
100 28FEB2014 0.000
100 03MAR2014 0.000
100 05MAR2014 0.000
100 10MAR2014 2.500
100 14MAR2014 2.500
100 17MAR2014 2.500
100 19MAR2014 2.500
need to derive two dates, strtgtdt and endtgtdt . These date are after the stdt per each indvidaul reas and rate eq 0.
In the above example :
for reas
1)XXXX : stdt is 15NOV2013 check the data against the two dataset tgtdt , 18NOV2013 is the after the stdt, and aval is 0 so new variable strtgtdt is 18NOV2013 and endtdt(with Consecutive 0) upto 27NOV2013.pick the next date with aval not eq 0, so ENDTGTDT is 29NOV2013
want:
id reas stdt endt strtgtdt endtgtdt
100 xxxx 15NOV2013 28NOV2015 18NOV2013 29NOV2013
100 yyyy 26FEB2014 05MAR2014 28FEB2014 10MAR2014
Thanks
Cathy
Hi Cathy,
Try this:
data date0;
do until(last.rate);
set two;
by id rate notsorted;
if first.rate & rate=0 then strtgtdt=tgtdt;
end;
endtgtdt=tgtdt;
if strtgtdt>. then output;
drop tgtdt rate;
format strtgtdt endtgtdt date9.;
run;
proc sql;
create table want as
select x.*, strtgtdt, endtgtdt
from one x left join
(select b.*, reas
from one a, date0 b
where a.id=b.id & .<stdt<strtgtdt
group by b.id, reas
having strtgtdt=min(strtgtdt)) y
on x.id=y.id & x.reas=y.reas;
quit;
proc print data=want width=min;
run;
I'll give explanations in a separate post.
Edit: corrected the BY statement in the data step.
For each ID, dataset DATE0 contains one observation per block of consecutive observations in dataset TWO with RATE=0. Variables are ID, STRTGTDT (=first TGTDT in the block) and ENDTGTDT (=last TGTDT in the block). It is assumed that dataset TWO is sorted by ID. If it is also sorted by TGTDT within each ID (as is the case in your example data), the dates STRTGTDT and ENDTGTDT are not only the "first" and "last" in the respective block in terms of their position in the dataset, but also chronologically (i.e. earliest and latest).
Now let's consider the PROC SQL step. The inline view with the alias y looks at groups of observations combined from datasets ONE and DATE0 with common ID and REAS and with the condition that STRTGTDT is a date after STDT. (In particular, cases with STDT=STRTGTDT would not be selected, but this could be changed easily if required.) Within these groups it selects those observations with the minimum STRTGTDT of the group. In your example data this minimum condition uniquely determines one observation in the group. If this was not the case in your real data, we might need to insert a "DISTINCT" into the code.
Finally, the primary SELECT statement in the step adds the date pair(s) of STRTGTDT and ENDTGTDT resulting from the inline view to the data from the matching observation of dataset ONE, where "matching" means having the same ID and REAS.
So, in case that there is no block in dataset TWO with RATE=0 and TGTDT after the STDT of a particular ID and REAS from dataset ONE, both STRTGTDT and ENDTGTDT will be missing. But there is no such case in your example data.
Thank you So much for nice explanation!!! I modified the original post, where the endtgtdt criteria is updated!!!
where we need to consider the earliest date after the 0 rate
Thanks
Cathy
And here is my updated code: Please replace the original data step by these two data steps.
data date1;
set two;
by id rate notsorted;
if first.rate;
run;
data date0;
retain strtgtdt;
set date1;
by id;
if rate=0 then strtgtdt=tgtdt;
else if strtgtdt>. then endtgtdt=tgtdt;
if strtgtdt>. & (endtgtdt>. | last.id) then do;
output;
strtgtdt=.;
end;
drop tgtdt rate;
format strtgtdt endtgtdt date9.;
run;
Here is the updated explanation:
Dataset DATE1 simply contains the first observation from each block of consecutive observations in dataset TWO with identical ID and RATE values.
The second datastep goes through dataset DATE1. When it hits an obs. with RATE=0 it stores the TGTDT date in variable STRTGTDT. If, after STRTGTDT has been set, an obs. with RATE ne 0 is reached, the TGTDT value of this observation is written to variable ENDTGTDT. If the pair (STRTGTDT, ENDTGTDT) is complete or if the last obs. of the ID has been reached, the ID and the two date variables are written to dataset DATE0 and STRTGTDT is reset to missing. (Please note that ENDTGTDT in DATE0 can have a missing value if the last observation of the ID in dataset TWO has RATE=0.)
I had a very similar proposal. I hope @cathy_sas can adapt your code to her updated requirements.
Thank you !!! it worked, but i updated the orginal post, endtgdt criteria is update. i need to pick next record of the rate recontinued value( i mean not a zero)
Thanks
Cathy
Here is my updated code:
data two2;
set two;
by id rate notsorted;
if first.rate;
if rate ne 0 then count+1;
if rate = 0 then count=1;
lagdate=lag(tgtdt);
strtgtdt=lagdate;
endtgtdt=tgtdt;
format strtgtdt endtgtdt date9.;
if count=2;
run;
proc sql;
create table want as
select a.id,reas,stdt,endt,strtgtdt,endtgtdt
from one a inner join
two2 b on
stdt < strtgtdt;
proc sort data=want nodupkey;by id reas stdt endt;
Here is a solution:
data two2;
set two;
by id rate notsorted;
if first.rate or last.rate;
if rate = 0;
count+1;
if count=3 then count=1;
lagdate=lag(tgtdt);
if count = 2 then do;
strtgtdt=lagdate;
endtgtdt=tgtdt;
end;
if count=2;
format strtgtdt endtgtdt date9.;
run;
proc sql;
create table want as
select a.id,reas,stdt,endt,strtgtdt,endtgtdt
from one a inner join
two2 b on
stdt < strtgtdt < endtgtdt <= endt;
proc sort data=want nodupkey;by id reas stdt endt;
@cathy_sas: So, now you've got two solutions. They yield identical results with the test data you provided. However, they do not yield identical results on all possible input data similar to datasets ONE and TWO one could think of, because they make different assumptions. Hence, they may or may not yield identical results with your real data.
Therefore, I would recommend that you test the suggested solutions with more input datasets and compare the results to your expectations.
In short, try to make your test data as realistic as possible.
Thank you for your brief explanation!!! it is really a good learning curve!!!! Yes i agreed with you my real data is more complicated than the sample data which i have it here. but definately your sample code is starting point... i am working and need to see more in to the data, any issues causing to get wrong values!!!
I have a small question , if i want to create a 2 variable prevrate and prvdt stdt < tgtdt , assign tgtdt & correcpoding rate
d reas stdt endt strtgtdt endtgtdt prvdt prevrate
100 xxxx 15NOV2013 28NOV2015 18NOV2013 29NOV2013 15NOV2013 5.000
100 yyyy 26FEB2014 05MAR2014 28FEB2014 10MAR2014 26FEB2014 5.000
Thanks
cathy
Hi Cathy,
I have prepared code that incorporates your "small question."
The extended specifications were a good opportunity for me to redesign the data step that prepares the second input dataset for the pivotal PROC SQL step. It is now more flexible in view of possible future requirements.
I am not quite sure if I interpreted your new sample result data and the pertinent description "create a 2 variable prevrate and prvdt stdt < tgtdt , assign tgtdt & correcpoding rate" correctly. This is because I don't see where the inequality "stdt < tgtdt" fits in there. In the example we have stdt=prvdt in both observations, but this might be coincidence.
Please elaborate on this point if you think that the new solution suggested below does not account for it.
Here's the new code:
/* Aggregate information from each block of obs. (per ID) with constant RATE */
data blocks;
do until(last.rate);
set two;
by id rate notsorted;
if first.rate then date1=tgtdt;
end;
datelast=tgtdt;
drop tgtdt;
format date: date9.;
run;
/* Focus on blocks with RATE=0 and add information from previous and next block */
data blocks0(drop=obsno rate);
set blocks;
by id;
obsno+1;
if rate=0 then do;
pnt=obsno-1;
if ~first.id then set blocks(rename=(date1 = prevdate1
datelast = prevdatelast
rate = prevrate)) point=pnt;
else call missing(of prev:);
pnt=obsno+1;
if ~last.id then set blocks(rename=(date1 = nextdate1
datelast = nextdatelast
rate = nextrate)) point=pnt;
else call missing(of next:);
output;
end;
drop datelast prevdate1 nextdatelast nextrate; /* not needed now, but perhaps with future specs */
run;
/* Create and show the desired output dataset */
proc sql;
create table want as
select x.*, strtgtdt, endtgtdt, prvdt, prevrate format=8.3
from one x left join
(select b.*, reas
from one a, blocks0(rename=(date1=strtgtdt nextdate1=endtgtdt prevdatelast=prvdt)) b
where a.id=b.id & .<stdt<strtgtdt
group by b.id, reas
having strtgtdt=min(strtgtdt)) y
on x.id=y.id & x.reas=y.reas;
quit;
proc print data=want width=min;
run;
Explanations will follow soon.
Explanation of revised code and datasets
Instead of the previous datasets DATE1 and DATE0, we have now more versatile datasets BLOCKS and BLOCKS0.
For each ID, dataset BLOCKS contains one observation per block of consecutive observations in dataset TWO with the same RATE (be it 0 or not). Variables are ID, RATE, DATE1 (=first TGTDT in the block) and DATELAST (=last TGTDT in the block). So, it's similar to the first version of the old dataset DATE0, but, unlike DATE0, it is not restricted to blocks with RATE=0. The reason is that we need information from blocks with RATE not equal to zero.
In contrast, dataset BLOCKS0 is restricted to blocks with RATE=0 (hence its name) and has again one obs. per block. It is created by the second data step, which reads the observations from dataset BLOCKS with RATE=0 and the previous and the following obs. of each of these observations. The look-back and look-ahead techniques used are the same: second SET statement with POINT= option.
As a result, BLOCKS0 contains the ID and DATE1 ("first date") values from each block (in dataset TWO) with RATE=0 and, in addition,
I think, the names of these three additional variables are self-explanatory: PREVRATE, PREVDATELAST and NEXTDATE1. Please note that the second data step creates even more variables: PREVDATE1, NEXTDATELAST and NEXTRATE. But these three are dropped at the end, as is DATELAST from the "zero rate observation", because they are not needed in your current requirements. Obviously, we could make these pieces of information (first date of previous block, last date and rate of next block, last date of current block) available immediately, if needed in future specifications.
Of course, I've made sure that "previous" and "following" blocks are only regarded within the same ID. This means, if the very first block of an ID had RATE=0, the PREVxxxx variables would have missing values. Similarly, the NEXTxxxx variables of a "zero rate" block would have missing values, if this was the last block of the respective ID.
Only minor changes were required for the final PROC SQL step: It now retrieves data from dataset BLOCKS0 rather than DATE0. The RENAME= option used provides flexibility with regard to the names of the new variables. Three variables from BLOCKS0 are renamed to match your specifications. The fourth new variable, PREVRATE, can keep its original name. The formatting of the latter with three decimals (using format 8.3) is, of course, optional and was just suggested by your sample data.
Again, please test the solution with more sample (or real) data.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.