Hi SAS Community,
After some dicussions, I obtain the code as described below:
data want;
set have;
by type;
array _date(11712) _temporary_;
array _ri_us(11712) _temporary_;
retain _date: _ri_us: start;
if first.type then do;
call missing(of _date[*]);
call missing(of _ri_us[*]);
start = 0;
end;
start = start+1;
_date[start]=date;
_ri_us[start] = ri_us;
if last.type then do;
do i = 1 to start;
if i > 1 then do;
if (((1 + _ri_us[i]/100)*( 1 +_ri_us[i-1]/100) -1) <= 0.5 and (_ri_us[i] > 100 or _ri_us[i-1] > 100))
or (_ri_us[i] < 1 or _ri_us[i-1] < 1) then do;
ri_us=.;
date = _date[i];
output;
end;
else do;
ri_us=_ri_us[i];
date = _date[i];
output;
end;
end;
else do;
ri_us=_ri_us[i];
date = _date[i];
output;
end;
end;
end;
run;
And, yes, I receive the result for column ri_us. Unfortunately, all other columns are also affected, as described below:
dataset have before running this code:
Type date ri_us p_us up_us nosh vo
131712 01JAN1987 . . . . .
131712 02JAN1987 . . . . .
.
.
131712 30DEC2019 0.01 0 0 4420 .
131712 31DEC2019 0.01 0 0 4420 .
.
.
286383 12NOV2002 74.09 12.51 0.03 96476 .
286383 13NOV2002 73.76 12.45 0.03 96476 .
286383 14NOV2002 72.91 12.31 0.03 96476 .
.
.
884807 27DEC2019 . 0.1593 0.1593 26122 0.3
884807 30DEC2019 . 0.1593 0.1593 26122 0.3
884807 31DEC2019 . 0.1568 0.1568 26122 .
Where column Type is character variable denotes the companies (around 8000 companies), date is ddmmyyyy9. from 1/1/1987 to 31/12/2019. ri_us p_us, up_us, nosh, and vo are other numeric variables.
After I run the code above, I got the results as below:
data want after running this code
Type date ri_us p_us up_us nosh vo start i
131712 01JAN1987 . 0 0 4420 . 8609 1
131712 02JAN1987 . 0 0 4420 . 8609 2
.
.
131712 30DEC2019 . 0 0 4420 . 8609 8608
131712 31DEC2019 . 0 0 4420 . 8609 8608
.
.
286383 12NOV2002 74.09 0 0 516 . 8609 4139
286383 13NOV2002 73.76 0 0 516 . 8609 4140
286383 14NOV2002 72.91 0 0 516 . 8609 4141
.
.
884807 27DEC2019 . 0.1568 0.1568 26122 . 8609 8607
884807 30DEC2019 . 0.1568 0.1568 26122 . 8609 8608
884807 31DEC2019 . 0.1568 0.1568 26122 . 8609 8609
I do not know why I only deal with the column ri_us but the data columns from other column also change as well, my solution is that I will copy the dataset have, then I drop all the other columns, then I will run this code for this file, then I match again with the original dataset have by Type. However, I think it is not an optimal solution here. I am wondering if there is any chance that we can fix directly from the code above.
And there is another code like that:
data have;
set want;
by type;
lag_r =lag(ri_us);
if first.type then lag_r=.;
if ((1+lag_r/100)*(1+ri_us/100)-1) <= 0.5 and (lag_r>100 or ri_us<100) then r1=.;
else r1=ri_us;
run;
Do you think this set of code equals to the code above? I am wondering because when I run the proc means for ri_us for the first code and proc means for r1 for the second code, the results are different. But logically speaking, I think they are interchangeable.
Variable N Mean Std.Dev Minimum Maximum
ri_us 44799 75.384 52.099 1 199.95
r1 50858 66.42 54.639 0 199.95
Thank you very much.
I am not sure what you are trying to do but the structure looks good.
Because every time FIRST.TYPE is true one of the first two tests will be true.
It is the same as
if first.type then do ;
if ri_us < 200 then r1=ri_us;
else r1=.;
end;
else if .... then ...;
else ...;
Your note: "I do not know why I only deal with the column ri_us but the data columns from other column also change as well" - all other variables are the value of the last.type observation.
And yes, using LAG instead the array will keep the right values.
It seems to me that the new code is the correct.
What happens in your first example is that you copy the contents of the other variables "backwards" from the last record read for that type.
The difference between the two steps comes because in the second version you do the calculation also for the first value for the type (corresponding to I=1 in the first step). I think you will get the same values from the two steps if you change the second step to:
data have;
set want;
by type;
lag_r =lag(ri_us);
if first.type then r1=ri_us;
else if ((1+lag_r/100)*(1+ri_us/100)-1) <= 0.5 and (lag_r>100 or ri_us<100) then r1=.;
else r1=ri_us;
run;
@s_lassen is right with small correction:
data want;
set have;
by type;
lag_r =lag(ri_us);
if first.type then r1=ri_us;
else if ((1+lag_r/100)*(1+ri_us/100)-1) <= 0.5 and (lag_r>100 or ri_us<100) then r1=.;
else r1=ri_us;
run;
(it is data want; set have; - and not data have; set want; - which seems to be a typo)
I guess you are right - I just copied @Phil_NZ 's original program, and corrected what I thought was the source of error 🙂
Thank you for your comment and contribution; however, I saw that @s_lassen adjust one of my code (the row 6 in your code). You add "else" before the whole row. I try to put it into my program, and results are hard to explain to me:
My code:
data want;
set have;
by type;
lag_r =lag(ri_us);
if first.type then r1=ri_us;
if (((1+lag_r/100)*(1+ri_us/100)-1) <= 0.5 and (lag_r>100 or ri_us>100))
or ( lag_r <1 or ri_us<1)
or ri_us >200
then r1=.;
else r1=ri_us;
run;
The MEANS Procedure
Variable N Mean Std Dev Minimum Maximum
r1 1327187 40.3076144 40.3076144 1.0000000 200.0000000
And when applied your code (adding else to the row 6)
data want;
set have;
by type;
lag_r =lag(ri_us);
if first.type then r1=ri_us;
else if (((1+lag_r/100)*(1+ri_us/100)-1) <= 0.5 and (lag_r>100 or ri_us>100))
or ( lag_r <1 or ri_us<1)
or ri_us >200
then r1=.;
else r1=ri_us;
run;
The MEANS Procedure
Variable N Mean Std Dev Minimum Maximum
r1 1327192 40.3082593 41.8585964 1.0000000 423.0700000
It seems that when we put "else" to the code like what you did, the results changed dramatically ( there is at least one variable greater than 200 (as you can see the maximum of r1 is 423 in the second case),.
It is quite weird to me, can you please help me to sort it out. Thanks and cheers.
Are you asking what is the difference between these two structures?
IF a THEN action1;
ELSE IF b THEN action2 ;
ELSE action3;
And
IF a THEN action1;
IF b THEN action2 ;
ELSE action3;
In the first one neither ACTION2 nor ACTION3 will happen if A is true.
In the second one no matter what value A has one of either ACTION2 or ACTION3 will happen.
Hi @Tom ,
Thank you for your reply, do you think the setting if...then.else below is acceptable?
Many thanks.
data want;
set have;
by type;
lag_r =lag(ri_us);
if first.type and ri_us <200 then r1=ri_us;
if first.type and ri_us >200 then r1=.;
else if (((1+lag_r/100)*(1+ri_us/100)-1) <= 0.5 and (lag_r>100 or ri_us>100))
or ( lag_r <1 or ri_us<1)
or ri_us >200
then r1=.;
else r1=ri_us;
run;
I am not sure what you are trying to do but the structure looks good.
Because every time FIRST.TYPE is true one of the first two tests will be true.
It is the same as
if first.type then do ;
if ri_us < 200 then r1=ri_us;
else r1=.;
end;
else if .... then ...;
else ...;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.