Hi SAS Community,
After some dicussions, I obtain the code as described below:
data want;
set have;
by type;
array _date(11712) _temporary_;
array _ri_us(11712) _temporary_;
retain _date: _ri_us: start;
if first.type then do;
  call missing(of _date[*]);
  call missing(of _ri_us[*]);
 start = 0;
end;
start = start+1;
_date[start]=date;
_ri_us[start] = ri_us;
if last.type then do;
 do i = 1 to start;
    if i > 1 then do;
        if (((1 + _ri_us[i]/100)*( 1 +_ri_us[i-1]/100) -1) <= 0.5 and (_ri_us[i] > 100 or _ri_us[i-1] > 100))
        or (_ri_us[i] < 1 or _ri_us[i-1] < 1) then do;
              ri_us=.;
              date = _date[i];
              output;
         end;
         else do;
             ri_us=_ri_us[i];
              date = _date[i];
              output;
         end;
    end;
     else do;
          ri_us=_ri_us[i];
              date = _date[i];
              output;
     end;
 end; 
end;
run;And, yes, I receive the result for column ri_us. Unfortunately, all other columns are also affected, as described below:
dataset have before running this code:
Type date ri_us p_us up_us nosh vo
131712	01JAN1987	.	.	.	.	.
131712	02JAN1987	.	.	.	.	.
.
.
131712	30DEC2019	0.01	0	0	4420	.
131712	31DEC2019	0.01	0	0	4420	.
.
.
286383	12NOV2002	74.09	12.51	0.03	96476	.
286383	13NOV2002	73.76	12.45	0.03	96476	.
286383	14NOV2002	72.91	12.31	0.03	96476	.
.
.
884807	27DEC2019	.	0.1593	0.1593	26122	0.3
884807	30DEC2019	.	0.1593	0.1593	26122	0.3
884807	31DEC2019	.	0.1568	0.1568	26122	.Where column Type is character variable denotes the companies (around 8000 companies), date is ddmmyyyy9. from 1/1/1987 to 31/12/2019. ri_us p_us, up_us, nosh, and vo are other numeric variables.
After I run the code above, I got the results as below:
data want after running this code
Type date ri_us p_us up_us nosh vo start i
131712	01JAN1987	.	0	0	4420	.    8609   1
131712	02JAN1987	.	0	0	4420	.    8609   2
.
.
131712	30DEC2019	.	0	0	4420	.    8609   8608
131712	31DEC2019	.	0	0	4420	.    8609   8608
.
.
286383	12NOV2002	74.09	0	0	516	.    8609   4139
286383	13NOV2002	73.76	0	0	516	.    8609   4140
286383	14NOV2002	72.91	0	0	516	.    8609   4141
. 
.
884807	27DEC2019	.	0.1568	0.1568	26122	.    8609   8607
884807	30DEC2019	.	0.1568	0.1568	26122	.    8609   8608
884807	31DEC2019	.	0.1568	0.1568	26122	.    8609   8609
I do not know why I only deal with the column ri_us but the data columns from other column also change as well, my solution is that I will copy the dataset have, then I drop all the other columns, then I will run this code for this file, then I match again with the original dataset have by Type. However, I think it is not an optimal solution here. I am wondering if there is any chance that we can fix directly from the code above.
And there is another code like that:
data have;
set want;
 by type;
  lag_r =lag(ri_us);
  if first.type then lag_r=.;
  if ((1+lag_r/100)*(1+ri_us/100)-1) <= 0.5 and (lag_r>100 or ri_us<100) then r1=.;
  else r1=ri_us;
run;Do you think this set of code equals to the code above? I am wondering because when I run the proc means for ri_us for the first code and proc means for r1 for the second code, the results are different. But logically speaking, I think they are interchangeable.
Variable N     Mean    Std.Dev  Minimum Maximum
ri_us    44799 75.384  52.099   1       199.95
r1       50858 66.42   54.639   0       199.95Thank you very much.
I am not sure what you are trying to do but the structure looks good.
Because every time FIRST.TYPE is true one of the first two tests will be true.
It is the same as
if first.type then do ;
   if ri_us < 200 then r1=ri_us;
   else r1=.;
end;
else if .... then ...;
else ...;Your note: "I do not know why I only deal with the column ri_us but the data columns from other column also change as well" - all other variables are the value of the last.type observation.
And yes, using LAG instead the array will keep the right values.
It seems to me that the new code is the correct.
What happens in your first example is that you copy the contents of the other variables "backwards" from the last record read for that type.
The difference between the two steps comes because in the second version you do the calculation also for the first value for the type (corresponding to I=1 in the first step). I think you will get the same values from the two steps if you change the second step to:
data have;
set want;
 by type;
  lag_r =lag(ri_us);
  if first.type then r1=ri_us;
  else if ((1+lag_r/100)*(1+ri_us/100)-1) <= 0.5 and (lag_r>100 or ri_us<100) then r1=.;
  else r1=ri_us;
run;@s_lassen is right with small correction:
data want;
  set have;
 by type;
  lag_r =lag(ri_us);
  if first.type then r1=ri_us;
  else if ((1+lag_r/100)*(1+ri_us/100)-1) <= 0.5 and (lag_r>100 or ri_us<100) then r1=.;
  else r1=ri_us;
run;(it is data want; set have; - and not data have; set want; - which seems to be a typo)
I guess you are right - I just copied @Phil_NZ 's original program, and corrected what I thought was the source of error 🙂
Thank you for your comment and contribution; however, I saw that @s_lassen adjust one of my code (the row 6 in your code). You add "else" before the whole row. I try to put it into my program, and results are hard to explain to me:
My code:
data want;
set have;
 by type;
  lag_r =lag(ri_us);
  if first.type then r1=ri_us;
  if (((1+lag_r/100)*(1+ri_us/100)-1) <= 0.5 and (lag_r>100 or ri_us>100)) 
  or ( lag_r <1 or ri_us<1) 
  or ri_us >200
  then r1=.;
  else r1=ri_us;
run;The MEANS Procedure
Variable N           Mean          Std Dev       Minimum        Maximum
r1       1327187     40.3076144    40.3076144    1.0000000      200.0000000
And when applied your code (adding else to the row 6)
data want;
set have;
 by type;
  lag_r =lag(ri_us);
  if first.type then r1=ri_us;
  else if (((1+lag_r/100)*(1+ri_us/100)-1) <= 0.5 and (lag_r>100 or ri_us>100)) 
  or ( lag_r <1 or ri_us<1) 
  or ri_us >200
  then r1=.;
  else r1=ri_us;
run;The MEANS Procedure
Variable N           Mean          Std Dev       Minimum        Maximum
r1       1327192    40.3082593     41.8585964    1.0000000      423.0700000It seems that when we put "else" to the code like what you did, the results changed dramatically ( there is at least one variable greater than 200 (as you can see the maximum of r1 is 423 in the second case),.
It is quite weird to me, can you please help me to sort it out. Thanks and cheers.
Are you asking what is the difference between these two structures?
IF a THEN action1;
ELSE IF b THEN action2 ;
ELSE action3;And
IF a THEN action1;
IF b THEN action2 ;
ELSE action3;In the first one neither ACTION2 nor ACTION3 will happen if A is true.
In the second one no matter what value A has one of either ACTION2 or ACTION3 will happen.
Hi @Tom ,
Thank you for your reply, do you think the setting if...then.else below is acceptable?
Many thanks.
data want;
set have;
 by type;
  lag_r =lag(ri_us);
  if first.type and ri_us <200 then r1=ri_us;
  if first.type and ri_us >200 then r1=.;
  else if (((1+lag_r/100)*(1+ri_us/100)-1) <= 0.5 and (lag_r>100 or ri_us>100)) 
  or ( lag_r <1 or ri_us<1) 
  or ri_us >200
  then r1=.;
  else r1=ri_us;
run;I am not sure what you are trying to do but the structure looks good.
Because every time FIRST.TYPE is true one of the first two tests will be true.
It is the same as
if first.type then do ;
   if ri_us < 200 then r1=ri_us;
   else r1=.;
end;
else if .... then ...;
else ...;It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
