Solved: Re: Using array for one variable will affect all other variables?

Phil_NZ · Posted 07-10-2020 02:44 AM

Hi SAS Community,

After some dicussions, I obtain the code as described below:

data want;
set have;
by type;
array _date(11712) _temporary_;
array _ri_us(11712) _temporary_;
retain _date: _ri_us: start;
if first.type then do;
  call missing(of _date[*]);
  call missing(of _ri_us[*]);
 start = 0;
end;
start = start+1;
_date[start]=date;
_ri_us[start] = ri_us;
if last.type then do;
 do i = 1 to start;
    if i > 1 then do;
        if (((1 + _ri_us[i]/100)*( 1 +_ri_us[i-1]/100) -1) <= 0.5 and (_ri_us[i] > 100 or _ri_us[i-1] > 100))
        or (_ri_us[i] < 1 or _ri_us[i-1] < 1) then do;
              ri_us=.;
              date = _date[i];
              output;
         end;
         else do;
             ri_us=_ri_us[i];
              date = _date[i];
              output;
         end;
    end;
     else do;
          ri_us=_ri_us[i];
              date = _date[i];
              output;
     end;
 end; 
end;
run;

And, yes, I receive the result for column ri_us. Unfortunately, all other columns are also affected, as described below:

dataset have before running this code:

Type date ri_us p_us up_us nosh vo
131712	01JAN1987	.	.	.	.	.
131712	02JAN1987	.	.	.	.	.
.
.
131712	30DEC2019	0.01	0	0	4420	.
131712	31DEC2019	0.01	0	0	4420	.
.
.
286383	12NOV2002	74.09	12.51	0.03	96476	.
286383	13NOV2002	73.76	12.45	0.03	96476	.
286383	14NOV2002	72.91	12.31	0.03	96476	.
.
.
884807	27DEC2019	.	0.1593	0.1593	26122	0.3
884807	30DEC2019	.	0.1593	0.1593	26122	0.3
884807	31DEC2019	.	0.1568	0.1568	26122	.

Where column Type is character variable denotes the companies (around 8000 companies), date is ddmmyyyy9. from 1/1/1987 to 31/12/2019. ri_us p_us, up_us, nosh, and vo are other numeric variables.

After I run the code above, I got the results as below:

data want after running this code

Type date ri_us p_us up_us nosh vo start i
131712	01JAN1987	.	0	0	4420	.    8609   1
131712	02JAN1987	.	0	0	4420	.    8609   2
.
.
131712	30DEC2019	.	0	0	4420	.    8609   8608
131712	31DEC2019	.	0	0	4420	.    8609   8608
.
.
286383	12NOV2002	74.09	0	0	516	.    8609   4139
286383	13NOV2002	73.76	0	0	516	.    8609   4140
286383	14NOV2002	72.91	0	0	516	.    8609   4141
. 
.
884807	27DEC2019	.	0.1568	0.1568	26122	.    8609   8607
884807	30DEC2019	.	0.1568	0.1568	26122	.    8609   8608
884807	31DEC2019	.	0.1568	0.1568	26122	.    8609   8609

I do not know why I only deal with the column ri_us but the data columns from other column also change as well, my solution is that I will copy the dataset have, then I drop all the other columns, then I will run this code for this file, then I match again with the original dataset have by Type. However, I think it is not an optimal solution here. I am wondering if there is any chance that we can fix directly from the code above.

And there is another code like that:

data have;
set want;
 by type;
  lag_r =lag(ri_us);
  if first.type then lag_r=.;
  if ((1+lag_r/100)*(1+ri_us/100)-1) <= 0.5 and (lag_r>100 or ri_us<100) then r1=.;
  else r1=ri_us;
run;

Do you think this set of code equals to the code above? I am wondering because when I run the proc means for ri_us for the first code and proc means for r1 for the second code, the results are different. But logically speaking, I think they are interchangeable.

Variable N     Mean    Std.Dev  Minimum Maximum
ri_us    44799 75.384  52.099   1       199.95
r1       50858 66.42   54.639   0       199.95

Thank you very much.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

Tom · Posted 07-10-2020 11:12 PM

I am not sure what you are trying to do but the structure looks good.

Because every time FIRST.TYPE is true one of the first two tests will be true.

It is the same as

if first.type then do ;
   if ri_us < 200 then r1=ri_us;
   else r1=.;
end;
else if .... then ...;
else ...;

View solution in original post

Shmuel · Posted 07-10-2020 05:17 AM

Your note: "I do not know why I only deal with the column ri_us but the data columns from other column also change as well" - all other variables are the value of the last.type observation.

And yes, using LAG instead the array will keep the right values.

It seems to me that the new code is the correct.

s_lassen · Posted 07-10-2020 05:40 AM

What happens in your first example is that you copy the contents of the other variables "backwards" from the last record read for that type.

The difference between the two steps comes because in the second version you do the calculation also for the first value for the type (corresponding to I=1 in the first step). I think you will get the same values from the two steps if you change the second step to:

data have;
set want;
 by type;
  lag_r =lag(ri_us);
  if first.type then r1=ri_us;
  else if ((1+lag_r/100)*(1+ri_us/100)-1) <= 0.5 and (lag_r>100 or ri_us<100) then r1=.;
  else r1=ri_us;
run;

Shmuel · Posted 07-10-2020 07:37 AM

@s_lassen is right with small correction:

data want;
  set have;
 by type;
  lag_r =lag(ri_us);
  if first.type then r1=ri_us;
  else if ((1+lag_r/100)*(1+ri_us/100)-1) <= 0.5 and (lag_r>100 or ri_us<100) then r1=.;
  else r1=ri_us;
run;

(it is data want; set have; - and not data have; set want; - which seems to be a typo)

s_lassen · Posted 07-10-2020 10:50 AM

I guess you are right - I just copied @Phil_NZ 's original program, and corrected what I thought was the source of error 🙂

Phil_NZ · Posted 07-10-2020 04:12 PM

Hi @s_lassen and @Shmuel

Thank you for your comment and contribution; however, I saw that @s_lassen adjust one of my code (the row 6 in your code). You add "else" before the whole row. I try to put it into my program, and results are hard to explain to me:

My code:

data want;
set have;
 by type;
  lag_r =lag(ri_us);
  if first.type then r1=ri_us;
  if (((1+lag_r/100)*(1+ri_us/100)-1) <= 0.5 and (lag_r>100 or ri_us>100)) 
  or ( lag_r <1 or ri_us<1) 
  or ri_us >200
  then r1=.;
  else r1=ri_us;
run;

The MEANS Procedure
Variable N           Mean          Std Dev       Minimum        Maximum
r1       1327187     40.3076144    40.3076144    1.0000000      200.0000000

And when applied your code (adding else to the row 6)

data want;
set have;
 by type;
  lag_r =lag(ri_us);
  if first.type then r1=ri_us;
  else if (((1+lag_r/100)*(1+ri_us/100)-1) <= 0.5 and (lag_r>100 or ri_us>100)) 
  or ( lag_r <1 or ri_us<1) 
  or ri_us >200
  then r1=.;
  else r1=ri_us;
run;

The MEANS Procedure
Variable N           Mean          Std Dev       Minimum        Maximum
r1       1327192    40.3082593     41.8585964    1.0000000      423.0700000

It seems that when we put "else" to the code like what you did, the results changed dramatically ( there is at least one variable greater than 200 (as you can see the maximum of r1 is 423 in the second case),.
It is quite weird to me, can you please help me to sort it out. Thanks and cheers.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

Tom · Posted 07-10-2020 06:38 PM

Are you asking what is the difference between these two structures?

IF a THEN action1;
ELSE IF b THEN action2 ;
ELSE action3;

And

IF a THEN action1;

IF b THEN action2 ;
ELSE action3;

In the first one neither ACTION2 nor ACTION3 will happen if A is true.

In the second one no matter what value A has one of either ACTION2 or ACTION3 will happen.

Phil_NZ · Posted 07-10-2020 07:16 PM

Hi @Tom ,

Thank you for your reply, do you think the setting if...then.else below is acceptable?

Many thanks.

data want;
set have;
 by type;
  lag_r =lag(ri_us);
  if first.type and ri_us <200 then r1=ri_us;
  if first.type and ri_us >200 then r1=.;
  else if (((1+lag_r/100)*(1+ri_us/100)-1) <= 0.5 and (lag_r>100 or ri_us>100)) 
  or ( lag_r <1 or ri_us<1) 
  or ri_us >200
  then r1=.;
  else r1=ri_us;
run;

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

Tom · Posted 07-10-2020 11:12 PM

I am not sure what you are trying to do but the structure looks good.

Because every time FIRST.TYPE is true one of the first two tests will be true.

It is the same as

if first.type then do ;
   if ri_us < 200 then r1=ri_us;
   else r1=.;
end;
else if .... then ...;
else ...;

SAS Training: Just a Click Away