Solved: Re: Understanding Array function with a do loop and first.

sigma_exp · Posted 08-15-2019 01:30 PM

Hello,

I am trying to understand the array functionality within the context of do loop. However, in the following code, I am running into difficulty understanding how the array is being repleted with sequential values:

This is from IDRE website. This code produces the dataset which is used in the next section.

data long_array;
  set wide;
  array Afaminc(96:98) faminc96 - faminc98;
  do year = 96 to 98;
   faminc = Afaminc[year];
   output;
  end;
  drop faminc96-faminc98;
run;
proc print data=long_array;
run;

Obs    famid    year    faminc
 1       1       96      40000
 2       1       97      40500
 3       1       98      41000
 4       2       96      45000
 5       2       97      45400
 6       2       98      45800
 7       3       96      75000
 8       3       97      76000
 9       3       98      77000

This is where I am having difficulty:

proc sort data=long_array out=long_sort;
  by famid;
run;
data wide_array;
  set long_sort;
  by famid;
  retain faminc96-faminc98;
  array Afaminc(96:98) faminc96-faminc98;
  if first.famid then do;
    do i = 96 to 98;
      Afaminc[i] = .; /*initializing to missing*/
    end;
  end;
  Afaminc(year) = faminc; /*looping across values in the variable year*/  *if last.famid then output; /* outputs only the last obs in a family*/  drop year faminc i;
run;
proc print data=wide_array noobs;
run;

famid    faminc96    faminc97    faminc98
  1        40000           .           .
  1        40000       40500           .
  1        40000       40500       41000
  2        45000           .           .
  2        45000       45400           .
  2        45000       45400       45800
  3        75000           .           .
  3        75000       76000           .
  3        75000       76000       77000

specifically for this code snippet:

  retain faminc96-faminc98;
  array Afaminc(96:98) faminc96-faminc98;
  if first.famid then do;
    do i = 96 to 98;
      Afaminc[i] = .; /*initializing to missing*/
    end;
  end;
  Afaminc(year) = faminc; /*looping across values in the variable year*/

so I understand that the retain statement is for faminc96-faminc98 by assigning default values.

However, in the array Afaminc(96:98) when we assign those variables, they are still having missing values -- so why are we again assigning "." as commented in /*initializing to missing*/ in the do loop?

famid year faminc faminc96 faminc97 faminc98

1 96 40000 . . .

1 97 40500 . . .

1 98 41000 . . .

2 96 45000 . . .

2 97 45400 . . .

2 98 45800 . . .

3 96 75000 . . .

3 97 76000 . . .

3 98 77000 . . .

Second, what are we looping over in the Afaminc[i] expression? If this is by array index, then are we going over Afaminc[96] which is actually the faminc96?

Third, what is exactly if first.famid then do the purpose in this -- does that mean only for the first observation for famid, it will loop over 96 to 98? In some other SAS documents, it only uses do one time but here we have done twice --why? see this example:

DATA sum5 ;
 SET kids ;
  BY famid ;
 
  RETAIN sumwt cnt ;
 
  IF first.famid THEN
   DO;
    sumwt = 0;
    cnt   = 0;
   END;
 
  sumwt = sumwt + wt ;
  cnt = cnt + 1 ;
  meanwt = sumwt / cnt ;
 
  IF last.famid THEN OUTPUT;
 
  KEEP famid sumwt cnt meanwt ;
 
RUN;

****And lastly, this is the most challenging part so far. The Afaminc(year) = faminc; /*looping across values in the variable year*/ this actually fills out the empty array in a specific format which I was not quite able to follow. The looping mechanism is not very explicitly understood, however, it seems to have filled out the array in a specific pattern which I am having difficulty understanding.

famid	year	faminc	faminc96	faminc97	faminc98	i
1	96	40000	40000	.	.	99
1	97	40500	40000	40500	.	.
1	98	41000	40000	40500	41000	.
2	96	45000	45000	.	.	99
2	97	45400	45000	45400	.	.
2	98	45800	45000	45400	45800	.
3	96	75000	75000	.	.	99
3	97	76000	75000	76000	.	.
3	98	77000	75000	76000	77000	.

I would greatly appreciate if someone could point me towards the right resource or elaborate this code with a bit more background example/explanation.

Thanks a lot in advance!

PaigeMiller · Posted 08-15-2019 01:54 PM

However, in the array Afaminc(96:98) when we assign those variables, they are still having missing values -- so why are we again assigning "." as commented in /*initializing to missing*/ in the do loop?

It would appear that the array statement fills in missings for the first value of famid, and then the RETAIN causes the missings to be filled in on other records. You really don't need RETAIN, DO and FIRST. here, this is overkill.

Second, what are we looping over in the Afaminc[i] expression? If this is by array index, then are we going over Afaminc[96] which is actually the faminc96?

Yes

Third, what is exactly if first.famid then do the purpose in this -- does that mean only for the first observation for famid, it will loop over 96 to 98? In some other SAS documents, it only uses do one time but here we have done twice --why?

Yes, and then the RETAIN statement will fill in missings. As I said, a very redundant and inefficient way of programming.

And lastly, this is the most challenging part so far. The Afaminc(year) = faminc; /*looping across values in the variable year*/ this actually fills out the empty array in a specific format which I was not quite able to follow. The looping mechanism is not very explicitly understood, however, it seems to have filled out the array in a specific pattern which I am having difficulty understanding.

it is creating the "triangular" array of data that is shown. Afaminc(year) takes the value of each year, and the assigns the value of FAMINC to that element of the array. Again, this seems like a very convoluted way to program things.

--
Paige Miller

View solution in original post

PaigeMiller · Posted 08-15-2019 01:54 PM

However, in the array Afaminc(96:98) when we assign those variables, they are still having missing values -- so why are we again assigning "." as commented in /*initializing to missing*/ in the do loop?

It would appear that the array statement fills in missings for the first value of famid, and then the RETAIN causes the missings to be filled in on other records. You really don't need RETAIN, DO and FIRST. here, this is overkill.

Second, what are we looping over in the Afaminc[i] expression? If this is by array index, then are we going over Afaminc[96] which is actually the faminc96?

Yes

Third, what is exactly if first.famid then do the purpose in this -- does that mean only for the first observation for famid, it will loop over 96 to 98? In some other SAS documents, it only uses do one time but here we have done twice --why?

Yes, and then the RETAIN statement will fill in missings. As I said, a very redundant and inefficient way of programming.

And lastly, this is the most challenging part so far. The Afaminc(year) = faminc; /*looping across values in the variable year*/ this actually fills out the empty array in a specific format which I was not quite able to follow. The looping mechanism is not very explicitly understood, however, it seems to have filled out the array in a specific pattern which I am having difficulty understanding.

it is creating the "triangular" array of data that is shown. Afaminc(year) takes the value of each year, and the assigns the value of FAMINC to that element of the array. Again, this seems like a very convoluted way to program things.

--
Paige Miller

Reeza · Posted 08-15-2019 02:03 PM

Isn't first here resetting the array for each BY group. The first example doesn't need that because it's a single ID, but if you have more than one ID and are using RETAIN you have to reset them?

sigma_exp · Posted 08-15-2019 02:22 PM

Thank you very much Paige for the detailed reply. I am following through most of what you have explained. However, there is the last section that a bit more explanation might be great. For example,

"it is creating the "triangular" array of data that is shown. Afaminc(year) takes the value of each year, and the assigns the value of FAMINC to that element of the array. Again, this seems like a very convoluted way to program things."

So, Afaminc(year), let take year=96. Then the corresponding faminc value for famid=1 is 40000. Now, when it assigns the "value of FAMINC to that element of the array" -- is that Afaminc(96) [since "i" in the first iteration is 96, which actually faminc96]

and then how exactly are these "triangular" array of data being generated in that loop.

If it is not too taxing, would you please consider giving a small snippet based example of what would better way to achieve the same where it is programmatically not so distorted (sorry for using that term, but it was really giving me such hard time!).

Thank you again,

PaigeMiller · Posted 08-15-2019 02:32 PM

@sigma_exp wrote:

Thank you very much Paige for the detailed reply. I am following through most of what you have explained. However, there is the last section that a bit more explanation might be great. For example,

"it is creating the "triangular" array of data that is shown. Afaminc(year) takes the value of each year, and the assigns the value of FAMINC to that element of the array. Again, this seems like a very convoluted way to program things."

So, Afaminc(year), let take year=96. Then the corresponding faminc value for famid=1 is 40000. Now, when it assigns the "value of FAMINC to that element of the array" -- is that Afaminc(96) [since "i" in the first iteration is 96, which actually faminc96]

and then how exactly are these "triangular" array of data being generated in that loop.

One of the variables faminc96 faminc97 faminc98 is assigned a value each time the data step works with a input record. The rest are assigned by the RETAIN statement.

If it is not too taxing, would you please consider giving a small snippet based example of what would better way to achieve the same where it is programmatically not so distorted (sorry for using that term, but it was really giving me such hard time!).

Well, let me see here:

data want;
    set long_array;
    by famid;
    prev_faminc=lag(faminc);
    prev2_faminc=lag2(faminc);
    if first.famid then counter=0;
    counter+1;
    if counter=1 then faminc96=faminc;
    if counter=2 then do;
        faminc96=prev_faminc;
        faminc97=faminc;
     end;
     if counter=3 then do;
        faminc96=prev2_faminc;
        faminc97=prev_faminc;
        faminc98=faminc;
    end;
run;

Now I also want to state that I realize this is a learning exercise for you, and so the answer to the next question isn't specifically required here, but here goes ... What is the benefit of creating such a triangular arrangement of the data? I see no reason to do this, I can't think of a single analysis or display that would require such a triangular arrangement of the data.

--
Paige Miller

sigma_exp · Posted 08-15-2019 02:41 PM

ahh, the example above this one actually helps learn how it can be generated with enhanced clarity and bonus learning about the lag function. This is extremly helpful. Thank you!

Reeza · Posted 08-15-2019 03:10 PM

The triangular output is useless. This method is usually used in conjunction with LAST to keep only the last record of each group, which is essentially a transpose. The rationale for using this type of approach is when you need to restructure multiple variables at a time that would mean multiple calls to PROC TRANSPOSE + MERGE and/or wanting to calculate other statistics in the same step to avoid multiple passes of the data.

Reeza · Posted 08-15-2019 02:20 PM

*holds values across rows; 
 retain faminc96-faminc98;
*note the 96:98 which creates the index for the variables rather than 1:3, this way you can refer to it using the two digit year which is in the data set;
  array Afaminc(96:98) faminc96-faminc98;

*resets the array to missing for the start of new ids;
*you could combine the do loop here but not a huge deal;
  if first.famid then do;
    do i = 96 to 98;
      Afaminc[i] = .; /*initializing to missing*/
    end;
  end;

  *assigns value to the proper varialbe in the array using year as the index;
  *There is NO LOOP HERE;
  Afaminc(year) = faminc;

See my comments in the code above.

The UCLA tutorial on arrays itself is helpful here:

https://stats.idre.ucla.edu/sas/seminars/sas-arrays/

Spoiler

@sigma_exp wrote:

Hello,

I am trying to understand the array functionality within the context of do loop. However, in the following code, I am running into difficulty understanding how the array is being repleted with sequential values:

This is from IDRE website. This code produces the dataset which is used in the next section.
data long_array;
  set wide;
  array Afaminc(96:98) faminc96 - faminc98;
  do year = 96 to 98;
   faminc = Afaminc[year];
   output;
  end;
  drop faminc96-faminc98;
run;
proc print data=long_array;
run;

Obs    famid    year    faminc
 1       1       96      40000
 2       1       97      40500
 3       1       98      41000
 4       2       96      45000
 5       2       97      45400
 6       2       98      45800
 7       3       96      75000
 8       3       97      76000
 9       3       98      77000
This is where I am having difficulty:
proc sort data=long_array out=long_sort;
  by famid;
run;
data wide_array;
  set long_sort;
  by famid;
  retain faminc96-faminc98;
  array Afaminc(96:98) faminc96-faminc98;
  if first.famid then do;
    do i = 96 to 98;
      Afaminc[i] = .; /*initializing to missing*/
    end;
  end;
  Afaminc(year) = faminc; /*looping across values in the variable year*/  *if last.famid then output; /* outputs only the last obs in a family*/  drop year faminc i;
run;
proc print data=wide_array noobs;
run;

famid    faminc96    faminc97    faminc98
  1        40000           .           .
  1        40000       40500           .
  1        40000       40500       41000
  2        45000           .           .
  2        45000       45400           .
  2        45000       45400       45800
  3        75000           .           .
  3        75000       76000           .
  3        75000       76000       77000
specifically for this code snippet:
  retain faminc96-faminc98;
  array Afaminc(96:98) faminc96-faminc98;
  if first.famid then do;
    do i = 96 to 98;
      Afaminc[i] = .; /*initializing to missing*/
    end;
  end;
  Afaminc(year) = faminc; /*looping across values in the variable year*/
so I understand that the retain statement is for faminc96-faminc98 by assigning default values.

However, in the array Afaminc(96:98) when we assign those variables, they are still having missing values -- so why are we again assigning "." as commented in /*initializing to missing*/ in the do loop?

famid        year        faminc        faminc96        faminc97        faminc98

1        96        40000        .        .        .

1        97        40500        .        .        .

1        98        41000        .        .        .

2        96        45000        .        .        .

2        97        45400        .        .        .

2        98        45800        .        .        .

3        96        75000        .        .        .

3        97        76000        .        .        .

3        98        77000        .        .        .
Second, what are we looping over in the Afaminc[i] expression? If this is by array index, then are we going over Afaminc[96] which is actually the faminc96?
Third, what is exactly if first.famid then do the purpose in this -- does that mean only for the first observation for famid, it will loop over 96 to 98? In some other SAS documents, it only uses do one time but here we have done twice --why? see this example:
DATA sum5 ;
 SET kids ;
  BY famid ;
 
  RETAIN sumwt cnt ;
 
  IF first.famid THEN
   DO;
    sumwt = 0;
    cnt   = 0;
   END;
 
  sumwt = sumwt + wt ;
  cnt = cnt + 1 ;
  meanwt = sumwt / cnt ;
 
  IF last.famid THEN OUTPUT;
 
  KEEP famid sumwt cnt meanwt ;
 
RUN;
****And lastly, this is the most challenging part so far. The Afaminc(year) = faminc; /*looping across values in the variable year*/ this actually fills out the empty array in a specific format which I was not quite able to follow. The looping mechanism is not very explicitly understood, however, it seems to have filled out the array in a specific pattern which I am having difficulty understanding.
famid

year

faminc

faminc96

faminc97

faminc98

i

1

96

40000

40000

.

.

99

1

97

40500

40000

40500

.

.

1

98

41000

40000

40500

41000

.

2

96

45000

45000

.

.

99

2

97

45400

45000

45400

.

.

2

98

45800

45000

45400

45800

.

3

96

75000

75000

.

.

99

3

97

76000

75000

76000

.

.

3

98

77000

75000

76000

77000

.

I would greatly appreciate if someone could point me towards the right resource or elaborate this code with a bit more background example/explanation.

Thanks a lot in advance!

wrote: Hello, I am trying to understand the array functionality within the context of do loop. However, in the following code, I am running into difficulty understanding how the array is being repleted with sequential values: This is from IDRE website. This code produces the dataset which is used in the next section. data long_array; set wide; array Afaminc(96:98) faminc96 - faminc98; do year = 96 to 98; faminc = Afaminc[year]; output; end; drop faminc96-faminc98; run; proc print data=long_array; run; Obs famid year faminc 1 1 96 40000 2 1 97 40500 3 1 98 41000 4 2 96 45000 5 2 97 45400 6 2 98 45800 7 3 96 75000 8 3 97 76000 9 3 98 77000 This is where I am having difficulty: proc sort data=long_array out=long_sort; by famid; run; data wide_array; set long_sort; by famid; retain faminc96-faminc98; array Afaminc(96:98) faminc96-faminc98; if first.famid then do; do i = 96 to 98; Afaminc[i] = .; /*initializing to missing*/ end; end; Afaminc(year) = faminc; /*looping across values in the variable year*/ *if last.famid then output; /* outputs only the last obs in a family*/ drop year faminc i; run; proc print data=wide_array noobs; run; famid faminc96 faminc97 faminc98 1 40000 . . 1 40000 40500 . 1 40000 40500 41000 2 45000 . . 2 45000 45400 . 2 45000 45400 45800 3 75000 . . 3 75000 76000 . 3 75000 76000 77000 specifically for this code snippet: retain faminc96-faminc98; array Afaminc(96:98) faminc96-faminc98; if first.famid then do; do i = 96 to 98; Afaminc[i] = .; /*initializing to missing*/ end; end; Afaminc(year) = faminc; /*looping across values in the variable year*/ so I understand that the retain statement is for faminc96-faminc98 by assigning default values. However, in the array Afaminc(96:98) when we assign those variables, they are still having missing values -- so why are we again assigning "." as commented in /*initializing to missing*/ in the do loop? famid year faminc faminc96 faminc97 faminc98 1 96 40000 . . . 1 97 40500 . . . 1 98 41000 . . . 2 96 45000 . . . 2 97 45400 . . . 2 98 45800 . . . 3 96 75000 . . . 3 97 76000 . . . 3 98 77000 . . . Second, what are we looping over in the Afaminc[i] expression? If this is by array index, then are we going over Afaminc[96] which is actually the faminc96? Third, what is exactly if first.famid then do the purpose in this -- does that mean only for the first observation for famid, it will loop over 96 to 98? In some other SAS documents, it only uses do one time but here we have done twice --why? see this example: DATA sum5 ; SET kids ; BY famid ; RETAIN sumwt cnt ; IF first.famid THEN DO; sumwt = 0; cnt = 0; END; sumwt = sumwt + wt ; cnt = cnt + 1 ; meanwt = sumwt / cnt ; IF last.famid THEN OUTPUT; KEEP famid sumwt cnt meanwt ; RUN; ****And lastly, this is the most challenging part so far. The Afaminc(year) = faminc; /*looping across values in the variable year*/ this actually fills out the empty array in a specific format which I was not quite able to follow. The looping mechanism is not very explicitly understood, however, it seems to have filled out the array in a specific pattern which I am having difficulty understanding. famid year faminc faminc96 faminc97 faminc98 i 1 96 40000 40000 . . 99 1 97 40500 40000 40500 . . 1 98 41000 40000 40500 41000 . 2 96 45000 45000 . . 99 2 97 45400 45000 45400 . . 2 98 45800 45000 45400 45800 . 3 96 75000 75000 . . 99 3 97 76000 75000 76000 . . 3 98 77000 75000 76000 77000 . I would greatly appreciate if someone could point me towards the right resource or elaborate this code with a bit more background example/explanation. Thanks a lot in advance!

sigma_exp · Posted 08-15-2019 02:33 PM

Thank you so much Reza! So probably that was a mistake in the main code w.r.t (Afaminc(year) = faminc; . However,
*assigns value to the proper variable in the array using the year as the index;
*There is NO LOOP HERE;
Afaminc(year) = faminc; -- how does that achieve the traingular pattern to fill out the array -- could you please elaborate

Reeza · Posted 08-15-2019 03:14 PM

@sigma_exp wrote:

Thank you so much Reza! So probably that was a mistake in the main code w.r.t (Afaminc(year) = faminc; . However,
*assigns value to the proper variable in the array using the year as the index;
*There is NO LOOP HERE;
Afaminc(year) = faminc; -- how does that achieve the traingular pattern to fill out the array -- could you please elaborate

At each row it moves that value to the correct variable and stores all the previous values.

The code is correct, the COMMENT is wrong.

Year  Value  faminc96 faminc97 faminc98
96      100    100
97      101    100           101
98      102     100          101         102    <- keep only this record at the end with last.

First pass through, the year is 96 and the value gets assigned to faminc96

Second row, year is 97, value gets assigned from 97 to faminc 97 and faminc96 is still kept because of the retain

Repeat until the last obs of the by group.

The data step itself is the loop here, there's no explicit do loop.

sigma_exp · Posted 08-15-2019 03:49 PM

okay....so it was the data step during the iteration which I was missing earlier, thank you for explaining. But in the code, it only matches to the index in the array: Afaminc(year) and corresponding faminc value gets repleted. However, how does that get matched across three other columns--faminc96 -faminc98? If SAS matches it internally, then shouldn't it either match by name or by position? But the faminc value does not meet either of them--neither it matches "faminc96" column header nor its position. The same can be said for the other two columns as well. Or is it the SAS convention to replace a value across a row (matched by index) with a specific value? I am probably missing something here.

Reeza · Posted 08-15-2019 04:47 PM

Year = 96, 97 and 98

afaminc is a an array with the indexes of 96, 97 and 98 - see my first answer to you.

so afaminc(96) -> is actually a short cut to faminc96
afaminc(97) -> short cut to faminc97.

As the data step loops it moves each year into the correct position and remembers the previous values until you explicitly reset them all at the FIRST condition. You should add a bunch of PUT statements in there in various places and see what's happening in each step if you want to trace it all through.

sigma_exp · Posted 08-16-2019 10:08 AM

okay...thanks a lot-- I will work through the %PUT and see what it does. Much appreciated Reeza for the advice

Tom · Posted 08-16-2019 10:42 AM

@sigma_exp wrote:
okay...thanks a lot-- I will work through the %PUT and see what it does. Much appreciated Reeza for the advice

Macro %PUT is not going to be of much value with that program. It is not using any macro variables.

You should the normal data step statement PUT instead. Then you can output values of the dataset variables.

sigma_exp · Posted 08-16-2019 12:15 PM

Okay, thank you Tom for pointing it out... I will edit my response.

Ready to join fellow brilliant minds for the SAS Hackathon?