Hello,
I am trying to understand the array functionality within the context of do loop. However, in the following code, I am running into difficulty understanding how the array is being repleted with sequential values:
This is from IDRE website. This code produces the dataset which is used in the next section.
data long_array; set wide; array Afaminc(96:98) faminc96 - faminc98; do year = 96 to 98; faminc = Afaminc[year]; output; end; drop faminc96-faminc98; run; proc print data=long_array; run; Obs famid year faminc 1 1 96 40000 2 1 97 40500 3 1 98 41000 4 2 96 45000 5 2 97 45400 6 2 98 45800 7 3 96 75000 8 3 97 76000 9 3 98 77000
This is where I am having difficulty:
proc sort data=long_array out=long_sort; by famid; run; data wide_array; set long_sort; by famid; retain faminc96-faminc98; array Afaminc(96:98) faminc96-faminc98; if first.famid then do; do i = 96 to 98; Afaminc[i] = .; /*initializing to missing*/ end; end; Afaminc(year) = faminc; /*looping across values in the variable year*/ *if last.famid then output; /* outputs only the last obs in a family*/ drop year faminc i; run; proc print data=wide_array noobs; run; famid faminc96 faminc97 faminc98 1 40000 . . 1 40000 40500 . 1 40000 40500 41000 2 45000 . . 2 45000 45400 . 2 45000 45400 45800 3 75000 . . 3 75000 76000 . 3 75000 76000 77000
specifically for this code snippet:
retain faminc96-faminc98; array Afaminc(96:98) faminc96-faminc98; if first.famid then do; do i = 96 to 98; Afaminc[i] = .; /*initializing to missing*/ end; end; Afaminc(year) = faminc; /*looping across values in the variable year*/
so I understand that the retain statement is for faminc96-faminc98 by assigning default values.
famid year faminc faminc96 faminc97 faminc98
1 96 40000 . . .
1 97 40500 . . .
1 98 41000 . . .
2 96 45000 . . .
2 97 45400 . . .
2 98 45800 . . .
3 96 75000 . . .
3 97 76000 . . .
3 98 77000 . . .
DATA sum5 ; SET kids ; BY famid ; RETAIN sumwt cnt ; IF first.famid THEN DO; sumwt = 0; cnt = 0; END; sumwt = sumwt + wt ; cnt = cnt + 1 ; meanwt = sumwt / cnt ; IF last.famid THEN OUTPUT; KEEP famid sumwt cnt meanwt ; RUN;
famid | year | faminc | faminc96 | faminc97 | faminc98 | i |
1 | 96 | 40000 | 40000 | . | . | 99 |
1 | 97 | 40500 | 40000 | 40500 | . | . |
1 | 98 | 41000 | 40000 | 40500 | 41000 | . |
2 | 96 | 45000 | 45000 | . | . | 99 |
2 | 97 | 45400 | 45000 | 45400 | . | . |
2 | 98 | 45800 | 45000 | 45400 | 45800 | . |
3 | 96 | 75000 | 75000 | . | . | 99 |
3 | 97 | 76000 | 75000 | 76000 | . | . |
3 | 98 | 77000 | 75000 | 76000 | 77000 | . |
I would greatly appreciate if someone could point me towards the right resource or elaborate this code with a bit more background example/explanation.
Thanks a lot in advance!
However, in the array Afaminc(96:98) when we assign those variables, they are still having missing values -- so why are we again assigning "." as commented in /*initializing to missing*/ in the do loop?
It would appear that the array statement fills in missings for the first value of famid, and then the RETAIN causes the missings to be filled in on other records. You really don't need RETAIN, DO and FIRST. here, this is overkill.
Second, what are we looping over in the Afaminc[i] expression? If this is by array index, then are we going over Afaminc[96] which is actually the faminc96?
Yes
Third, what is exactly if first.famid then do the purpose in this -- does that mean only for the first observation for famid, it will loop over 96 to 98? In some other SAS documents, it only uses do one time but here we have done twice --why?
Yes, and then the RETAIN statement will fill in missings. As I said, a very redundant and inefficient way of programming.
And lastly, this is the most challenging part so far. The Afaminc(year) = faminc; /*looping across values in the variable year*/ this actually fills out the empty array in a specific format which I was not quite able to follow. The looping mechanism is not very explicitly understood, however, it seems to have filled out the array in a specific pattern which I am having difficulty understanding.
it is creating the "triangular" array of data that is shown. Afaminc(year) takes the value of each year, and the assigns the value of FAMINC to that element of the array. Again, this seems like a very convoluted way to program things.
However, in the array Afaminc(96:98) when we assign those variables, they are still having missing values -- so why are we again assigning "." as commented in /*initializing to missing*/ in the do loop?
It would appear that the array statement fills in missings for the first value of famid, and then the RETAIN causes the missings to be filled in on other records. You really don't need RETAIN, DO and FIRST. here, this is overkill.
Second, what are we looping over in the Afaminc[i] expression? If this is by array index, then are we going over Afaminc[96] which is actually the faminc96?
Yes
Third, what is exactly if first.famid then do the purpose in this -- does that mean only for the first observation for famid, it will loop over 96 to 98? In some other SAS documents, it only uses do one time but here we have done twice --why?
Yes, and then the RETAIN statement will fill in missings. As I said, a very redundant and inefficient way of programming.
And lastly, this is the most challenging part so far. The Afaminc(year) = faminc; /*looping across values in the variable year*/ this actually fills out the empty array in a specific format which I was not quite able to follow. The looping mechanism is not very explicitly understood, however, it seems to have filled out the array in a specific pattern which I am having difficulty understanding.
it is creating the "triangular" array of data that is shown. Afaminc(year) takes the value of each year, and the assigns the value of FAMINC to that element of the array. Again, this seems like a very convoluted way to program things.
Thank you very much Paige for the detailed reply. I am following through most of what you have explained. However, there is the last section that a bit more explanation might be great. For example,
"it is creating the "triangular" array of data that is shown. Afaminc(year) takes the value of each year, and the assigns the value of FAMINC to that element of the array. Again, this seems like a very convoluted way to program things."
So, Afaminc(year), let take year=96. Then the corresponding faminc value for famid=1 is 40000. Now, when it assigns the "value of FAMINC to that element of the array" -- is that Afaminc(96) [since "i" in the first iteration is 96, which actually faminc96]
and then how exactly are these "triangular" array of data being generated in that loop.
If it is not too taxing, would you please consider giving a small snippet based example of what would better way to achieve the same where it is programmatically not so distorted (sorry for using that term, but it was really giving me such hard time!).
Thank you again,
@sigma_exp wrote:
Thank you very much Paige for the detailed reply. I am following through most of what you have explained. However, there is the last section that a bit more explanation might be great. For example,
"it is creating the "triangular" array of data that is shown. Afaminc(year) takes the value of each year, and the assigns the value of FAMINC to that element of the array. Again, this seems like a very convoluted way to program things."
So, Afaminc(year), let take year=96. Then the corresponding faminc value for famid=1 is 40000. Now, when it assigns the "value of FAMINC to that element of the array" -- is that Afaminc(96) [since "i" in the first iteration is 96, which actually faminc96]
and then how exactly are these "triangular" array of data being generated in that loop.
One of the variables faminc96 faminc97 faminc98 is assigned a value each time the data step works with a input record. The rest are assigned by the RETAIN statement.
If it is not too taxing, would you please consider giving a small snippet based example of what would better way to achieve the same where it is programmatically not so distorted (sorry for using that term, but it was really giving me such hard time!).
Well, let me see here:
data want;
set long_array;
by famid;
prev_faminc=lag(faminc);
prev2_faminc=lag2(faminc);
if first.famid then counter=0;
counter+1;
if counter=1 then faminc96=faminc;
if counter=2 then do;
faminc96=prev_faminc;
faminc97=faminc;
end;
if counter=3 then do;
faminc96=prev2_faminc;
faminc97=prev_faminc;
faminc98=faminc;
end;
run;
Now I also want to state that I realize this is a learning exercise for you, and so the answer to the next question isn't specifically required here, but here goes ... What is the benefit of creating such a triangular arrangement of the data? I see no reason to do this, I can't think of a single analysis or display that would require such a triangular arrangement of the data.
*holds values across rows; retain faminc96-faminc98; *note the 96:98 which creates the index for the variables rather than 1:3, this way you can refer to it using the two digit year which is in the data set; array Afaminc(96:98) faminc96-faminc98; *resets the array to missing for the start of new ids; *you could combine the do loop here but not a huge deal; if first.famid then do; do i = 96 to 98; Afaminc[i] = .; /*initializing to missing*/ end; end; *assigns value to the proper varialbe in the array using year as the index; *There is NO LOOP HERE; Afaminc(year) = faminc;
See my comments in the code above.
The UCLA tutorial on arrays itself is helpful here:
https://stats.idre.ucla.edu/sas/seminars/sas-arrays/
@sigma_exp wrote:
Hello,
I am trying to understand the array functionality within the context of do loop. However, in the following code, I am running into difficulty understanding how the array is being repleted with sequential values:
This is from IDRE website. This code produces the dataset which is used in the next section.
data long_array; set wide; array Afaminc(96:98) faminc96 - faminc98; do year = 96 to 98; faminc = Afaminc[year]; output; end; drop faminc96-faminc98; run; proc print data=long_array; run; Obs famid year faminc 1 1 96 40000 2 1 97 40500 3 1 98 41000 4 2 96 45000 5 2 97 45400 6 2 98 45800 7 3 96 75000 8 3 97 76000 9 3 98 77000
This is where I am having difficulty:
proc sort data=long_array out=long_sort; by famid; run; data wide_array; set long_sort; by famid; retain faminc96-faminc98; array Afaminc(96:98) faminc96-faminc98; if first.famid then do; do i = 96 to 98; Afaminc[i] = .; /*initializing to missing*/ end; end; Afaminc(year) = faminc; /*looping across values in the variable year*/ *if last.famid then output; /* outputs only the last obs in a family*/ drop year faminc i; run; proc print data=wide_array noobs; run; famid faminc96 faminc97 faminc98 1 40000 . . 1 40000 40500 . 1 40000 40500 41000 2 45000 . . 2 45000 45400 . 2 45000 45400 45800 3 75000 . . 3 75000 76000 . 3 75000 76000 77000
specifically for this code snippet:
retain faminc96-faminc98; array Afaminc(96:98) faminc96-faminc98; if first.famid then do; do i = 96 to 98; Afaminc[i] = .; /*initializing to missing*/ end; end; Afaminc(year) = faminc; /*looping across values in the variable year*/
so I understand that the retain statement is for faminc96-faminc98 by assigning default values.
- However, in the array Afaminc(96:98) when we assign those variables, they are still having missing values -- so why are we again assigning "." as commented in /*initializing to missing*/ in the do loop?
famid year faminc faminc96 faminc97 faminc98
1 96 40000 . . .
1 97 40500 . . .
1 98 41000 . . .
2 96 45000 . . .
2 97 45400 . . .
2 98 45800 . . .
3 96 75000 . . .
3 97 76000 . . .
3 98 77000 . . .
- Second, what are we looping over in the Afaminc[i] expression? If this is by array index, then are we going over Afaminc[96] which is actually the faminc96?
- Third, what is exactly if first.famid then do the purpose in this -- does that mean only for the first observation for famid, it will loop over 96 to 98? In some other SAS documents, it only uses do one time but here we have done twice --why? see this example:
DATA sum5 ; SET kids ; BY famid ; RETAIN sumwt cnt ; IF first.famid THEN DO; sumwt = 0; cnt = 0; END; sumwt = sumwt + wt ; cnt = cnt + 1 ; meanwt = sumwt / cnt ; IF last.famid THEN OUTPUT; KEEP famid sumwt cnt meanwt ; RUN;- ****And lastly, this is the most challenging part so far. The Afaminc(year) = faminc; /*looping across values in the variable year*/ this actually fills out the empty array in a specific format which I was not quite able to follow. The looping mechanism is not very explicitly understood, however, it seems to have filled out the array in a specific pattern which I am having difficulty understanding.
famid
year
faminc
faminc96
faminc97
faminc98
i
1
96
40000
40000
.
.
99
1
97
40500
40000
40500
.
.
1
98
41000
40000
40500
41000
.
2
96
45000
45000
.
.
99
2
97
45400
45000
45400
.
.
2
98
45800
45000
45400
45800
.
3
96
75000
75000
.
.
99
3
97
76000
75000
76000
.
.
3
98
77000
75000
76000
77000
.
I would greatly appreciate if someone could point me towards the right resource or elaborate this code with a bit more background example/explanation.
Thanks a lot in advance!
Thank you so much Reza! So probably that was a mistake in the main code w.r.t (Afaminc(year) = faminc; . However,
*assigns value to the proper variable in the array using the year as the index;
*There is NO LOOP HERE;
Afaminc(year) = faminc; -- how does that achieve the traingular pattern to fill out the array -- could you please elaborate
@sigma_exp wrote:
Thank you so much Reza! So probably that was a mistake in the main code w.r.t (Afaminc(year) = faminc; . However,
*assigns value to the proper variable in the array using the year as the index;
*There is NO LOOP HERE;
Afaminc(year) = faminc; -- how does that achieve the traingular pattern to fill out the array -- could you please elaborate
At each row it moves that value to the correct variable and stores all the previous values.
The code is correct, the COMMENT is wrong.
Year Value faminc96 faminc97 faminc98
96 100 100
97 101 100 101
98 102 100 101 102 <- keep only this record at the end with last.
First pass through, the year is 96 and the value gets assigned to faminc96
Second row, year is 97, value gets assigned from 97 to faminc 97 and faminc96 is still kept because of the retain
Repeat until the last obs of the by group.
The data step itself is the loop here, there's no explicit do loop.
@sigma_exp wrote:
okay...thanks a lot-- I will work through the %PUT and see what it does. Much appreciated Reeza for the advice
Macro %PUT is not going to be of much value with that program. It is not using any macro variables.
You should the normal data step statement PUT instead. Then you can output values of the dataset variables.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.