Using Arrays and DO Loops: Do Over or Do I? Q&A, Slides, and On-Demand Recording

Watch this Ask the Expert session to learn how to use SAS Arrays and DO loops in the DATA step to improve programming efficiency.

Watch the Webinar

You will learn how to:

Use an indexed array and indexed DO loop.
Use a non-indexed array and DO over loop.
Use DO WHILE and DO UNTIL loops.

The questions from the Q&A segment held at the end of the webinar are listed below and the slides from the webinar are attached.

Q&A

Can you create an array with date time variables?

Yes. Datetime variables are just numeric variables, so they can go in an array like any other numeric variable; I have created arrays of dates and arrays of times. In the claims data set, for example, every claim has a claim date, and sometimes I need to keep track of the date an event occurred, so I store it just as a date. You don't need to indicate that it's a date or time variable until you format it.
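As a minimal sketch of dates in an array, the step below finds each patient's latest visit. The data set and variable names (visits, visit1-visit3, last_visit) are made up for illustration and are not from the webinar.

```sas
/* Hypothetical data: three visit dates per patient */
data visits;
   input id visit1 :date9. visit2 :date9. visit3 :date9.;
   datalines;
1 01JAN2020 15FEB2020 10MAR2020
2 05JAN2020 20APR2020 02FEB2020
;
run;

data visit_summary;
   set visits;
   array avisit[3] visit1-visit3;   /* dates are ordinary numeric variables */
   last_visit = .;
   do i = 1 to dim(avisit);
      /* a missing value compares lower than any date */
      if avisit[i] > last_visit then last_visit = avisit[i];
   end;
   format visit1-visit3 last_visit date9.;  /* the format only affects display */
   drop i;
run;
```

The ARRAY statement never mentions that the variables are dates; only the FORMAT statement does.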

Can you use "0" as a start?

Yes, you can use zero, but that reference is no array value. There's no element 0 in an array. The elements in the array start at a value of 1.

When do you prefer array over do loop?

I use them in conjunction with each other. Sometimes I only need do loops if I'm doing randomization. But if I'm referencing variables, then I absolutely utilize an array name with the do loop. You can do DO loops without an array, but you can't utilize variables in an array that I know of without a do loop.

Can the value of the length be stored in a macro variable?

I believe it can. Any reference to anything can be stored as a macro variable. Try it on a small data set. I try things all the time that don't work, and that's how I find out what does.

What if you don't know the number of elements or there are too many?

I've never come across a situation where there are too many. I've literally had thousands of variables in an array. I use a non-indexed array with DO OVER because I want to loop over all of the elements in the array. If you don't want to process every element, you need to create multiple arrays: one array and DO loop for the variables that get one set of operations, and another array with its own DO loop for the variables that get a different set. The variables can all be in one array if you want to perform the same operations on all of them, but if you don't, they have to be split up. For example, if I'm changing all of some value to missing, or all missings to 0, then I will use DO OVER.
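The missing-to-zero case can be sketched as follows. The data set and variable names (checklist, item1-item5, n_yes) are hypothetical.

```sas
/* Hypothetical checklist data: 1 = yes, 0 = no, . = not answered */
data checklist;
   input id item1-item5;
   datalines;
1 1 . 0 1 .
2 . . 1 1 1
;
run;

data checklist_scored;
   set checklist;
   array aitem item1-item5;        /* no subscript: a non-indexed array */
   do over aitem;                  /* visits every element in turn */
      if aitem = . then aitem = 0;
   end;
   n_yes = sum(of item1-item5);    /* count of items answered yes */
run;
```

Because the array is declared without a subscript, the loop never needs to know how many elements it holds.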

Do the variables always have to have the same prefix?

No. So for example, I created that same prefix, but they don't have to have the same prefix in this list of array elements. Here, they just happen to. That's how I keep track programming wise, but I could have just had MI, CHF, PVD, CVD, DEM. Those don't even have a prefix. So no, it can be any variable and you don't have to have a prefix.

Can we use two different arrays in one dataset?

Yes, absolutely. I do that all the time. Right here I'm using three different arrays within the same DO loop. You can have multiple arrays within the same DATA step, and therefore on the same data set, and reference them multiple times with different DO loops.

For this particular "do over" code, how can you ask SAS to break the loop once one condition is met?

I believe you can say IF HCCIMI = 1 THEN LEAVE, and that will end the loop once the condition is met. (LEAVE, rather than END, is the statement that exits a DO loop; END only closes the block.) I have never tried that, because the investigators I work with are sometimes interested in how many times the diagnosis appears, so I keep track of that and look through all 26 variables. But I believe you can use the LEAVE statement within an IF statement to exit once the condition is satisfied inside that DO loop.

Can you use array names as the list of array elements?

No. An array name must be something different than any other variable name in your data set. It is basically a collection of variables, so you can't reference an array. You can have multi-dimensional arrays. That might be a way to reference “multiple arrays within an array”. So if you had a set of variables that had a set of variables underneath it, that might be a way to do that.

Does the dataset have to be sorted to use an array?

No, not at all. My observations for a particular individual could be in multiple places. I do the operations in the data step and sort the data set after and then I perform procedures on that sorted data set.

Could you name a couple of instances where you use arrays? I am trying to figure out how to integrate arrays and do loops into my current SAS coding.

When I'm creating datasets, I use arrays and DO loops all the time for changing questionnaire values, or for reversing questionnaire values so that I can create a score. I'm doing work using the Maslach Burnout Inventory scale. Some of the items on that scale are reversed, so I used arrays to take those items, which are just variables in the data set, reverse them, and then use the reversed items when creating a score. I also take missing values in checklists and make them zeros so that I can count the number of items people said yes or no to without worrying about missing values. I use arrays in just about every program I create because they are so useful. If I'm performing the same operation more than four times, I'm going to use an array and a DO loop: I put all the variables I want to perform the operation on in the ARRAY statement, then use the DO loop to perform the operation on those elements. So anytime I have to do the same operation on multiple variables, I use an array.

When do you not need to indicate character type?

The $ is only needed if you are defining new character variables.  If the variables in the ARRAY are variables that have been previously defined as character variables in the data set, then you do not need the $.

How do we do the list_of_array_elements when we don't know how many there are?

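You still need a variable list in the ARRAY statement, but you don't need to count the variables yourself. Put an asterisk in the brackets so that SAS counts the variables in the list, and use the dim() function as the STOP value of the DO loop instead of hard-coding the number of elements.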
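A short sketch of dim() at work, assuming a made-up data set (labs) in which -9 is a placeholder for missing. The prefix list lab: stands for every variable whose name starts with "lab".

```sas
/* Hypothetical lab data; -9 is a placeholder for "not measured" */
data labs;
   input id lab1-lab4;
   datalines;
1 5 -9 7 -9
2 -9 3 4 6
;
run;

data labs_clean;
   set labs;
   array alab[*] lab:;          /* [*] asks SAS to count the variables in the list */
   do i = 1 to dim(alab);       /* dim() supplies that count as the STOP value */
      if alab[i] = -9 then alab[i] = .;
   end;
   drop i;
run;
```

If a fifth lab variable is added to the input data later, this step picks it up without any change to the code.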

Can we have the live webcast for SQL and macros?

We have covered those topics in previous Ask the Expert webinars:

How Do I Build a Macro Application?

Proc SQL or Proc FEDSQL: Which Should a Programmer Use?

Top 5 Handy PROC SQL Tips

Is there any EXIT statement in the DO loop?

You can use OR in your logic to end a loop, WHILE(A or B). You can also use the leave statement to exit a do loop.

This link shows two different approaches for exiting a do loop – the “leave” option and setting the loop iteration “x” to a value that will end the looping: https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lestmtsref/p1cydk5fq0u4bfn1xfbjt7w1c7lu.ht...

This BLOG also is helpful: https://blogs.sas.com/content/iml/2017/03/15/leave-continue-sas.html#:~:text=The%20answer%20is%20yes...
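Here is a minimal sketch of the LEAVE approach. The data set name (claims) and the four diagnosis variables are invented for illustration; the variable name hccimi and the codes '410' and '412' follow the MI example discussed in this thread.

```sas
/* Hypothetical claims with four diagnosis codes each */
data claims;
   length hsdiag1-hsdiag4 $5;
   input hsdiag1 $ hsdiag2 $ hsdiag3 $ hsdiag4 $;
   datalines;
4280 410 25000 412
25000 4280 496 4019
;
run;

data mi_flag;
   set claims;
   array ahsdiag[*] hsdiag1-hsdiag4;
   hccimi = 0;
   do i = 1 to dim(ahsdiag);
      if substr(ahsdiag[i], 1, 3) in ('410', '412') then do;
         hccimi = 1;
         leave;                 /* stop scanning once an MI code is found */
      end;
   end;
   drop i;
run;
```

For the first claim the loop stops at the second diagnosis; for the second claim it examines all four and hccimi stays 0.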

Can you use the colon abbreviation when you list the array elements?

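Yes. A colon after a name prefix acts like a wildcard: HSDIAG: refers to every variable whose name begins with HSDIAG. A colon inside the array brackets does something different: it separates the lower and upper bounds of the array subscript, as in {0:3}.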

This paper has a section “Specifying Lower and Upper Bounds of a Temporary Array” that provides a good explanation and several examples for creating an array with lower and upper bounds.  One example uses the colon to do that:  https://support.sas.com/resources/papers/97529_Using_Arrays_in_SAS_Programming.pdf

Here is SAS documentation where Example 4: Defining More Advanced Arrays provides good examples for using colons to define bounds:  https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lestmtsref/p08do6szetrxe2n136ush727sbuo.ht...
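As a small sketch of the bounds usage, the step below declares subscripts that run from 0 to 3 and loops over them with lbound() and hbound(). The array name, variable names, and initial values are illustrative only.

```sas
data _null_;
   /* Subscripts run from 0 to 3 instead of the default 1 to 4 */
   array x{0:3} x0-x3 (10 20 30 40);
   do i = lbound(x) to hbound(x);
      put 'Element ' i 'holds ' x{i};
   end;
run;
```

Using lbound() and hbound() rather than 1 and dim() keeps the loop correct whatever bounds the array was declared with.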

How can we use Do Loop in SELECT statement of PROC SQL?

You cannot put a DO loop inside PROC SQL; use macro code instead: https://communities.sas.com/t5/SAS-Programming/DO-loop-in-PROC-SQL/td-p/13486

How do I select Summarized value of Array Elements in PROC SQL select statement?

This community post gives directions on how to do this. https://www.sastipsbyhal.com/2012/01/proc-sql-select-values-into-macro.html

Did that substr() if-else logic capture all of the codes? I would think there are other CPT or Diag Codes satisfying those conditions that weren't the CPT or Diag Codes you were trying to capture.

Yes, the substr() logic does capture all of the codes of interest. If there were other CPT or diagnosis codes of interest, they would be included in the list. Since I am only looking in the diagnosis variables or in the CPT variables, only those conditions are captured for those variables.

For STOP value, does SAS have something like arrayname.length to call the length of the array?

You should be able to use the dim(arrayname) as the STOP value for the indexed array.

Is HCCIMI a new variable? Do we need to specify it out?

Yes, HCCIMI is a new variable. You do not need to specify it outside of the DO loop. New variables can be created within a DO loop and do not need to be specified outside of the DO loop. While you can specify the new variable outside the do loop, it is not necessary.

How do you write it if the variable doesn’t have a sequence order? If HSDIAG doesn’t have the number, do we have to list all the variables?

If the variables don’t have a sequence number, then you do have to list them out in the ARRAY statement.

Can it be "HSDIAG1", not necessarily "HSDIAG01"? SAS knows "HSDIAG10" is after "HSDIAG9"?

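Yes, it can be HSDIAG1. This is the first variable in the sequence of variables. In a numbered range list such as HSDIAG1-HSDIAG26, SAS orders the variables by the numeric suffix, so HSDIAG10 comes after HSDIAG9.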

Where do I write the code for people with hccimi=0? I found I need to initialize it before the array, but if I use an else statement in the array, things get wonky.

For people without an MI (HCCIMI=0), you can specify that outside of the DO loop that processes the ARRAY. I have found things get wonky as well when it is set to zero within the loop.

If the first iteration has a hccmi = 1 and the second loop has a 0, won’t it be overwritten with 0?

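Not that I have found. It would be overwritten if the loop assigned 0 in an ELSE branch, but as long as the IF-THEN statement only assigns 1 when the condition is met, later iterations that do not match leave the value alone. My results always indicate that it doesn't get overwritten.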
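One way to check this on a small data set is to set two flags side by side, one without an ELSE branch and one with. This is a sketch on made-up data; the names claims, flag_no_else, and flag_else are hypothetical.

```sas
/* Hypothetical claim in which the MI code is the first diagnosis, not the last */
data claims;
   length hsdiag1-hsdiag4 $5;
   input hsdiag1 $ hsdiag2 $ hsdiag3 $ hsdiag4 $;
   datalines;
410 4280 25000 4019
;
run;

data compare;
   set claims;
   array ahsdiag[*] hsdiag1-hsdiag4;
   flag_no_else = 0;            /* initialized once, before the loop */
   do i = 1 to dim(ahsdiag);
      /* No ELSE: the flag can only move from 0 to 1 */
      if substr(ahsdiag[i], 1, 3) in ('410', '412') then flag_no_else = 1;

      /* With ELSE: the flag is reassigned on every iteration,
         so it reflects only the last element examined */
      if substr(ahsdiag[i], 1, 3) in ('410', '412') then flag_else = 1;
      else flag_else = 0;
   end;
   drop i;
run;
```

On this claim flag_no_else ends as 1 and flag_else ends as 0, which is the overwriting behavior the question describes.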

Can the start, stop and increment be a variable or the result of a calculation?

Yes, START, STOP, and INCREMENT can be numbers, variables, or SAS expressions. These values are set upon entry into the DO loop and cannot be modified during the processing of the DO loop. However, the INDEX-VARIABLE can be changed within the loop. A SAS array is a temporary grouping of SAS variables under a single name.

See also: The Many Ways to Effectively Utilize Array Processing
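A minimal sketch, with made-up variable names, showing all three values coming from variables or expressions and showing that changing STOP inside the loop has no effect:

```sas
data _null_;
   first = 2;
   last  = 3 * 4;          /* STOP from a calculation: 12 */
   step  = first + 1;      /* INCREMENT from an expression: 3 */
   do i = first to last by step;
      last = 100;          /* no effect on the loop: STOP was fixed on entry */
      put i=;
   end;
run;
```

The log shows i=2, i=5, i=8, and i=11, even though last is set to 100 during the first iteration.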

Is the statement within each iteration of the do loop repeated for each row of the dataset?

Yes, the statement(s) within each iteration of a DO loop are repeated for each row of the dataset. All DATA STEP processing is done within a row of the dataset, and ARRAYs and DO loops are DATA STEP statements.

Is 'do while' better than indexed array because it saves the time, when hiccimi is 1, it jumps to the next row?

I believe you still need to use an indexed array with the DO WHILE as you have to scan through the elements of the array. However, a DO WHILE or DO UNTIL would likely save computer time compared to a DO OVER.

Can one have a do i = start to stop, but add an until (so that it breaks the loop if some condition show up?)

If you want to stop iterating if a certain condition occurs, there are two ways to do this:

1) You can use the DO statement with a WHILE clause to iterate as long as a condition is true. The condition is checked before each iteration, which implies that you should initialize the stopping condition prior to the loop.

2) You can use the iterative DO statement with an UNTIL clause to iterate until a condition becomes true. The UNTIL condition is evaluated at the end of each iteration, so you do not have to initialize the condition prior to the loop.

https://blogs.sas.com/content/iml/2011/09/07/loops-in-sas.html
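An iterative DO can carry a TO range and an UNTIL clause at the same time. The sketch below uses invented numbers (a balance growing 5% a year) purely to show the syntax: the loop ends at 30 iterations or as soon as the balance has doubled, whichever comes first.

```sas
data _null_;
   balance = 1000;
   /* Stop after 30 years, or earlier once the balance reaches 2000 */
   do year = 1 to 30 until (balance >= 2000);
      balance = balance * 1.05;
   end;
   put year= balance=;
run;
```

Replacing UNTIL with WHILE (balance < 2000) gives the same kind of early exit, with the condition tested at the top of each iteration instead of the bottom.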

How would you sort your data to do until?

You would have to sort it outside of the DATA STEP and then process in the DATA STEP.

proc sort data=datasetname; by sortvariables; run;

data datasetname;

set datasetname; by sortvariables;

array…..;

do…..;

How is the array loaded, since many variables are not sequenced by an integer at the end?

If the variables don’t have a sequence number, then you have to list them out one-by-one in the ARRAY statement.

How can we change all character values of 'X' to 'NO' and all numeric values of 0 to missing in different datasets?

An ARRAY and DO loop are DATA STEP statements. If you have to change variable values in multiple datasets then you have to use a DATA STEP for each data set. You may consider using macros in this instance to reference each dataset and each set of variables.
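A rough sketch of the macro approach, assuming hypothetical data sets site_a and site_b and a hypothetical macro named recode. The special lists _CHARACTER_ and _NUMERIC_ put every character or numeric variable defined so far into an array.

```sas
/* Hypothetical input data sets */
data site_a;
   length answer $2;
   input answer $ score;
   datalines;
X 0
Y 3
;
run;

data site_b;
   length answer $2;
   input answer $ score;
   datalines;
Y 0
X 5
;
run;

%macro recode(ds);
   data &ds._clean;
      set &ds;
      array achar[*] _character_;   /* all character variables read by SET */
      array anum[*]  _numeric_;     /* all numeric variables read by SET */
      do i = 1 to dim(achar);
         if achar[i] = 'X' then achar[i] = 'NO';
      end;
      do i = 1 to dim(anum);
         if anum[i] = 0 then anum[i] = .;
      end;
      drop i;
   run;
%mend recode;

%recode(site_a)
%recode(site_b)
```

Two points to watch: the loop counter i is created after the ARRAY statements, so it is not part of the numeric array, and a character variable must be at least 2 bytes long to hold 'NO' without truncation.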

If you don't set a length on the ARRAY statement for a character array, how does SAS assign the array element length? Does it automatically determine the longest value in the data, or is there a risk of truncation if length is not explicitly assigned?

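By default, variables that the ARRAY statement creates have a length of 8 bytes; variables that already exist keep the length they were defined with. SAS does not scan the data for the longest value, so a value longer than the assigned length is truncated. To specify a different length, include the desired length after the $ for character arrays and after the brackets for numeric arrays, as shown in these statements:

array name[3] $10 first last middle;

array weight[*] 5 weight1 - weight10;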

Notice the asterisk (*) inside the brackets in the WEIGHT array above. SAS must be able to determine the number of elements or variables in the array when it compiles the code. SAS determines this either by using the constant value that is specified in the brackets or by counting the number of variables in the variable list. When you want SAS to use the variable list, place an asterisk in the brackets.

Alternatively, you can leave off the variable list and specify the constant value between brackets, as shown in this example:

array weight[10] 5;

Because a variable list is not specified in this example, SAS uses the name of the array (WEIGHT) and adds a numeric suffix from 1 to 10 to associate or create the specified number of variables with the array. Note: SAS must be able to determine the number of elements or variables in the array when it compiles the code. Therefore, you cannot place a variable name in the brackets, as illustrated in the following statement, with the intent of SAS referencing the name to determine the number of elements:

array months[month_num];

http://support.sas.com/resources/papers/97529_Using_Arrays_in_SAS_Programming.pdf
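The truncation risk can be seen in a small sketch. The NAME array comes from the statements above; the CODE array and the assigned values are made up for illustration.

```sas
data _null_;
   array name[3] $10 first last middle;   /* new character variables, 10 bytes each */
   array code[2] $ code1 code2;           /* no length given: 8 bytes each */
   first = 'Christopher';                 /* 11 characters: stored as 'Christophe' */
   code1 = 'ABCDEFGHIJ';                  /* 10 characters: stored as 'ABCDEFGH' */
   put first= code1=;
run;
```

SAS does not issue an error for the truncated assignments, so it is worth setting the length explicitly whenever values may exceed 8 characters.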

In the iterative DO loop, can you have different conditions (i.e, indexing different variables MI and Stroke) within the same DO loop?

Yes, you can look for different values of the variables in the ARRAY and if your condition is met (e.g., whatever code you are looking for in the substr() statement) you can create a new variable. I went over an example of this for both MI and CHF where I assigned a value of 1 to the MI variable if the condition for MI was met, and I assigned a value of 1 to the CHF variable if the condition for CHF was met.

Can we use arrays to group columns we process often, especially with LAB data?

This is essentially what an ARRAY does. Variables are columns of a data set, so if you have lab data that use the same variable names from data set to data set, then you can have a standard set of code that sets up the ARRAY and then processes the statements you need to process over the ARRAY with a DO loop.

How do we know if ahiscd (I hope I got this right)= I or D? It's a new array variable.

This is just another ARRAY of a list of variables that contain values of “I” or “D”. This was how the variable was defined in the data set I used. I found this information in my documentation of the data sets I used.

Can you use arrays to go from long-skinny to short-wide?

Yes, see details here https://www.lexjansen.com/wuss/2008/app/app02.pdf
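A minimal sketch of the long-to-wide direction, using a hypothetical data set with one row per patient visit. The names long, wide, visit, and weight1-weight3 are assumptions for illustration.

```sas
/* Hypothetical long data: one row per id and visit, sorted by id */
data long;
   input id visit weight;
   datalines;
1 1 150
1 2 152
1 3 149
2 1 180
2 3 178
;
run;

data wide;
   set long;
   by id;                                  /* requires data sorted by id */
   array awt[3] weight1-weight3;
   retain weight1-weight3;                 /* keep values across rows of the same id */
   if first.id then call missing(of awt[*]);
   awt[visit] = weight;                    /* visit number picks the column */
   if last.id then output;                 /* one row per id */
   drop visit weight;
run;
```

Patient 2 has no second visit, so weight2 stays missing on that patient's output row.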

If a dataset has 100 records/obs. What do you consider an iteration or LOOP? Is an iteration one observation or all 100?

An iteration of the loop is done within each (one) observation/record, as it is done within a DATA STEP.

Can you speak more to strata='low', 'medium', 'high'?

These were values that were used to iterate through the DO loop. You can define character levels in a DO loop by listing the values in single or double quotes. Here my index variable in the DO loop is named strata and it steps through the loop for values of “low”, “medium” and “high”.
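A short sketch of a DO loop that steps through a list of character values. The LENGTH statement is an addition for safety: declaring the index variable up front ensures the longer values are not cut to the length of the first one.

```sas
data _null_;
   length strata $6;                         /* long enough for 'medium' */
   do strata = 'low', 'medium', 'high';      /* values are separated by commas */
      put strata=;
   end;
run;
```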

Is order of the array elements important?

Yes, if you are referencing a particular element within multiple arrays. For example, let's say I want to reverse 4 items of a questionnaire by creating a set of new reversed-value variables. If I am in the third iteration of a DO loop, I am reversing the third element of the first array and assigning it to the third element of the second array. If you do not make sure that the variables are in the same order, you won't be assigning the correct value to the correct new variable.

The correct setup, which assigns the reversed value of q4 to rq4, q8 to rq8, q12 to rq12, and q16 to rq16:

data correct;

set q;

array aoriginal q4 q8 q12 q16;

array areverse rq4 rq8 rq12 rq16;

do over aoriginal;

areverse=5-aoriginal;

end;

run;

versus the incorrect setup, which assigns the reversed value of q4 to rq8, q8 to rq4, q12 to rq16, and q16 to rq12:

data incorrect;

set q;

array aoriginal q4 q8 q12 q16;

array areverse rq8 rq4 rq16 rq12;

do over aoriginal;

areverse=5-aoriginal;

end;

run;

Do the variables you want to use in an array have to be next to each other to use the dash "-" when listing variables? Are there shortcuts instead of listing every variable if the variables do not have the same prefixes?

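No. The variables do not have to be next to each other in the data set to use the dash. You can use the colon after the prefix (as I learned just today!) to select all variables with that specific prefix: HSDIAG: versus HSDIAG1-HSDIAG26. If the variables share no prefix, a name range list with a double dash (firstvar--lastvar) selects variables by their position in the data set, and _NUMERIC_ or _CHARACTER_ selects them by type.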

When using Enterprise Guide, I write "dim(*)" on an array and get a warning from SAS. Does it mean that SAS won't take the dim() function?

https://stackoverflow.com/questions/54589665/sas-dim-and-macro-variables

Can you talk about using the DIM function in arrays? When they should use, etc.

https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.2/lefunctionsref/p18toxpk18mlr1n1x7yihyj8o8j...

Can the short-wide to long-skinny example code be used as an alternative to PROC TRANSPOSE?

Yes.
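For reference, a wide-to-long step of this kind can be sketched as follows. The data set and variable names are hypothetical and are not the ones used in the presentation.

```sas
/* Hypothetical wide data: one row per id, one column per visit */
data wide;
   input id weight1-weight3;
   datalines;
1 150 152 149
2 180 . 178
;
run;

data long;
   set wide;
   array awt[3] weight1-weight3;
   do visit = 1 to dim(awt);
      weight = awt[visit];
      output;                    /* one output row per array element */
   end;
   keep id visit weight;
run;
```

Adding a condition such as "if not missing(weight) then output;" drops the empty visits, which takes an extra step with PROC TRANSPOSE.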

Any tips on using multiple variables in a WHERE statement? Can an array be used?

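An array cannot be used in a WHERE statement. An ARRAY exists only while the DATA STEP is running, whereas a WHERE expression is evaluated against the variables of the input data set before the observation is read in. To subset on array elements, use a subsetting IF statement inside the DATA STEP instead.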

In providing the slides, can a little more information be provided on the actual 'analytic problems' these tools solve?

This is outside the purview of this presentation.

Is there a limit to the number of array elements?

No, I don’t believe there is. As I indicated in the presentation, I have used thousands of variables in an ARRAY statement.

How could I skip the do loop operation on a variable when its value is missing? For example, using the variable as the denominator.

https://communities.sas.com/t5/SAS-Programming/Skip-few-iterations-when-condition-met/td-p/436226
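One way to do this is the CONTINUE statement, which skips the rest of the current iteration and moves on to the next one. The sketch below uses made-up numerator and denominator variables.

```sas
/* Hypothetical data: the second denominator is missing, the third is zero */
data ratios;
   input num1-num3 den1-den3;
   datalines;
10 20 30 2 . 0
;
run;

data ratios_out;
   set ratios;
   array anum[3]   num1-num3;
   array aden[3]   den1-den3;
   array aratio[3] ratio1-ratio3;
   do i = 1 to dim(anum);
      /* skip this element when the denominator is unusable */
      if missing(aden[i]) or aden[i] = 0 then continue;
      aratio[i] = anum[i] / aden[i];
   end;
   drop i;
run;
```

Only ratio1 is calculated here; ratio2 and ratio3 stay missing and no division-by-zero note is written to the log.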

What’s the advantage of using dim(arrayname)?

You can use an indexed DO loop without knowing the dimension of the array and allowing SAS to determine it.

If working on multiple arrays, how do you decide the order of the arrays referenced in the do loop?

Because you are processing through a DATA STEP, the order of statements does matter, just as it matters outside of an ARRAY and DO loop.

Can we use _char_ or _numeric_ in array to change all character or numeric values?

See pages 12 – 14 http://support.sas.com/resources/papers/97529_Using_Arrays_in_SAS_Programming.pdf

Should the arrays be in the same data set or in different data sets?

All array processing is done within a DATA STEP for a particular data set, so if you need to perform the operations on multiple data sets you can either use macros to reference the multiple data sets or you need to copy the code from one DATA STEP to the other.

I understand iteration, but please explain what index means?

The index is the variable name assigned to how you are referencing a particular element in an ARRAY.

DO stratum=1 to 3;

Here the index variable is named “stratum”.

Do I have to order variables like v1 v2 v3?

No. You can use any valid variable name in an ARRAY statement.

In the early examples, if hccimi corresponds to hsdiag[i], shouldn't hccimi be defined as an array also - hccimi[i]?

You don’t have to, but you can. You would need to set up an ARRAY for the HCCIMI variables (ARRAY ahccimi {26} hccimi1-hccimi26;) This would then assign a value of 1 to hccimi1 if the hsdiag1 variable contained the code for an MI.

Recommended Resources

A Beginners Guide to Using Arrays and DO loops

SAS Tutorial | How to Restructure Your Data Using Arrays and DO Loops

Please see additional resources in the attached slide deck.

Want more tips? Be sure to subscribe to the Ask the Expert board to receive follow up Q&A, slides and recordings from other SAS Ask the Expert webinars.

mkeintz · ‎09-05-2022

Regarding the question:

"Can you use "0" as a start?

Yes, you can use zero, but that reference is no array value. There's no element 0 in an array. The elements in the array start at a value of 1."

Either the question or the response needs more context.

I presume the statement "There's no element 0 in an array. The elements in the array start at a value of 1." refers to the automatic naming convention used when no variable names are assigned to the array, as in:

array x {0:3};

which assigns variables x1, x2, x3, and x4 to array elements x{0}, ... x{3}. So yes, there is an "element 0" in the array, but there is no corresponding X0 in the associated variables.

Of course, the array index and corresponding variable names can be aligned simply by

array x {0:3} x0-x3;

Tom · ‎09-05-2022

Thanks for presenting this. Here are some points about arrays, indexing and DO loops that I hope will clarify some of this:

SAS only has two types of variables, floating point numbers and fixed length character strings. An array is NOT a new type of variable. It is just syntax added to the data step to allow referencing variables indirectly.

There are two ways to reference a variable in the array: using an explicit index (A[3]) or an implicit index (A). With the implicit indexing there is an index variable associated with the array that determines which variable in the array an implicit reference means. The default variable is _I_, but you can define a different name in the array statement. This index variable will not become part of any output dataset (like other automatic variables such as _N_).

The DO OVER loop is an example of an implicit index. So

do over A; A=A+1; end;

is the same as

do _i_=1 to dim(a); A[_i_]=A[_i_]+1; end;

Each array can be referenced by just one of the two ways in a data step. Also, if you define the array using an explicit dimension (even if that dimension is *)

array A [*] A1-A5;

then you must reference the variables in the array explicitly. But if you define the array without the explicit dimension

array A A1-A5;

, you can still reference it with an explicit index if you want, you just can’t reference it both ways.

The DO statement is more flexible than this paper describes.

specification

denotes an expression or a series of expressions in this form:

start <TO stop> <BY increment> <WHILE(expression) | UNTIL(expression)>

So using that power many of the example data steps in the presentation could be a lot simpler.

do i=1 to dim(ahsdiag) while (hccimi=0);
  if substr(ahsdiag[i],1,3) in ('410','412') then hccimi=1;
end;
