Solved: Programming Efficiency with Do Loop

LEINAARE · Posted 11-15-2019 09:55 AM

Hello,

I have a series of repetitive IF, THEN, ELSE statements in my data step (shown below). The program works fine, but I would like to be more efficient in my programming.

if year=2012 and eligible=1 then ELIG_2012 + 1; else if year ne 2012 then ELIG_2012=.;
if year=2013 and eligible=1 then ELIG_2013 + 1; else if year ne 2013 then ELIG_2013=.;
if year=2014 and eligible=1 then ELIG_2014 + 1; else if year ne 2014 then ELIG_2014=.;
if year=2015 and eligible=1 then ELIG_2015 + 1; else if year ne 2015 then ELIG_2015=.;
if year=2016 and eligible=1 then ELIG_2016 + 1; else if year ne 2016 then ELIG_2016=.;
if year=2017 and eligible=1 then ELIG_2017 + 1; else if year ne 2017 then ELIG_2017=.;
if year=2018 and eligible=1 then ELIG_2018 + 1; else if year ne 2018 then ELIG_2018=.;

I am trying to develop a DO loop to perform this, but cannot figure out how to increment the ELIG_ series variable names. Could anyone offer some help with this?

Thanks,

Ted

Tom · Posted 11-15-2019 11:24 AM

You seemed to have created a strange/bloated structure. You have both multiple rows and multiple columns.

Assuming that there is only one "1" in those gazillion variables I think you just need to add the columns.

elig_2012 = sum(elig_2012,of jan_2012--dec_2012);
elig_2013 = sum(elig_2013,of jan_2013--dec_2013);
...

Note the positional variable list will only work if the variables are actually defined in that order.

You potentially could use ARRAYS or macro code to generate those statements, but since you probably have a finite set of years it might just be easiest to just use copy and paste to create the multiple statements.

Looks you are counting how many months in that year the patient had. You probably will need to add code to reset the new variables when you start a new patient id.

if first.id then call missing(of elig_2012-elig_2019);

View solution in original post

Kurt_Bremser · Posted 11-15-2019 10:02 AM

Summarize eligibility by year, then transpose with id year;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

LEINAARE · Posted 11-15-2019 10:31 AM

Thank you for the suggestion. This would work if my data were structured differently. But the dataset is wide, with individual variables for each month (Jan - Dec) of each year (2012-2018), equaling 84 dichotomous variables for month. This is why I created accumulating variables to count the frequency of observations per year for each individual. I could sum by individual using the means procedure, and add the values (0/1) of the month variables for each year though.

Thanks,

Ted

Tom · Posted 11-15-2019 10:37 AM

@LEINAARE wrote:

Thank you for the suggestion. This would work if my data were structured differently. But the dataset is wide, with individual variables for each month (Jan - Dec) of each year (2012-2018), equaling 84 dichotomous variables for month. This is why I created accumulating variables to count the frequency of observations per year for each individual. I could sum by individual using the means procedure, and add the values (0/1) of the month variables for each year though.

Thanks,

Ted

Please explain what you are trying to generate. Are you trying to determine eligibility for a YEAR based on some count of eligibility for a month? If so what is the logic? Is it any month? all months? some minimum number of months?

Is the goal just to set that year level eligibility flag? Or are you trying to count number of patients per year?

To me it looks like you want to get the denominator for your usage counts. There were XX patients diagnosed with condition YY in a particular time period out of ZZ patients who had insurance coverage over that time period.

LEINAARE · Posted 11-15-2019 10:50 AM

Hi @Tom,

I am creating cohorts from Medicaid eligibility data. In the final dataset (which I have already generated and checked for accuracy), each individual will have one observation per months from January 2012 - December 2018. For months that individuals do not have eligibility, an eligibility flag is coded zero (Eligible = 0). I use the MEANS procedure to sum the dichotomized values for each of the 84 month variables, and then TRANSPOSE to convert the dataset from wide to long. All of this has already been done.

What I want to do is generate a count of eligible months by year, so that there are 7 variables with the count of eligible months an individual had for the corresponding year. I could use another MEANS procedure and sum by individual and year, but I am using a data step to perform other manipulations, and wanted to output count variables using the same data step.

ballardw · Posted 11-15-2019 11:48 AM

@LEINAARE wrote:

Hi @Tom,

I am creating cohorts from Medicaid eligibility data. In the final dataset (which I have already generated and checked for accuracy), each individual will have one observation per months from January 2012 - December 2018. For months that individuals do not have eligibility, an eligibility flag is coded zero (Eligible = 0). I use the MEANS procedure to sum the dichotomized values for each of the 84 month variables, and then TRANSPOSE to convert the dataset from wide to long. All of this has already been done.

84 Month variables means that you may have already gone a LONG way down the "I'm stuck in Excel path".

If you have a date value and want to summarize a variable (or 10 or ..) you might want to consider using a FORMAT to create the groups fairly directly. This goes from "long" to "long" with a count of months (sum as you did) with eligible=1. The value of DATE is still a date, it just has a different display value after the summary. The actual underlying date value would usually be the first date for a year encountered in the data. The Drop removes a variable, _freq_ that would indicate how many records were used for each ID and might be important if you are only interested in complete years (i.e. 12 monthly values).

data example;
   do id = 1 to 3;
   do year= 2012 to 2018;
   do month = 1 to 12;  
      date = mdy(month,1,year);
      eligible = rand('integer',2) -1;
      output;
   end;
   end;
   end;
run;

proc summary data=example nway;
   class id date;
   format date year4.;
   var eligible;
   output out=work.eligcount (drop= _:) sum=EligibleMonths;
run;

Sometimes, especially when dealing with efficiency, it may be better to use one of the appropriate procedures and merge data back then to spend the time working on possibly quite complex logic to keep track of things through a data step.

Astounding · Posted 11-15-2019 10:20 AM

Here's an approach, but it requires a little learning. That can't be a bad thing, can it?

array elig {2012:2018} elig_2012 - elig_2018;
if (2012 <= year <= 2018) then do;
   if eligible=1 then elig{year} = 1;
end;

I didn't address the "else" portion of your logic yet (setting variables back to missing), since I'm not sure it's the right logic to apply. But if you are sure, we can revisit that.

Also, I changed elig + 1 to elig = 1. I'm presuming that values of 1 or missing would be sufficient, but perhaps we need to revisit that as well.

Finally, we should consider what you are trying to achieve here. It's possible you could calculate the values you are looking for quite easily with something like this:

proc means data=have sum n nmiss;
   var eligible;
   class year;
run;

ballardw · Posted 11-15-2019 10:36 AM

Adding to @Astounding

If the set to missing for those variables is actually needed then it appears one approach would be to set all of them missing and then use the assignment to set the needed value. SAS provides a nice function CALL MISSING that allows setting many variables to the appropriate missing value with one call, in this case all of the array elements.

array elig {2012:2018} elig_2012 - elig_2018;
call missing ( of elig(*) );
if (2012 <= year <= 2018) then do;
   if eligible=1 then elig{year} = 1;
end;

But as others have said, I think for overall efficiency you should share some example data and describe what the desired output and logic is.

Streamlining poorly designed code isn't necessarily really efficient.

LEINAARE · Posted 11-15-2019 10:42 AM

Hi @Astounding ,

Thank you for your coding suggestion, using the array with a DO statement. This seems to be on track. I do need each variable (Elig_2012 - Elig_2018) to be accumulating counts that reset to missing at the end of each individual and year. The data are structured like this.

ID Date ….. Jan2012 Feb 2012 Mar2012 Apr2012 …. Jan2013-Dec2013 … Jan2014-Dec2014 … Jan2015-Dec2015 ...

1 Jan2012 1 0 0 0 0 0 0 0 0 0

1 Feb2012 0 1 0 0 0 0 0 0 0 0

1 Mar2012 0 0 1 0 0 0 0 0 0 0

1 Apr2012 0 0 0 1 0 0 0 0 0 0

.

1 Nov2015 0 0 0 0 0 0 0 0 0 0

1 Dec2015 0 0 0 0 0 0 0 0 0 1

Astounding · Posted 11-15-2019 10:50 AM

It can all be done, but you need to provide a clearer picture. So far, it looks like you don't have variables named YEAR or ELIGIBLE. Perhaps you could clear up the picture of what variables already exist, and then show one of the new variables such as ELIG_2015 and what values it should take on.

LEINAARE · Posted 11-15-2019 11:04 AM

Responding to @Astounding,

Sure! Sorry for the ambiguity. Below is perhaps a more comprehensive look at the structure.

ID Date Year Jan2012--Dec2012 Jan2013 Feb2013 Mar2013 Apr2013 …. Elig_2012 Elig_2013

1 01Jan2013 2013 0 ... 0 1 0 0 0 . 1

1 01Feb2013 2013 0 … 0 0 1 0 0 . 2

1 01Mar2013 2013 0 … 0 0 0 1 0 . 3

1 01Apr2013 2013 0 … 0 0 0 0 1 . 4

1 01May2013 2013 0 … 0 …………………………………………….... . 5

1 01Jun2013 2013 0 … 0 …………………………………………….... . 6

1 01Jul2013 2013 0 … 0 …………………………………………….... . 7

1 01Aug2013 2013 0 … 0 …………………………………………….... . 8

1 01Sep2013 2013 0 ... 0 …………………………………………….... . 9

1 01Dec2013 2013 0 ... 0 …………………………………………….... . 10

Tom · Posted 11-15-2019 11:24 AM

You seemed to have created a strange/bloated structure. You have both multiple rows and multiple columns.

Assuming that there is only one "1" in those gazillion variables I think you just need to add the columns.

elig_2012 = sum(elig_2012,of jan_2012--dec_2012);
elig_2013 = sum(elig_2013,of jan_2013--dec_2013);
...

Note the positional variable list will only work if the variables are actually defined in that order.

You potentially could use ARRAYS or macro code to generate those statements, but since you probably have a finite set of years it might just be easiest to just use copy and paste to create the multiple statements.

Looks you are counting how many months in that year the patient had. You probably will need to add code to reset the new variables when you start a new patient id.

if first.id then call missing(of elig_2012-elig_2019);

Astounding · Posted 11-15-2019 11:26 AM

Here's the mildly longer way just in case it is needed.

data want;
set have;
by id year;
array elig {2012:2018} elig_2012 - elig_2018;
retain elig_2012 - elig_2018;
if first.year then call missing (of elig{*} ) ;
select (year);
   when (2012) increment = max(Jan2012, Feb2012, Mar2012, Apr2012, 
          May2012, Jun2012, Jul2012, Aug2012, 
          Sep2012, Oct2012, Nov2012, Dec2012);

   when (2013) increment = max(Jan2013, Feb2013, Mar2013, Apr2013, 
          May2013, Jun2013, Jul2013, Aug2013, 
          Sep2013, Oct2013, Nov2013, Dec2013);

   when (2014) increment = max(Jan2014, Feb2014, Mar2014, Apr2014, 
          May2014, Jun2014, Jul2014, Aug2014, 
          Sep2014, Oct2014, Nov2014, Dec2014);

......

   when (2018) increment = max(Jan2012, Feb2012, Mar2012, Apr2012, 
          May2018, Jun2018, Jul2018, Aug2018, 
          Sep2018, Oct2018, Nov2018, Dec2018);

   otherwise;
end;
if (2012 <= year <= 2018) then elig {year} + increment;
run;

Could the code be shortened? There are a lot of "it depends". It depends on the order in which the variables were created. It depends on how comfortable you feel using macro language. (That wouldn't shorten the code, but it would shorten the amount of typing required to generate an identical program.) It depends on your comfort level using two-dimensional arrays. So for now here is the long, spelled out version.

LEINAARE · Posted 11-15-2019 11:52 AM

Thanks for the consideration on this. I will definitely consider these techniques if the future.

Kurt_Bremser · Posted 11-15-2019 02:07 PM

TRANSPOSE.

I mean it. Really. Stop doing Excel.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Programming Efficiency with Do Loop

Re: Programming Efficiency with Do Loop

Re: Programming Efficiency with Do Loop

Re: Programming Efficiency with Do Loop

Re: Programming Efficiency with Do Loop

Re: Programming Efficiency with Do Loop

Re: Programming Efficiency with Do Loop

Re: Programming Efficiency with Do Loop

Re: Programming Efficiency with Do Loop

Re: Programming Efficiency with Do Loop

Re: Programming Efficiency with Do Loop

Re: Programming Efficiency with Do Loop

Re: Programming Efficiency with Do Loop

Re: Programming Efficiency with Do Loop

Re: Programming Efficiency with Do Loop

Re: Programming Efficiency with Do Loop

SAS Innovate 2025: Call for Content