BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
LEINAARE
Obsidian | Level 7

Hello,

 

I have a series of repetitive IF, THEN, ELSE statements in my data step (shown below).  The program works fine, but I would like to be more efficient in my programming.

 

if year=2012 and eligible=1 then ELIG_2012 + 1; else if year ne 2012 then ELIG_2012=.;
if year=2013 and eligible=1 then ELIG_2013 + 1; else if year ne 2013 then ELIG_2013=.;
if year=2014 and eligible=1 then ELIG_2014 + 1; else if year ne 2014 then ELIG_2014=.;
if year=2015 and eligible=1 then ELIG_2015 + 1; else if year ne 2015 then ELIG_2015=.;
if year=2016 and eligible=1 then ELIG_2016 + 1; else if year ne 2016 then ELIG_2016=.;
if year=2017 and eligible=1 then ELIG_2017 + 1; else if year ne 2017 then ELIG_2017=.;
if year=2018 and eligible=1 then ELIG_2018 + 1; else if year ne 2018 then ELIG_2018=.;

I am trying to develop a DO loop to perform this, but cannot figure out how to increment the ELIG_ series variable names.  Could anyone offer some help with this?

 

Thanks,

 

Ted

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

You seemed to have created a strange/bloated structure.  You have both multiple rows and multiple columns.

 

Assuming that there is only one "1" in those gazillion variables I think you just need to add the columns.

elig_2012 = sum(elig_2012,of jan_2012--dec_2012);
elig_2013 = sum(elig_2013,of jan_2013--dec_2013);
...

Note the positional variable list will only work if the variables are actually defined in that order.

 

You potentially could use ARRAYS or macro code to generate those statements, but since you probably have a finite set of years it might just be easiest to just use copy and paste to create the multiple statements.

 

Looks you are counting how many months in that year the patient had. You probably will need to add code to reset the new variables when you start a new patient id.

if first.id then call missing(of elig_2012-elig_2019);

 

View solution in original post

14 REPLIES 14
LEINAARE
Obsidian | Level 7

Thank you for the suggestion.  This would work if my data were structured differently.  But the dataset is wide, with individual variables for each month (Jan - Dec) of each year (2012-2018), equaling 84 dichotomous variables for month.  This is why I created accumulating variables to count the frequency of observations per year for each individual.  I could sum by individual using the means procedure, and add the values (0/1) of the month variables for each year though.

 

Thanks,

 

Ted

Tom
Super User Tom
Super User

@LEINAARE wrote:

Thank you for the suggestion.  This would work if my data were structured differently.  But the dataset is wide, with individual variables for each month (Jan - Dec) of each year (2012-2018), equaling 84 dichotomous variables for month.  This is why I created accumulating variables to count the frequency of observations per year for each individual.  I could sum by individual using the means procedure, and add the values (0/1) of the month variables for each year though.

 

Thanks,

 

Ted


Please explain what you are trying to generate.  Are you trying to determine eligibility for a YEAR based on some count of eligibility for a month?  If so what is the logic? Is it any month? all months?  some minimum number of months?

 

Is the goal just to set that year level eligibility flag?  Or are you trying to count number of patients per year?

 

To me it looks like you want to get the denominator for your usage counts.  There were XX patients diagnosed with condition YY in a particular time period out of ZZ patients who had insurance coverage over that time period.

LEINAARE
Obsidian | Level 7

Hi @Tom,

 

I am creating cohorts from Medicaid eligibility data.  In the final dataset (which I have already generated and checked for accuracy), each individual will have one observation per months from January 2012 - December 2018.  For months that individuals do not have eligibility, an eligibility flag is coded zero (Eligible = 0).  I use the MEANS procedure to sum the dichotomized values for each of the 84 month variables, and then TRANSPOSE to convert the dataset from wide to long.  All of this has already been done.

 

What I want to do is generate a count of eligible months by year, so that there are 7 variables with the count of eligible months an individual had for the corresponding year.  I could use another MEANS procedure and sum by individual and year, but I am using a data step to perform other manipulations, and wanted to output count variables using the same data step.

 

 

ballardw
Super User

@LEINAARE wrote:

Hi @Tom,

 

I am creating cohorts from Medicaid eligibility data.  In the final dataset (which I have already generated and checked for accuracy), each individual will have one observation per months from January 2012 - December 2018.  For months that individuals do not have eligibility, an eligibility flag is coded zero (Eligible = 0).  I use the MEANS procedure to sum the dichotomized values for each of the 84 month variables, and then TRANSPOSE to convert the dataset from wide to long.  All of this has already been done.

 

 

 


84 Month variables means that you may have already gone a LONG way down the "I'm stuck in Excel path".

 

If you have a date value and want to summarize a variable (or 10 or ..) you might want to consider using a FORMAT to create the groups fairly directly. This goes from "long" to "long" with a count of months (sum as you did) with eligible=1. The value of DATE is still a date, it just has a different display value after the summary. The actual underlying date value would usually be the first date for a year encountered in the data. The Drop removes a variable, _freq_ that would indicate how many records were used for each ID and might be important if you are only interested in complete years (i.e. 12 monthly values).

 

data example;
   do id = 1 to 3;
   do year= 2012 to 2018;
   do month = 1 to 12;  
      date = mdy(month,1,year);
      eligible = rand('integer',2) -1;
      output;
   end;
   end;
   end;
run;

proc summary data=example nway;
   class id date;
   format date year4.;
   var eligible;
   output out=work.eligcount (drop= _:) sum=EligibleMonths;
run;

Sometimes, especially when dealing with efficiency, it may be better to use one of the appropriate procedures and merge data back then to spend the time working on possibly quite complex logic to keep track of things through a data step.

Astounding
PROC Star

Here's an approach, but it requires a little learning.  That can't be a bad thing, can it?

 

array elig {2012:2018} elig_2012 - elig_2018;
if (2012 <= year <= 2018) then do;
   if eligible=1 then elig{year} = 1;
end;

I didn't address the "else" portion of your logic yet (setting variables back to missing), since I'm not sure it's the right logic to apply.  But if you are sure, we can revisit that.

 

Also, I changed elig + 1 to elig = 1.  I'm presuming that values of 1 or missing would be sufficient, but perhaps we need to revisit that as well.

 

Finally, we should consider what you are trying to achieve here.  It's possible you could calculate the values you are looking for quite easily with something like this:

 

proc means data=have sum n nmiss;
   var eligible;
   class year;
run;
ballardw
Super User

Adding to @Astounding 

If the set to missing for those variables is actually needed then it appears one approach would be to set all of them missing and then use the assignment to set the needed value. SAS provides a nice function CALL MISSING that allows setting many variables to the appropriate missing value with one call, in this case all of the array elements.

 

array elig {2012:2018} elig_2012 - elig_2018;
call missing ( of elig(*) );
if (2012 <= year <= 2018) then do;
   if eligible=1 then elig{year} = 1;
end;

But as others have said, I think for overall efficiency you should share some example data and describe what the desired output and logic is.

Streamlining poorly designed code isn't necessarily really efficient.

LEINAARE
Obsidian | Level 7

Hi @Astounding ,

 

Thank you for your coding suggestion, using the array with a DO statement.  This seems to be on track.  I do need each variable (Elig_2012 - Elig_2018) to be accumulating counts that reset to missing at the end of each individual and year.  The data are structured like this.

 

ID     Date   ….. Jan2012  Feb 2012  Mar2012  Apr2012 …. Jan2013-Dec2013 … Jan2014-Dec2014 … Jan2015-Dec2015 ...

1     Jan2012           1               0              0              0                  0           0                     0           0                   0              0

1     Feb2012           0               1              0              0                  0           0                     0           0                   0              0

1     Mar2012           0               0              1               0                  0           0                     0           0                   0              0

1     Apr2012            0               0              0               1                  0           0                     0           0                   0              0  

.

.

.      

1     Nov2015            0               0              0               0                  0           0                     0           0                   0              0

1     Dec2015            0               0              0               0                  0           0                     0           0                   0              1

 

 

Astounding
PROC Star

It can all be done, but you need to provide a clearer picture.  So far, it looks like you don't have variables named YEAR or ELIGIBLE.  Perhaps you could clear up the picture of what variables already exist, and then show one of the new variables such as ELIG_2015 and what values it should take on.

LEINAARE
Obsidian | Level 7

Responding to @Astounding,

 

Sure!  Sorry for the ambiguity.  Below is perhaps a more comprehensive look at the structure.

 

ID      Date            Year      Jan2012--Dec2012   Jan2013  Feb2013  Mar2013  Apr2013 …. Elig_2012   Elig_2013

1    01Jan2013      2013           0     ...      0                1              0              0               0                 .                   1

1    01Feb2013      2013          0          0                0              1              0               0                 .                   2

1    01Mar2013      2013          0          0                0               0             1               0                 .                   3

1    01Apr2013      2013           0          0                0               0             0               1                 .                   4

1    01May2013     2013           0          0       ……………………………………………....            .                   5

1    01Jun2013      2013           0          0       ……………………………………………....            .                   6

1    01Jul2013       2013           0          0       ……………………………………………....            .                   7

1    01Aug2013     2013           0          0       ……………………………………………....            .                   8

1    01Sep2013     2013           0     ...       0       ……………………………………………....            .                   9

1    01Dec2013     2013           0     ...       0       ……………………………………………....            .                   10

Tom
Super User Tom
Super User

You seemed to have created a strange/bloated structure.  You have both multiple rows and multiple columns.

 

Assuming that there is only one "1" in those gazillion variables I think you just need to add the columns.

elig_2012 = sum(elig_2012,of jan_2012--dec_2012);
elig_2013 = sum(elig_2013,of jan_2013--dec_2013);
...

Note the positional variable list will only work if the variables are actually defined in that order.

 

You potentially could use ARRAYS or macro code to generate those statements, but since you probably have a finite set of years it might just be easiest to just use copy and paste to create the multiple statements.

 

Looks you are counting how many months in that year the patient had. You probably will need to add code to reset the new variables when you start a new patient id.

if first.id then call missing(of elig_2012-elig_2019);

 

Astounding
PROC Star

Here's the mildly longer way just in case it is needed.

 

data want;
set have;
by id year;
array elig {2012:2018} elig_2012 - elig_2018;
retain elig_2012 - elig_2018; if first.year then call missing (of elig{*} ) ; select (year); when (2012) increment = max(Jan2012, Feb2012, Mar2012, Apr2012, May2012, Jun2012, Jul2012, Aug2012, Sep2012, Oct2012, Nov2012, Dec2012); when (2013) increment = max(Jan2013, Feb2013, Mar2013, Apr2013, May2013, Jun2013, Jul2013, Aug2013, Sep2013, Oct2013, Nov2013, Dec2013); when (2014) increment = max(Jan2014, Feb2014, Mar2014, Apr2014, May2014, Jun2014, Jul2014, Aug2014, Sep2014, Oct2014, Nov2014, Dec2014); ...... when (2018) increment = max(Jan2012, Feb2012, Mar2012, Apr2012, May2018, Jun2018, Jul2018, Aug2018, Sep2018, Oct2018, Nov2018, Dec2018); otherwise; end; if (2012 <= year <= 2018) then elig {year} + increment; run;

Could the code be shortened?  There are a lot of "it depends".  It depends on the order in which the variables were created.  It depends on how comfortable you feel using macro language.  (That wouldn't shorten the code, but it would shorten the amount of typing required to generate an identical program.)  It depends on  your comfort level using two-dimensional arrays.  So for now here is the long, spelled out version.

LEINAARE
Obsidian | Level 7

Thanks for the consideration on this.  I will definitely consider these techniques if the future.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!

Submit your idea!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 14 replies
  • 1080 views
  • 3 likes
  • 5 in conversation