BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
NickS2
Obsidian | Level 7

Hello All,

 

I need some programming help. My dataset looks like.....

ID   start_month     start_year     end_month       end_year    enrol_1_13    enrol_2_13       enrol_3_13    enrol_4_13        enrol_5_13...........................enrol_12_18

1           7                    14                   8                     15                0                    0                         1                  1                            1                                             0

2           1                    13                   5                     13                1                    1                         1                  1                            1                                             1

3           2                    13                   5                     13                0                    0                         1                  1                            1                                             0

 

Here start_month and start_year are start month and year of a disease episode and end_month and end_year are end month and year of the same disease. Now I want to see whether that particular individual was insured during the whole time he had the disease. The enrol variables here represents enrollment status for each month from 2013 to 2018 e.g. enrol_1_13 stands for enrollment status of the patient for January month of 2013.

 

For example patient ID 2 here had disease from 1/2013 to 5/2013 and his enrollment status was 1 for every month during this period, so I want to flag this patient's final enrollment status as "Y" but for patient ID 3 it should be "N" as his disease started from 2/2013 but his enrollment status for 2/2013 was 0.

 

 

 

 

       

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

I would suggest associating a 2-dimensional array with your dummy variables.  The row dimension are the years 2013 through 2018, and the columns months 1 through 12:

 

data want;
  set have;
  array dummies {2013:2018,1:12}  enrol_1_13 -- enrol_12_18;

  begdate=mdy(start_month,1,2000+start_year);
  enddate=mdy(end_month,1,2000+end_year);
  
  insured='Y';  /* Initialize */
  do date=begdate to enddate while (insured='Y');
    if dummies{year(date),month(date)}=0 then insured='N';
    date=intnx('month',date,0,'end');
  end;
  drop begdate enddate;
run;

I've edited the DO loop above.  The "by 0" was not accepted, unlike some other contexts I've used it in. 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

14 REPLIES 14
NickS2
Obsidian | Level 7

 

ID   start_month     start_year     end_month       end_year    enrol_1_13    enrol_2_13       enrol_3_13    enrol_4_13     enrol_5_13.................enrol_12_18

1           7                    14                   8                     15                0                    0                         1                  1                        1                                   0

2           1                    13                   5                     13                1                    1                         1                  1                        1                                   1

3           2                    13                   5                     13                0                    0                         1                  1                        1                                   0

 

 

ChrisNZ
Tourmaline | Level 20

1. This is a very lousy data layout. You'll be struggling to do any data manipulation, as you are now.

2. This does what you want:

data T;
  input ID START_MONTH START_YEAR END_MONTH END_YEAR ENROL_1_13 ENROL_2_13 ENROL_3_13 ENROL_4_13 ENROL_5_13;
  array ENROL [*] ENROL: ;
  START_INDEX = START_YEAR-13+START_MONTH;
  END_INDEX   = END_YEAR-13  +END_MONTH;
  do I=START_INDEX to END_INDEX; 
    SUM=sum(SUM,ENROL[I]);
    N  =sum(N,1);
  end;
  FLAG=SUM=N;
cards;
2 1 13 5 13 1 1 1 1 1 1
3 2 13 5 13 0 0 1 1 1 1
run;

 

NickS2
Obsidian | Level 7

Thank you Chris for your response, I tried your code but it is giving me an error "Array subscript out of range" for line 

SUM=sum(SUM,ENROL[I]);

Also, just be clear on this the variable with ENROL prefix starts from Jan 2013 (ENROL_1_13) and ends at Dec 2018 (ENROL_12_18). So there are a total 72 variables.

 

I might sound stupid as I am new to SAS programming. 

ChrisNZ
Tourmaline | Level 20

What's the value of I?
Make the array larger if you need to, but if you declare the associated variables correctly the size will be correct.

If you have end dates after 2018, you should test for that as you know the flag is 0 and there is no provision for this when the array is defined.

NickS2
Obsidian | Level 7

The value of i is 0 for the error. 

 

Although as you said I have end year 2019 as well start year 2012.

ChrisNZ
Tourmaline | Level 20

Well don't test these values:

1. You know the outcome (FLAG=0), and

2. The array can't deal with them (since it spans 2013 to 2018)

NickS2
Obsidian | Level 7

I created new variables ENROL for all months of 2012 and 2019 and assigned them value of 1, assuming that everyone had insurance for that period of time and then re-ran the code. It worked but I still can't figure out who had insurance during the tenure of their disease. Please help me figure that out.

 

ChrisNZ
Tourmaline | Level 20
??That's what the flag variable is for isn't it?
NickS2
Obsidian | Level 7

Sorry I missed the flag the first time around. It is still not giving me the correct results, it is not flagging for many observations which should have been flagged.

 

Thank you

 

Kurt_Bremser
Super User

Just to illustrate how good data design enables simple and easy to maintain code:

(taking @ChrisNZ's example dataset)

First, bring the data into shape:

data T;
  input ID START_MONTH START_YEAR END_MONTH END_YEAR ENROL_1_13 ENROL_2_13 ENROL_3_13 ENROL_4_13 ENROL_5_13;
  array ENROL [*] ENROL: ;
  START_INDEX = START_YEAR-13+START_MONTH;
  END_INDEX   = END_YEAR-13  +END_MONTH;
  do I=START_INDEX to END_INDEX; 
    SUM=sum(SUM,ENROL[I]);
    N  =sum(N,1);
  end;
  FLAG=SUM=N;
cards;
2 1 13 5 13 1 1 1 1 1 1
3 2 13 5 13 0 0 1 1 1 1
;

data patient;
set t;
keep id start_date end_date;
format start_date end_date yymmddd10.;
start_date = mdy(start_month,1,start_year);
end_date = mdy(end_month,1,end_year);
run;

proc transpose
  data=t
  out=trans (rename=(col1=enrol))
;
by id;
var enrol:;
run;

data enrol;
set trans;
format period yymmddd10.;
period = mdy(input(scan(_name_,2,'_'),2.),1,2000+input(scan(_name_,3,'_'),2.));
drop _name_;
run;

Now you have data with a small number of columns (3 each), all dates are SAS dates, and the final code basically writes itself:

data want;
merge
  patient
  enrol
;
by id;
retain enrolled;
if first.id then enrolled = 'Y';
if start_date le period le end_date and enrol = 0 then enrolled = 'N';
keep id enrolled;
if last.id;
run;

proc print data=want noobs;
run;

Result:

ID	enrolled
2	Y
3	N

With the long structure, other analysis becomes easy, like summing up the number of sick per given months, comparing months across years, and so on.

To repeat:

Maxim 33: Intelligent Data Makes for Intelligent Programs

Maxim 19: Long Beats Wide

and also Maxim 29: If in Doubt, Use Brute Force ("Brute Force" here being the simple code you need in the end), or also known as the KISS principle.

NickS2
Obsidian | Level 7

Thank you Kurt. This makes sense,  simplifying the data makes it straight-forward.

 

 

mkeintz
PROC Star

I would suggest associating a 2-dimensional array with your dummy variables.  The row dimension are the years 2013 through 2018, and the columns months 1 through 12:

 

data want;
  set have;
  array dummies {2013:2018,1:12}  enrol_1_13 -- enrol_12_18;

  begdate=mdy(start_month,1,2000+start_year);
  enddate=mdy(end_month,1,2000+end_year);
  
  insured='Y';  /* Initialize */
  do date=begdate to enddate while (insured='Y');
    if dummies{year(date),month(date)}=0 then insured='N';
    date=intnx('month',date,0,'end');
  end;
  drop begdate enddate;
run;

I've edited the DO loop above.  The "by 0" was not accepted, unlike some other contexts I've used it in. 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
NickS2
Obsidian | Level 7

Hi Mkeintz,

 

Thank you for your input but it is still giving me error with 0. Log is displaying error that it is invalid do loop control information.....................

 

 

Kurt_Bremser
Super User

First, split your dataset into two datasets, one with the start/end dates, the other with the enrol flags. Transpose the enrol dataset, and convert all date-related values to SAS dates. The final code will then be a breeze.

Maxim 33: Intelligent Data Makes for Intelligent Programs.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 14 replies
  • 1540 views
  • 6 likes
  • 4 in conversation