Hello All,
I need some programming help. My dataset looks like.....
ID start_month start_year end_month end_year enrol_1_13 enrol_2_13 enrol_3_13 enrol_4_13 enrol_5_13...........................enrol_12_18
1 7 14 8 15 0 0 1 1 1 0
2 1 13 5 13 1 1 1 1 1 1
3 2 13 5 13 0 0 1 1 1 0
Here start_month and start_year are start month and year of a disease episode and end_month and end_year are end month and year of the same disease. Now I want to see whether that particular individual was insured during the whole time he had the disease. The enrol variables here represents enrollment status for each month from 2013 to 2018 e.g. enrol_1_13 stands for enrollment status of the patient for January month of 2013.
For example patient ID 2 here had disease from 1/2013 to 5/2013 and his enrollment status was 1 for every month during this period, so I want to flag this patient's final enrollment status as "Y" but for patient ID 3 it should be "N" as his disease started from 2/2013 but his enrollment status for 2/2013 was 0.
I would suggest associating a 2-dimensional array with your dummy variables. The row dimension are the years 2013 through 2018, and the columns months 1 through 12:
data want;
set have;
array dummies {2013:2018,1:12} enrol_1_13 -- enrol_12_18;
begdate=mdy(start_month,1,2000+start_year);
enddate=mdy(end_month,1,2000+end_year);
insured='Y'; /* Initialize */
do date=begdate to enddate while (insured='Y');
if dummies{year(date),month(date)}=0 then insured='N';
date=intnx('month',date,0,'end');
end;
drop begdate enddate;
run;
I've edited the DO loop above. The "by 0" was not accepted, unlike some other contexts I've used it in.
ID start_month start_year end_month end_year enrol_1_13 enrol_2_13 enrol_3_13 enrol_4_13 enrol_5_13.................enrol_12_18
1 7 14 8 15 0 0 1 1 1 0
2 1 13 5 13 1 1 1 1 1 1
3 2 13 5 13 0 0 1 1 1 0
1. This is a very lousy data layout. You'll be struggling to do any data manipulation, as you are now.
2. This does what you want:
data T;
input ID START_MONTH START_YEAR END_MONTH END_YEAR ENROL_1_13 ENROL_2_13 ENROL_3_13 ENROL_4_13 ENROL_5_13;
array ENROL [*] ENROL: ;
START_INDEX = START_YEAR-13+START_MONTH;
END_INDEX = END_YEAR-13 +END_MONTH;
do I=START_INDEX to END_INDEX;
SUM=sum(SUM,ENROL[I]);
N =sum(N,1);
end;
FLAG=SUM=N;
cards;
2 1 13 5 13 1 1 1 1 1 1
3 2 13 5 13 0 0 1 1 1 1
run;
Thank you Chris for your response, I tried your code but it is giving me an error "Array subscript out of range" for line
SUM=sum(SUM,ENROL[I]);
Also, just be clear on this the variable with ENROL prefix starts from Jan 2013 (ENROL_1_13) and ends at Dec 2018 (ENROL_12_18). So there are a total 72 variables.
I might sound stupid as I am new to SAS programming.
What's the value of I?
Make the array larger if you need to, but if you declare the associated variables correctly the size will be correct.
If you have end dates after 2018, you should test for that as you know the flag is 0 and there is no provision for this when the array is defined.
The value of i is 0 for the error.
Although as you said I have end year 2019 as well start year 2012.
Well don't test these values:
1. You know the outcome (FLAG=0), and
2. The array can't deal with them (since it spans 2013 to 2018)
I created new variables ENROL for all months of 2012 and 2019 and assigned them value of 1, assuming that everyone had insurance for that period of time and then re-ran the code. It worked but I still can't figure out who had insurance during the tenure of their disease. Please help me figure that out.
Sorry I missed the flag the first time around. It is still not giving me the correct results, it is not flagging for many observations which should have been flagged.
Thank you
Just to illustrate how good data design enables simple and easy to maintain code:
(taking @ChrisNZ's example dataset)
First, bring the data into shape:
data T;
input ID START_MONTH START_YEAR END_MONTH END_YEAR ENROL_1_13 ENROL_2_13 ENROL_3_13 ENROL_4_13 ENROL_5_13;
array ENROL [*] ENROL: ;
START_INDEX = START_YEAR-13+START_MONTH;
END_INDEX = END_YEAR-13 +END_MONTH;
do I=START_INDEX to END_INDEX;
SUM=sum(SUM,ENROL[I]);
N =sum(N,1);
end;
FLAG=SUM=N;
cards;
2 1 13 5 13 1 1 1 1 1 1
3 2 13 5 13 0 0 1 1 1 1
;
data patient;
set t;
keep id start_date end_date;
format start_date end_date yymmddd10.;
start_date = mdy(start_month,1,start_year);
end_date = mdy(end_month,1,end_year);
run;
proc transpose
data=t
out=trans (rename=(col1=enrol))
;
by id;
var enrol:;
run;
data enrol;
set trans;
format period yymmddd10.;
period = mdy(input(scan(_name_,2,'_'),2.),1,2000+input(scan(_name_,3,'_'),2.));
drop _name_;
run;
Now you have data with a small number of columns (3 each), all dates are SAS dates, and the final code basically writes itself:
data want;
merge
patient
enrol
;
by id;
retain enrolled;
if first.id then enrolled = 'Y';
if start_date le period le end_date and enrol = 0 then enrolled = 'N';
keep id enrolled;
if last.id;
run;
proc print data=want noobs;
run;
Result:
ID enrolled 2 Y 3 N
With the long structure, other analysis becomes easy, like summing up the number of sick per given months, comparing months across years, and so on.
To repeat:
Maxim 33: Intelligent Data Makes for Intelligent Programs
Maxim 19: Long Beats Wide
and also Maxim 29: If in Doubt, Use Brute Force ("Brute Force" here being the simple code you need in the end), or also known as the KISS principle.
Thank you Kurt. This makes sense, simplifying the data makes it straight-forward.
I would suggest associating a 2-dimensional array with your dummy variables. The row dimension are the years 2013 through 2018, and the columns months 1 through 12:
data want;
set have;
array dummies {2013:2018,1:12} enrol_1_13 -- enrol_12_18;
begdate=mdy(start_month,1,2000+start_year);
enddate=mdy(end_month,1,2000+end_year);
insured='Y'; /* Initialize */
do date=begdate to enddate while (insured='Y');
if dummies{year(date),month(date)}=0 then insured='N';
date=intnx('month',date,0,'end');
end;
drop begdate enddate;
run;
I've edited the DO loop above. The "by 0" was not accepted, unlike some other contexts I've used it in.
Hi Mkeintz,
Thank you for your input but it is still giving me error with 0. Log is displaying error that it is invalid do loop control information.....................
First, split your dataset into two datasets, one with the start/end dates, the other with the enrol flags. Transpose the enrol dataset, and convert all date-related values to SAS dates. The final code will then be a breeze.
Maxim 33: Intelligent Data Makes for Intelligent Programs.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.