Hi there!
I'm trying to count the number of times coverage for various drugs (all of which have 0s for 'not covered' and 1 for 'covered) over time in each state (which its own categorical variable). I also have two date variables (mmddyyyy). Each date represents a new entry for coverage status for each drug in each state. There is no pattern to the entries for each state; instead, every time a drug's coverage changes, a new entry is made. There could be an entry on 10/01/2020 on which Montana's Medicaid covers say, alprazolam (i.e., alprazolman_var=1), but on 10/21/2020, alprazolam_var=0. This happens for each drug in the dataset and for each state. I'm just trying to count (1) the number of 'additions,' (i.e., drug_var=1 after drug_var was 0 in a previous entry for the state) and (2) subtractions (i.e., drug_var=0 after drug_var was 1 in a previous entry for the state) over time. I've tried using a lag function. I'd really like to be able to code for this rather than count manually!
Depends a bit on your data structure - please show a sample of your data. For example, is it structured as:
ID Drug Date Flag
1 Drug1 20140104 1
1 Drug1 20140403 0
Or you could have multiple variables for each flag?
If you want to count switches use the, first sort by DRUG, DATE.
Use the NOTSORTED option on the BY statement in a data step and include the FLAG in your BY statement in the data step. Every time you hit the first.FLAG in the group you'll have a switch so increment that and output it at the last record.
If you want a code example please provide a data example.
One suspects that a State variable should be included and probably the first sort order variable.
Also please confirm ... how should we handle the first observation for each drug? Does it count as a switch, or should it be ignored.
Unless you want only generalized answers, show a sample of your data, in the form of a working data step. I'm especially curious about your statement that you have two date variables. What would be the purpose of the second data variable? And show what you want the results to look like.
data have;
input State Edate Vthrudate MATBupie MATMeth MATNal;
datalines;
Alabama 08012017 09302017 0 0 1
Alabama 10012017 10012017 1 0 1
Alabama 10022017 01012020 1 0 0
Alaska 09162018 01012019 1 0 0
Alaska 01022019 10032019 1 1 1
Alaska 10042019 01012020 1 0 1
Alaska 01022019 07032020 1 1 1
Michigan 11032020 11042020 1 1 0
Michigan 11052020 04172021 0 1 1
;
run;
What I'm looking for is a way to count up the number of times a drug's (MATBupie - MATNal) coverage is added (i.e., 0->1) or removed (i.e., 1->0) for each drug per day according to the date of entry. For example, I'm looking to track the frequency in which MATMeth's coverage in Alaska goes from 0 to 1 according to the date of entry. I'm going to calculate something like an incidence rate for coverage change. Does this all make sense? It's not as simple as counting the 0s and 1s at the most recent entry date; doing so will provide me only a cross-section of coverage rather than a method of prediction, per se. Furthermore, I need a way to do all this for drugs as individuals; not a blanket indication across all drugs per entry per state.
I think that you want the number of changes per state, which would mean you want an output dataset, or report, with one observation, or one line, per state.
Your data step is not a working data step, so the program below is untested:
proc sort data=have;
by state edate;
run;
data want (keep=state flip:);
set have;
by state;
array drug matbupie metmeth matnal;
array flips flip_matbupie flip_matmeth flip_matnal;
if first.state then do over flips;
flips=0;
end;
do over drug;
flips+ (first.state=0 and (dif(drug) in (1,-1)));
end;
if last.state;
run;
The DIF function is defined as DIF(x) = X - LAG(x), which for your sequence of dummy variables will return a 1, 0, or -1. (Except for the first observation where LAG(x) is missing, so the first DIF(x) is also missing).
Question: for Connecticut, the output data has a value of 2 for the effective date (1/1/2020). Does this mean that there were two total flips, regardless of direction, over the course of data collection? Or is the code erroneously summing two '1s'? Additionally, when I in (1, 0, -1), the count for each drug flip goes much higher. Why is that?
@mkeintz Following up on this.
I see no sample data for Connecticut, so I have no way of answering your question.
So you want one number per state per drug?
First let's fix your data step to create an actual SAS dataset.
data have;
input State :$20. Edate :mmddyy. Vthrudate :mmddyy. MATBupie MATMeth MATNal;
format edate vthrudate yymmdd10.;
datalines;
Alabama 08012017 09302017 0 0 1
Alabama 10012017 10012017 1 0 1
Alabama 10022017 01012020 1 0 0
Alaska 09162018 01012019 1 0 0
Alaska 01022019 10032019 1 1 1
Alaska 10042019 01012020 1 0 1
Alaska 01022019 07032020 1 1 1
Michigan 11032020 11042020 1 1 0
Michigan 11052020 04172021 0 1 1
;
Then let's transpose it so that the drug name gets out of the NAME of the variable into the VALUE of a variable instead.
And then re-order the results.
proc transpose data=have out=tall name=drug;
by state edate vthrudate notsorted;
var mat: ;
run;
proc sort data=tall;
by state drug edate ;
run;
Now we can use the FIRST flag to count how many times the value changes per state per drug.
data want;
do until (last.drug);
set tall;
by state drug col1 notsorted;
switch=sum(switch,first.col1 and not first.drug);
end;
keep state drug switch ;
run;
Results:
Obs State drug switch 1 Alabama MATBupie 1 2 Alabama MATMeth 0 3 Alabama MATNal 1 4 Alaska MATBupie 0 5 Alaska MATMeth 2 6 Alaska MATNal 1 7 Michigan MATBupie 1 8 Michigan MATMeth 0 9 Michigan MATNal 1
Here's one guess, could be refined if the data is not as expected:
proc sort data=have;
by state drug date;
run;
data want;
set have;
by state drug notsorted;;
if first.drug then do;
switch_to_1=0;
switch_to_0=0;
end;
if first.drug and not first.state then do;
if coverage=1 then switch_to_1 + 1;
else if coverage=0 then switch_to_0 + 1;
end;
if last.drug;
run;
You may have to adjust for the actual variable names, and for any changes you would prefer to see for the structure of the results. Here, there will be 1 observation per state/drug combination. You would probably want to limit the variables you keep, since the observation still contains the DATE and COVERAGE information from the final observation.
Is there a way to code all of this without having to run a copy for each drug? I have >20.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.