Hello,
I have a list of pneumococcal vaccination administrations (n=2,400,000) and am trying to determine the best way to indicate if each person has met the criteria to be considered complete. The dataset is vertical. I have unique identifiers for each person and if that person has received more than one vaccination then they will have multiple observations listed. To be complete the person will need to meet one of these criteria:
What would be the most efficient way to handle marking individuals as complete?
Below is an example using dummy data of what I have vs what I want.
HAVE:
ID | Vaccine |
1 | PCV15 |
1 | PCV13 |
1 | PPSV23 |
2 | PCV13 |
2 | PPSV23 |
3 | PCV20 |
4 | PPSV23 |
4 | PCV15 |
5 | PCV13 |
WANT (where 1 = complete and 0 = incomplete)
ID | Pneumo Complete |
1 | 1 |
2 | 1 |
3 | 1 |
4 | 1 |
5 | 0 |
Thank you
Hello @OlsabeckT29,
Your vertical "HAVE" dataset can also be used directly (i.e., without transposing). If it is sorted or at least grouped by ID (in the latter case use by id notsorted;), binary flags for the four relevant vaccines could be set in a DOW loop and the resulting bit pattern be evaluated after the last observation for the current ID has been processed:
data want(keep=id complete);
b=0;
do until(last.id);
set have;
by id;
select(vaccine);
when('PCV20') b=bor(b,8);
when('PPSV23') b=bor(b,4);
when('PCV13') b=bor(b,2);
when('PCV15') b=bor(b,1);
otherwise;
end;
end;
complete=b>4;
run;
One way:
Note use of data step to provide example data. You really should provide such as otherwise we guess as to variable types and may not result in code that works with your data.
The transpose places all vaccinations for id on a single record. This requires sorting the data beforehand.
With data in this form any of the combinations involving PCV20 really only use your rule of a single PCV20 is required.
data have; input ID $ Vaccine $; datalines; 1 PCV15 1 PCV13 1 PPSV23 2 PCV13 2 PPSV23 3 PCV20 4 PPSV23 4 PCV15 5 PCV13 ; proc sort data=have; by id ; run; proc transpose data=have out=trans prefix=vax; by id; var vaccine; run; data want; set trans; array v (*) vax: ; vaxall= catx(' ',of v(*)); if index(vaxall,'PCV20')>0 then complete=1; else if index(vaxall,'PCV15')>0 and index(vaxall,'PPSV23')>0 then complete=1; else if index(vaxall,'PCV13')>0 and index(vaxall,'PPSV23')>0 then complete=1; else complete=0; run;
The array is just to create a shorter code for creating a space delimited variable with all of the vaccinations in a single variable. It does not assume how many vaccinations are available. The IF block searches for combinations as described in your question. Index function returns a character position so the >0 is used to find if it is present at all. Note that the function is case sensitive. If any of your actual values are may have different character case you should use
vaxall= upcase (catx(' ',of v(*)) );
to make all of the letters upper case for the comparison.
If you have not used arrays much the " of v(*) " is a way that many functions will use all of the values in the array.
After verifying the logic works you could drop the Vax1 - VaxN and Vaxall variables.
@OlsabeckT29 wrote:
Hello,
I have a list of pneumococcal vaccination administrations (n=2,400,000) and am trying to determine the best way to indicate if each person has met the criteria to be considered complete. The dataset is vertical. I have unique identifiers for each person and if that person has received more than one vaccination then they will have multiple observations listed. To be complete the person will need to meet one of these criteria:
- PCV15 + PPSV23
- PCV20
- PPSV23 + PCV15
- PPSV23 + PCV20
- PCV13 + PPSV23
- PCV13 + PCV20
What would be the most efficient way to handle marking individuals as complete?
Below is an example using dummy data of what I have vs what I want.
HAVE:
ID Vaccine 1 PCV15 1 PCV13 1 PPSV23 2 PCV13 2 PPSV23 3 PCV20 4 PPSV23 4 PCV15 5 PCV13
WANT (where 1 = complete and 0 = incomplete)
ID Pneumo Complete 1 1 2 1 3 1 4 1 5 0
Thank you
Slightly simpler:
data want;
set trans;
if whichc('PCV20', of vax:) then complete=1;
else if whichc('PCV15',of vax:) and whichc('PPSV23',of vax:) then complete=1;
else if whichc('PCV13',of vax:) and whichc('PPSV23',of vax:) then complete=1;
else complete=0;
run;
I note that if there is also a code similar to the ones you list but with extra characters on the end, such as PCV20A, and you want complete to be zero in this case, then the solution from @ballardw will fail without additional programming.
Hello @OlsabeckT29,
Your vertical "HAVE" dataset can also be used directly (i.e., without transposing). If it is sorted or at least grouped by ID (in the latter case use by id notsorted;), binary flags for the four relevant vaccines could be set in a DOW loop and the resulting bit pattern be evaluated after the last observation for the current ID has been processed:
data want(keep=id complete);
b=0;
do until(last.id);
set have;
by id;
select(vaccine);
when('PCV20') b=bor(b,8);
when('PPSV23') b=bor(b,4);
when('PCV13') b=bor(b,2);
when('PCV15') b=bor(b,1);
otherwise;
end;
end;
complete=b>4;
run;
/*
Very interesting question
*/
data have;
input ID $ Vaccine $;
datalines;
1 PCV15
1 PCV13
1 PPSV23
2 PCV13
2 PPSV23
3 PCV20
4 PPSV23
4 PCV15
5 PCV13
;
data have;
set have;
dummy=1;
run;
proc glmselect data=have noprint outdesign(addinputvars)=temp;
class Vaccine;
model dummy= Vaccine/selection=none noint;
run;
proc summary data=temp nway;
class id;
var Vaccine_:;
output out=want(drop=_:) max=;
run;
data want;
set want;
Completed=
(Vaccine_PCV15 and Vaccine_PPSV23) or
Vaccine_PCV20 or
(Vaccine_PPSV23 and Vaccine_PCV15) or
(Vaccine_PPSV23 and Vaccine_PCV20) or
(Vaccine_PCV13 and Vaccine_PPSV23) or
(Vaccine_PCV13 and Vaccine_PCV20)
;
run;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.