Hello!
At the moment I am working with a very wide dataset, that contains more than 500 columns (not exactly following Maxim 19).
In the dataset each row represents an individual and all their healthdata. dcode1, dcode2...dcode500.
Here is my dataset (simplified)
data have;
input
id$ dcode1$ dcode2$ dcode3$;
cards;
1 J01AA02 J01XX05 J01CA01
2 J01XX05 A10AB01 C03CA01
3 J01AA02 A10AB01 A10AB01
;
run;
I want to count how many individuals that have a diagnosis that starts with 'J01', if they have a diagnosis that starts with 'J01' they will be assigned a 1 in the new variable J01_flag, else will get 0. But if they have the diagnosis 'J01XX05' I want to exclude them from the count in J01_flag.
I want do a similar thing for every diagnosis that starts with 'C03'.
Data want;
id dcode1 dcode2 dcode3 J01_flag C03_flag
1 J01AA02 J01XX05 J01CA01 1 0
2 J01XX05 A10AB01 C03CA01 0 1
3 J01AA02 A10AB01 C03CA01 1 1
Is this possible to do with an array or any other function?
If variety is the spice of life ...
data want;
set have;
array code(500) $ dcode1-dcode500;
J01_flag=0;
C03_flag=0;
do i = 1 to 500 until (J01_flag=1 and C03_flag=1);
if code(i)=: 'J01' and code(i) ne 'J01XX05' then J01_flag=1;
if code(i)=: 'C03' and code(i) ne 'C03XX05' then C03_flag=1;
end;
drop i;
run;
There are small parts of the logic that you may need to tweak, since the question left them to the imagination. If one observation contains both a diagnosis of J01XX05 and another diagnosis of J01CA01, it seems that that should count as J01_flag=1. But perhaps not ... easy enough to change if needed.
Also, the C03 "do not use" value seems like it should be C03XX05 but perhaps that is not correct as well.
please try below code
data want;
set have;
array code(3) $ dcode1 dcode2 dcode3;
do i = 1 to 3;
if prxmatch('m/^J01/oi',code(i)) and code(i) ne 'J01XX05' then J01_flag=1;
if prxmatch('m/^C03/oi',code(i)) and code(i) ne 'J01XX05' then C03_flag=1;
end;
drop i;
run;
Here is my approach - avoiding looping
data have(drop=str);
input id$ dcode1 $ dcode2 $ dcode3 $;
array codes (*) $ dcode1-dcode3;
length str $32767;
str=cats(of codes[*]);
if (index(str,'J01XX05') gt 0) then
do;
J01_flag = 0;
C03_flag = 0;
end;
else
do;
J01_flag = count(str,'J01')**0;
C03_flag = count(str,'C03')**0;
end;
cards;
1 J01AA02 J01XX05 J01CA01
2 J01XX05 A10AB01 C03CA01
3 J01AA02 A10AB01 C03CA01
;
run;
Your code will produce false positives if the strings J01 or C03 appear within a dcode.
According to your description, ID 1 should have j01_flag as zero, as it has dcode2 as "J01XX05"; anyway, this is my code suggestion:
data have;
input
id$ dcode1$ dcode2$ dcode3$;
cards;
1 J01AA02 J01XX05 J01CA01
2 J01XX05 A10AB01 C03CA01
3 J01AA02 A10AB01 A10AB01
;
proc transpose data=have out=long (drop=_name_ rename=(col1=dcode));
by id;
var dcode:;
run;
data want (drop=dcode dc);
merge
long
long (in=no rename=(dcode=dc) where=(dc="J01XX05"))
;
by id;
retain
j01_flag
c03_flag
;
if first.id
then do;
j01_flag = 0;
c03_flag = 0;
end;
if index(dcode,'J01') = 1 then j01_flag = 1;
if index(dcode,'C03') = 1 then c03_flag = 1;
if last.id;
if no then j01_flag = 0;
run;
You can merge the result back to dataset have.
If variety is the spice of life ...
data want;
set have;
array code(500) $ dcode1-dcode500;
J01_flag=0;
C03_flag=0;
do i = 1 to 500 until (J01_flag=1 and C03_flag=1);
if code(i)=: 'J01' and code(i) ne 'J01XX05' then J01_flag=1;
if code(i)=: 'C03' and code(i) ne 'C03XX05' then C03_flag=1;
end;
drop i;
run;
There are small parts of the logic that you may need to tweak, since the question left them to the imagination. If one observation contains both a diagnosis of J01XX05 and another diagnosis of J01CA01, it seems that that should count as J01_flag=1. But perhaps not ... easy enough to change if needed.
Also, the C03 "do not use" value seems like it should be C03XX05 but perhaps that is not correct as well.
Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.
Register today!Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.