Hi All,
I want to determine the proportion of each ICD10 category that appears in a dataset of approx 100,000 records containing 25 diagnosis variables (character), and compare these between exposed and non-exposed individuals. This would allow me to determine, for example, X% of exposed people have F10 diagnosis and X% of non-exposed people have F10 diagnosis (or any other diagnosis of interest).
I'm currently doing this by writing a line of code for each new variable (A to Z by 1 to 99) which equates to 26x99 = 2574 new ICD variables (see code except below). This method is proving to be very slow and take a lot of processing time!
data ICD_flags;
set ICD;
array dc{25} diagnosis ediag1-ediag20 ecode1-ecode4;
/* flag occurrences of all ICD categories */
DO i=1 to 25;
if dc{i} in : ('A01') then A01=1;
if dc{i} in : ('A02') then A02=1;
if dc{i} in : ('A03') then A03=1;
if dc{i} in : ('A04') then A04=1;
if dc{i} in : ('A05') then A05=1;
if dc{i} in : ('A06') then A06=1;
if dc{i} in : ('A07') then A07=1;
if dc{i} in : ('A08') then A08=1;
if dc{i} in : ('A09') then A09=1;
if dc{i} in : ('A10') then A10=1;
if dc{i} in : ('A11') then A11=1;
if dc{i} in : ('A12') then A12=1;
if dc{i} in : ('A13') then A13=1;
if dc{i} in : ('A14') then A14=1;
if dc{i} in : ('A15') then A15=1;
if dc{i} in : ('A16') then A16=1;
...
...
if dc{i} in : ('Z97') then Z97=1;
if dc{i} in : ('Z98') then Z98=1;
if dc{i} in : ('Z99') then Z99=1;
END;
RUN;
Is there a more efficient way of doing this? I thought an alternative could be creating a new two dimensional array where each cell represents a diagnosis category. For example, each row represents ICD letters A to Z, and each column represents code categories 1 to 99. But I don't know how to do this.
I hope this makes sense, thanks in advance!
Creating 2574 seems to be a bad idea, but maybe this is what you need for the next steps. I doubt it. Creating only one category variable and multiple observations could even reduce the code in subsequent steps. In this case, all you would need is something like
category = substr(dc[i], 1, 3);
output;
inside the loop.
If you have to create those variables (code is untested):
data ICD_flags;
set ICD;
array dc diagnosis ediag1-ediag20 ecode1-ecode4;
length _code_list $ 11000 _code $ 3;
retain _code_list;
drop _code_list _code i j;
/* fill _code_list, t */
if _n_ = 1 then do;
do i = 65 to 65+26;
do j = 1 to 99;
_code = cats(byte(i), put(j, z2.));
_code_list = catx(' ', _code_list, _code);
end;
end;
end;
array flags A01-A99 B01-B99 C01-C99 /* !! Won't work without adding all ranges here ... Z01-Z99 */;
do i = 1 to dim(dc);
if not missing(dc[i]) then do;
j = findw(_code_list, substr(dc[i], 1, 3), ' ', 'e');
if j > 0 then do;
flags[j] = 1;
end;
end;
end;
RUN
Idea: The _code_list contains the variable names in the same order they have in the array flags, the function findw with option "e" returns the position of the first three chars of the icd in _code_list which is the index of the flag variable in the array.
Creating 2574 seems to be a bad idea, but maybe this is what you need for the next steps. I doubt it. Creating only one category variable and multiple observations could even reduce the code in subsequent steps. In this case, all you would need is something like
category = substr(dc[i], 1, 3);
output;
inside the loop.
If you have to create those variables (code is untested):
data ICD_flags;
set ICD;
array dc diagnosis ediag1-ediag20 ecode1-ecode4;
length _code_list $ 11000 _code $ 3;
retain _code_list;
drop _code_list _code i j;
/* fill _code_list, t */
if _n_ = 1 then do;
do i = 65 to 65+26;
do j = 1 to 99;
_code = cats(byte(i), put(j, z2.));
_code_list = catx(' ', _code_list, _code);
end;
end;
end;
array flags A01-A99 B01-B99 C01-C99 /* !! Won't work without adding all ranges here ... Z01-Z99 */;
do i = 1 to dim(dc);
if not missing(dc[i]) then do;
j = findw(_code_list, substr(dc[i], 1, 3), ' ', 'e');
if j > 0 then do;
flags[j] = 1;
end;
end;
end;
RUN
Idea: The _code_list contains the variable names in the same order they have in the array flags, the function findw with option "e" returns the position of the first three chars of the icd in _code_list which is the index of the flag variable in the array.
Here is some dummy data containing ID, DIAGNOSIS, EDIAG1-5, and EXPOSURE for 1000 obs.
Cheers
Assuming you have some data with at least the following variables:
Then:
Untested because no data was provided but generic idea:
proc transpose data=have out=long;
by id exposed;
var diagnosis ediag1-ediag20 ecode1-ecode4;
run;
data long_codes;
set long;
diag_code = substr(Col1, 1, 3);
run;
proc sort data=long_codes out=long_unique;
by id diag_code;
run;
proc freq data=long_unique;
table diag_code*exposed /out = want outpercent;
run;
@Sazed wrote:
Hi All,
I want to determine the proportion of each ICD10 category that appears in a dataset of approx 100,000 records containing 25 diagnosis variables (character), and compare these between exposed and non-exposed individuals. This would allow me to determine, for example, X% of exposed people have F10 diagnosis and X% of non-exposed people have F10 diagnosis (or any other diagnosis of interest).
I'm currently doing this by writing a line of code for each new variable (A to Z by 1 to 99) which equates to 26x99 = 2574 new ICD variables (see code except below). This method is proving to be very slow and take a lot of processing time!
data ICD_flags; set ICD; array dc{25} diagnosis ediag1-ediag20 ecode1-ecode4; /* flag occurrences of all ICD categories */ DO i=1 to 25; if dc{i} in : ('A01') then A01=1; if dc{i} in : ('A02') then A02=1; if dc{i} in : ('A03') then A03=1; if dc{i} in : ('A04') then A04=1; if dc{i} in : ('A05') then A05=1; if dc{i} in : ('A06') then A06=1; if dc{i} in : ('A07') then A07=1; if dc{i} in : ('A08') then A08=1; if dc{i} in : ('A09') then A09=1; if dc{i} in : ('A10') then A10=1; if dc{i} in : ('A11') then A11=1; if dc{i} in : ('A12') then A12=1; if dc{i} in : ('A13') then A13=1; if dc{i} in : ('A14') then A14=1; if dc{i} in : ('A15') then A15=1; if dc{i} in : ('A16') then A16=1; ... ... if dc{i} in : ('Z97') then Z97=1; if dc{i} in : ('Z98') then Z98=1; if dc{i} in : ('Z99') then Z99=1; END; RUN;
Is there a more efficient way of doing this? I thought an alternative could be creating a new two dimensional array where each cell represents a diagnosis category. For example, each row represents ICD letters A to Z, and each column represents code categories 1 to 99. But I don't know how to do this.
I hope this makes sense, thanks in advance!
Thanks @Reeza .
Your code executed fine and the format of the outputted results is what I'm after, however I get less counts compared @andreas_lds code and spot checking against my code.
I'm not sure if this syntax is scanning all diagnosis fields or just the first diagnosis field.
libname x v9 'c:\temp';
data have;
set x.rand_icd(keep=diagnosis--ediag5);
run;
proc iml;
use have;
read all var _char_ into x[c=vname];
x=substr(x,1,3);
levels=unique(x);
yn=j(nrow(x),ncol(levels),.);
do i=1 to nrow(x);
yn[i,]=element(levels,x[i,]);
end;
create yn from yn[c=levels];
append from yn;
close;
quit;
data want;
merge x.rand_icd yn;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.