topic Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs in SAS Programming

Need efficient method for creating new variables to count how often each ICD-10 category occurs

Sazed — Mon, 27 Jun 2022 04:48:53 GMT

Hi All,

I want to determine the proportion of each ICD10 category that appears in a dataset of approx 100,000 records containing 25 diagnosis variables (character), and compare these between exposed and non-exposed individuals. This would allow me to determine, for example, X% of exposed people have F10 diagnosis and X% of non-exposed people have F10 diagnosis (or any other diagnosis of interest).

I'm currently doing this by writing a line of code for each new variable (A to Z by 1 to 99) which equates to 26x99 = 2574 new ICD variables (see code except below). This method is proving to be very slow and take a lot of processing time!

data ICD_flags;
set ICD;
array dc{25} diagnosis ediag1-ediag20 ecode1-ecode4;

/*  flag occurrences of all ICD categories */
DO i=1 to 25;
if dc{i} in : ('A01') then A01=1;
if dc{i} in : ('A02') then A02=1;
if dc{i} in : ('A03') then A03=1;
if dc{i} in : ('A04') then A04=1;
if dc{i} in : ('A05') then A05=1;
if dc{i} in : ('A06') then A06=1;
if dc{i} in : ('A07') then A07=1;
if dc{i} in : ('A08') then A08=1;
if dc{i} in : ('A09') then A09=1;
if dc{i} in : ('A10') then A10=1;
if dc{i} in : ('A11') then A11=1;
if dc{i} in : ('A12') then A12=1;
if dc{i} in : ('A13') then A13=1;
if dc{i} in : ('A14') then A14=1;
if dc{i} in : ('A15') then A15=1;
if dc{i} in : ('A16') then A16=1;
...
...
if dc{i} in : ('Z97') then Z97=1;
if dc{i} in : ('Z98') then Z98=1;
if dc{i} in : ('Z99') then Z99=1;
END;
RUN;

Is there a more efficient way of doing this? I thought an alternative could be creating a new two dimensional array where each cell represents a diagnosis category. For example, each row represents ICD letters A to Z, and each column represents code categories 1 to 99. But I don't know how to do this.

I hope this makes sense, thanks in advance!

Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs

andreas_lds — Mon, 27 Jun 2022 06:09:01 GMT

Creating 2574 seems to be a bad idea, but maybe this is what you need for the next steps. I doubt it. Creating only one category variable and multiple observations could even reduce the code in subsequent steps. In this case, all you would need is something like

category = substr(dc[i], 1, 3);
output;

inside the loop.

If you have to create those variables (code is untested):

data ICD_flags;
   set ICD;
   array dc diagnosis ediag1-ediag20 ecode1-ecode4;
   
   length _code_list $ 11000 _code $ 3;
   
   retain _code_list;
   drop _code_list _code i j;
   
   /* fill _code_list, t */
   if _n_ = 1 then do;
      do i = 65 to 65+26; 
         do j = 1 to 99;
            _code = cats(byte(i), put(j, z2.));
            _code_list = catx(' ', _code_list, _code);
         end;
      end;
   end;

   array flags A01-A99 B01-B99 C01-C99 /* !! Won't work without adding all ranges here ... Z01-Z99 */;
   
   do i = 1 to dim(dc);
      if not missing(dc[i]) then do;
         j = findw(_code_list, substr(dc[i], 1, 3), ' ', 'e');         
         if j > 0 then do;
            flags[j] = 1;
         end;
      end;
   end;

RUN

Idea: The _code_list contains the variable names in the same order they have in the array flags, the function findw with option "e" returns the position of the first three chars of the icd in _code_list which is the index of the flag variable in the array.

Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs

Sazed — Mon, 27 Jun 2022 08:00:15 GMT

Thanks very much for your informative suggestions. I will test out the syntax and report back.
Cheers!

Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs

Ksharp — Mon, 27 Jun 2022 09:17:45 GMT

Could you provide some dummy data ?

Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs

Reeza — Mon, 27 Jun 2022 15:53:47 GMT

Assuming you have some data with at least the following variables:

ID for each individual
25 diagnosis codes
Indicator for Exposed, non exposed (assume variable = EXPOSED).

Then:

Transpose to long format so that you have an ID, Exposed, DiagnosisNumber, DiagnosisCode
Create a new Diagnosis Variable based on Diagnosis Code, that is only the first three characters - Use SUBSTR()
Decide if you need to de-duplicate the records due to a single person/record having multiple diagnosis codes that start with the same string
Run a proc freq to see the % differences between the exposed and diagnosisCode. I think you'll be interested in the row percentages from PROC FREQ.

Untested because no data was provided but generic idea:

proc transpose data=have out=long;
by id exposed;
var diagnosis ediag1-ediag20 ecode1-ecode4;
run;

data long_codes;
set long;

diag_code = substr(Col1, 1, 3);
run;

proc sort data=long_codes out=long_unique;
by id diag_code;
run;

proc freq data=long_unique;
table diag_code*exposed /out = want outpercent;
run;

@Sazed wrote:

Hi All,

I want to determine the proportion of each ICD10 category that appears in a dataset of approx 100,000 records containing 25 diagnosis variables (character), and compare these between exposed and non-exposed individuals. This would allow me to determine, for example, X% of exposed people have F10 diagnosis and X% of non-exposed people have F10 diagnosis (or any other diagnosis of interest).

I'm currently doing this by writing a line of code for each new variable (A to Z by 1 to 99) which equates to 26x99 = 2574 new ICD variables (see code except below). This method is proving to be very slow and take a lot of processing time!
data ICD_flags;
set ICD;
array dc{25} diagnosis ediag1-ediag20 ecode1-ecode4;

/*  flag occurrences of all ICD categories */
DO i=1 to 25;
if dc{i} in : ('A01') then A01=1;
if dc{i} in : ('A02') then A02=1;
if dc{i} in : ('A03') then A03=1;
if dc{i} in : ('A04') then A04=1;
if dc{i} in : ('A05') then A05=1;
if dc{i} in : ('A06') then A06=1;
if dc{i} in : ('A07') then A07=1;
if dc{i} in : ('A08') then A08=1;
if dc{i} in : ('A09') then A09=1;
if dc{i} in : ('A10') then A10=1;
if dc{i} in : ('A11') then A11=1;
if dc{i} in : ('A12') then A12=1;
if dc{i} in : ('A13') then A13=1;
if dc{i} in : ('A14') then A14=1;
if dc{i} in : ('A15') then A15=1;
if dc{i} in : ('A16') then A16=1;
...
...
if dc{i} in : ('Z97') then Z97=1;
if dc{i} in : ('Z98') then Z98=1;
if dc{i} in : ('Z99') then Z99=1;
END;
RUN;
Is there a more efficient way of doing this? I thought an alternative could be creating a new two dimensional array where each cell represents a diagnosis category. For example, each row represents ICD letters A to Z, and each column represents code categories 1 to 99. But I don't know how to do this.

I hope this makes sense, thanks in advance!

Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs

Sazed — Wed, 29 Jun 2022 02:35:56 GMT

Thanks Andreas. The ICD_flags data step works perfectly and executed in about 5 sec!
I then just ran a proc tab by exposure to compare counts.

One question if you don't mind...where does the '65' in the DO loop come from?

Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs

Sazed — Wed, 29 Jun 2022 02:55:18 GMT

Ok I think its from 65 because the BYTE function maps the values to A-Z in the ASCII collating sequence.

Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs

Sazed — Wed, 29 Jun 2022 04:28:06 GMT

Here is some dummy data containing ID, DIAGNOSIS, EDIAG1-5, and EXPOSURE for 1000 obs.

Cheers

Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs

Sazed — Wed, 29 Jun 2022 04:34:16 GMT

Thanks @Reeza .

Your code executed fine and the format of the outputted results is what I'm after, however I get less counts compared @andreas_lds code and spot checking against my code.

I'm not sure if this syntax is scanning all diagnosis fields or just the first diagnosis field.

Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs

Ksharp — Wed, 29 Jun 2022 05:15:54 GMT

libname x v9 'c:\temp';

data have;
 set x.rand_icd(keep=diagnosis--ediag5);
run;

proc iml;
use have;
read all var _char_ into x[c=vname];
x=substr(x,1,3);
levels=unique(x);

yn=j(nrow(x),ncol(levels),.);
do i=1 to nrow(x);
 yn[i,]=element(levels,x[i,]);
end;

create yn from yn[c=levels];
append from yn;
close;
quit;

data want;
merge x.rand_icd yn;
run;