Solved: Need efficient method for creating new variables to count how often ea...

Sazed · Posted 06-27-2022 12:48 AM

Hi All,

I want to determine the proportion of each ICD10 category that appears in a dataset of approx 100,000 records containing 25 diagnosis variables (character), and compare these between exposed and non-exposed individuals. This would allow me to determine, for example, X% of exposed people have F10 diagnosis and X% of non-exposed people have F10 diagnosis (or any other diagnosis of interest).

I'm currently doing this by writing a line of code for each new variable (A to Z by 1 to 99) which equates to 26x99 = 2574 new ICD variables (see code except below). This method is proving to be very slow and take a lot of processing time!

data ICD_flags;
set ICD;
array dc{25} diagnosis ediag1-ediag20 ecode1-ecode4;

/*  flag occurrences of all ICD categories */
DO i=1 to 25;
if dc{i} in : ('A01') then A01=1;
if dc{i} in : ('A02') then A02=1;
if dc{i} in : ('A03') then A03=1;
if dc{i} in : ('A04') then A04=1;
if dc{i} in : ('A05') then A05=1;
if dc{i} in : ('A06') then A06=1;
if dc{i} in : ('A07') then A07=1;
if dc{i} in : ('A08') then A08=1;
if dc{i} in : ('A09') then A09=1;
if dc{i} in : ('A10') then A10=1;
if dc{i} in : ('A11') then A11=1;
if dc{i} in : ('A12') then A12=1;
if dc{i} in : ('A13') then A13=1;
if dc{i} in : ('A14') then A14=1;
if dc{i} in : ('A15') then A15=1;
if dc{i} in : ('A16') then A16=1;
...
...
if dc{i} in : ('Z97') then Z97=1;
if dc{i} in : ('Z98') then Z98=1;
if dc{i} in : ('Z99') then Z99=1;
END;
RUN;

Is there a more efficient way of doing this? I thought an alternative could be creating a new two dimensional array where each cell represents a diagnosis category. For example, each row represents ICD letters A to Z, and each column represents code categories 1 to 99. But I don't know how to do this.

I hope this makes sense, thanks in advance!

andreas_lds · Posted 06-27-2022 02:09 AM

Creating 2574 seems to be a bad idea, but maybe this is what you need for the next steps. I doubt it. Creating only one category variable and multiple observations could even reduce the code in subsequent steps. In this case, all you would need is something like

category = substr(dc[i], 1, 3);
output;

inside the loop.

If you have to create those variables (code is untested):

data ICD_flags;
   set ICD;
   array dc diagnosis ediag1-ediag20 ecode1-ecode4;
   
   length _code_list $ 11000 _code $ 3;
   
   retain _code_list;
   drop _code_list _code i j;
   
   /* fill _code_list, t */
   if _n_ = 1 then do;
      do i = 65 to 65+26; 
         do j = 1 to 99;
            _code = cats(byte(i), put(j, z2.));
            _code_list = catx(' ', _code_list, _code);
         end;
      end;
   end;

   array flags A01-A99 B01-B99 C01-C99 /* !! Won't work without adding all ranges here ... Z01-Z99 */;
   
   do i = 1 to dim(dc);
      if not missing(dc[i]) then do;
         j = findw(_code_list, substr(dc[i], 1, 3), ' ', 'e');         
         if j > 0 then do;
            flags[j] = 1;
         end;
      end;
   end;

RUN

Idea: The _code_list contains the variable names in the same order they have in the array flags, the function findw with option "e" returns the position of the first three chars of the icd in _code_list which is the index of the flag variable in the array.

View solution in original post

andreas_lds · Posted 06-27-2022 02:09 AM

Creating 2574 seems to be a bad idea, but maybe this is what you need for the next steps. I doubt it. Creating only one category variable and multiple observations could even reduce the code in subsequent steps. In this case, all you would need is something like

category = substr(dc[i], 1, 3);
output;

inside the loop.

If you have to create those variables (code is untested):

data ICD_flags;
   set ICD;
   array dc diagnosis ediag1-ediag20 ecode1-ecode4;
   
   length _code_list $ 11000 _code $ 3;
   
   retain _code_list;
   drop _code_list _code i j;
   
   /* fill _code_list, t */
   if _n_ = 1 then do;
      do i = 65 to 65+26; 
         do j = 1 to 99;
            _code = cats(byte(i), put(j, z2.));
            _code_list = catx(' ', _code_list, _code);
         end;
      end;
   end;

   array flags A01-A99 B01-B99 C01-C99 /* !! Won't work without adding all ranges here ... Z01-Z99 */;
   
   do i = 1 to dim(dc);
      if not missing(dc[i]) then do;
         j = findw(_code_list, substr(dc[i], 1, 3), ' ', 'e');         
         if j > 0 then do;
            flags[j] = 1;
         end;
      end;
   end;

RUN

Idea: The _code_list contains the variable names in the same order they have in the array flags, the function findw with option "e" returns the position of the first three chars of the icd in _code_list which is the index of the flag variable in the array.

Sazed · Posted 06-27-2022 04:00 AM

Thanks very much for your informative suggestions. I will test out the syntax and report back.
Cheers!

Sazed · Posted 06-28-2022 10:35 PM

Thanks Andreas. The ICD_flags data step works perfectly and executed in about 5 sec!
I then just ran a proc tab by exposure to compare counts.

One question if you don't mind...where does the '65' in the DO loop come from?

Sazed · Posted 06-28-2022 10:55 PM

Ok I think its from 65 because the BYTE function maps the values to A-Z in the ASCII collating sequence.

Ksharp · Posted 06-27-2022 05:17 AM

Could you provide some dummy data ?

Sazed · Posted 06-29-2022 12:28 AM

Here is some dummy data containing ID, DIAGNOSIS, EDIAG1-5, and EXPOSURE for 1000 obs.

Cheers

Reeza · Posted 06-27-2022 11:53 AM

Assuming you have some data with at least the following variables:

ID for each individual
25 diagnosis codes
Indicator for Exposed, non exposed (assume variable = EXPOSED).

Then:

Transpose to long format so that you have an ID, Exposed, DiagnosisNumber, DiagnosisCode
Create a new Diagnosis Variable based on Diagnosis Code, that is only the first three characters - Use SUBSTR()
Decide if you need to de-duplicate the records due to a single person/record having multiple diagnosis codes that start with the same string
Run a proc freq to see the % differences between the exposed and diagnosisCode. I think you'll be interested in the row percentages from PROC FREQ.

Untested because no data was provided but generic idea:

proc transpose data=have out=long;
by id exposed;
var diagnosis ediag1-ediag20 ecode1-ecode4;
run;

data long_codes;
set long;

diag_code = substr(Col1, 1, 3);
run;

proc sort data=long_codes out=long_unique;
by id diag_code;
run;

proc freq data=long_unique;
table diag_code*exposed /out = want outpercent;
run;

@Sazed wrote:

Hi All,

I want to determine the proportion of each ICD10 category that appears in a dataset of approx 100,000 records containing 25 diagnosis variables (character), and compare these between exposed and non-exposed individuals. This would allow me to determine, for example, X% of exposed people have F10 diagnosis and X% of non-exposed people have F10 diagnosis (or any other diagnosis of interest).

I'm currently doing this by writing a line of code for each new variable (A to Z by 1 to 99) which equates to 26x99 = 2574 new ICD variables (see code except below). This method is proving to be very slow and take a lot of processing time!
data ICD_flags;
set ICD;
array dc{25} diagnosis ediag1-ediag20 ecode1-ecode4;

/*  flag occurrences of all ICD categories */
DO i=1 to 25;
if dc{i} in : ('A01') then A01=1;
if dc{i} in : ('A02') then A02=1;
if dc{i} in : ('A03') then A03=1;
if dc{i} in : ('A04') then A04=1;
if dc{i} in : ('A05') then A05=1;
if dc{i} in : ('A06') then A06=1;
if dc{i} in : ('A07') then A07=1;
if dc{i} in : ('A08') then A08=1;
if dc{i} in : ('A09') then A09=1;
if dc{i} in : ('A10') then A10=1;
if dc{i} in : ('A11') then A11=1;
if dc{i} in : ('A12') then A12=1;
if dc{i} in : ('A13') then A13=1;
if dc{i} in : ('A14') then A14=1;
if dc{i} in : ('A15') then A15=1;
if dc{i} in : ('A16') then A16=1;
...
...
if dc{i} in : ('Z97') then Z97=1;
if dc{i} in : ('Z98') then Z98=1;
if dc{i} in : ('Z99') then Z99=1;
END;
RUN;
Is there a more efficient way of doing this? I thought an alternative could be creating a new two dimensional array where each cell represents a diagnosis category. For example, each row represents ICD letters A to Z, and each column represents code categories 1 to 99. But I don't know how to do this.

I hope this makes sense, thanks in advance!

Sazed · Posted 06-29-2022 12:34 AM

Thanks @Reeza .

Your code executed fine and the format of the outputted results is what I'm after, however I get less counts compared @andreas_lds code and spot checking against my code.

I'm not sure if this syntax is scanning all diagnosis fields or just the first diagnosis field.

Ksharp · Posted 06-29-2022 01:15 AM

libname x v9 'c:\temp';

data have;
 set x.rand_icd(keep=diagnosis--ediag5);
run;

proc iml;
use have;
read all var _char_ into x[c=vname];
x=substr(x,1,3);
levels=unique(x);

yn=j(nrow(x),ncol(levels),.);
do i=1 to nrow(x);
 yn[i,]=element(levels,x[i,]);
end;

create yn from yn[c=levels];
append from yn;
close;
quit;

data want;
merge x.rand_icd yn;
run;

Need efficient method for creating new variables to count how often each ICD-10 category occurs

Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs

Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs

Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs

Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs

Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs

Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs

Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs

Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs

Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs

Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs

Classroom Training Available!