BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Sazed
Calcite | Level 5

Hi All,

 

I want to determine the proportion of each ICD10 category that appears in a dataset of approx 100,000 records containing 25 diagnosis variables (character), and compare these between exposed and non-exposed individuals. This would allow me to determine, for example, X% of exposed people have F10 diagnosis and X% of non-exposed people have F10 diagnosis (or any other diagnosis of interest).

 

I'm currently doing this by writing a line of code for each new variable (A to Z by 1 to 99) which equates to 26x99 = 2574 new ICD variables (see code except below). This method is proving to be very slow and take a lot of processing time!

 

data ICD_flags;
set ICD;
array dc{25} diagnosis ediag1-ediag20 ecode1-ecode4;

/*  flag occurrences of all ICD categories */
DO i=1 to 25;
if dc{i} in : ('A01') then A01=1;
if dc{i} in : ('A02') then A02=1;
if dc{i} in : ('A03') then A03=1;
if dc{i} in : ('A04') then A04=1;
if dc{i} in : ('A05') then A05=1;
if dc{i} in : ('A06') then A06=1;
if dc{i} in : ('A07') then A07=1;
if dc{i} in : ('A08') then A08=1;
if dc{i} in : ('A09') then A09=1;
if dc{i} in : ('A10') then A10=1;
if dc{i} in : ('A11') then A11=1;
if dc{i} in : ('A12') then A12=1;
if dc{i} in : ('A13') then A13=1;
if dc{i} in : ('A14') then A14=1;
if dc{i} in : ('A15') then A15=1;
if dc{i} in : ('A16') then A16=1;
...
...
if dc{i} in : ('Z97') then Z97=1;
if dc{i} in : ('Z98') then Z98=1;
if dc{i} in : ('Z99') then Z99=1;
END;
RUN;

Is there a more efficient way of doing this? I thought an alternative could be creating a new two dimensional array where each cell represents a diagnosis category. For example, each row represents ICD letters A to Z, and each column represents code categories 1 to 99. But I don't know how to do this.

 

I hope this makes sense, thanks in advance!

 

1 ACCEPTED SOLUTION

Accepted Solutions
andreas_lds
Jade | Level 19

Creating 2574 seems to be a bad idea, but maybe this is what you need for the next steps. I doubt it. Creating only one category variable and multiple observations could even reduce the code in subsequent steps. In this case, all you would need is something like

category = substr(dc[i], 1, 3);
output;

inside the loop.

 

If you have to create those variables (code is untested):

data ICD_flags;
   set ICD;
   array dc diagnosis ediag1-ediag20 ecode1-ecode4;
   
   length _code_list $ 11000 _code $ 3;
   
   retain _code_list;
   drop _code_list _code i j;
   
   /* fill _code_list, t */
   if _n_ = 1 then do;
      do i = 65 to 65+26; 
         do j = 1 to 99;
            _code = cats(byte(i), put(j, z2.));
            _code_list = catx(' ', _code_list, _code);
         end;
      end;
   end;

   array flags A01-A99 B01-B99 C01-C99 /* !! Won't work without adding all ranges here ... Z01-Z99 */;
   
   do i = 1 to dim(dc);
      if not missing(dc[i]) then do;
         j = findw(_code_list, substr(dc[i], 1, 3), ' ', 'e');         
         if j > 0 then do;
            flags[j] = 1;
         end;
      end;
   end;

RUN

Idea: The _code_list contains the variable names in the same order they have in the array flags, the function findw with option "e" returns the position of the first three chars of the icd in _code_list which is the index of the flag variable in the array.

 

View solution in original post

9 REPLIES 9
andreas_lds
Jade | Level 19

Creating 2574 seems to be a bad idea, but maybe this is what you need for the next steps. I doubt it. Creating only one category variable and multiple observations could even reduce the code in subsequent steps. In this case, all you would need is something like

category = substr(dc[i], 1, 3);
output;

inside the loop.

 

If you have to create those variables (code is untested):

data ICD_flags;
   set ICD;
   array dc diagnosis ediag1-ediag20 ecode1-ecode4;
   
   length _code_list $ 11000 _code $ 3;
   
   retain _code_list;
   drop _code_list _code i j;
   
   /* fill _code_list, t */
   if _n_ = 1 then do;
      do i = 65 to 65+26; 
         do j = 1 to 99;
            _code = cats(byte(i), put(j, z2.));
            _code_list = catx(' ', _code_list, _code);
         end;
      end;
   end;

   array flags A01-A99 B01-B99 C01-C99 /* !! Won't work without adding all ranges here ... Z01-Z99 */;
   
   do i = 1 to dim(dc);
      if not missing(dc[i]) then do;
         j = findw(_code_list, substr(dc[i], 1, 3), ' ', 'e');         
         if j > 0 then do;
            flags[j] = 1;
         end;
      end;
   end;

RUN

Idea: The _code_list contains the variable names in the same order they have in the array flags, the function findw with option "e" returns the position of the first three chars of the icd in _code_list which is the index of the flag variable in the array.

 

Sazed
Calcite | Level 5
Thanks very much for your informative suggestions. I will test out the syntax and report back.
Cheers!
Sazed
Calcite | Level 5
Thanks Andreas. The ICD_flags data step works perfectly and executed in about 5 sec!
I then just ran a proc tab by exposure to compare counts.

One question if you don't mind...where does the '65' in the DO loop come from?

Sazed
Calcite | Level 5
Ok I think its from 65 because the BYTE function maps the values to A-Z in the ASCII collating sequence.
Sazed
Calcite | Level 5

Here is some dummy data containing ID, DIAGNOSIS, EDIAG1-5, and EXPOSURE for 1000 obs.

Cheers 

Reeza
Super User

 

Assuming you have some data with at least the following variables:

 

  • ID for each individual
  • 25 diagnosis codes
  • Indicator for Exposed, non exposed (assume variable = EXPOSED).

 

Then:

  • Transpose to long format so that you have an ID, Exposed, DiagnosisNumber, DiagnosisCode
  • Create a new Diagnosis Variable based on Diagnosis Code, that is only the first three characters - Use SUBSTR()
  • Decide if you need to de-duplicate the records due to a single person/record having multiple diagnosis codes that start with the same string
  • Run a proc freq to see the % differences between the exposed and diagnosisCode. I think you'll be interested in the row percentages from PROC FREQ.

Untested because no data was provided but generic idea:

proc transpose data=have out=long;
by id exposed;
var diagnosis ediag1-ediag20 ecode1-ecode4;
run;

data long_codes;
set long;

diag_code = substr(Col1, 1, 3);
run;

proc sort data=long_codes out=long_unique;
by id diag_code;
run;

proc freq data=long_unique;
table diag_code*exposed /out = want outpercent;
run;

 


@Sazed wrote:

Hi All,

 

I want to determine the proportion of each ICD10 category that appears in a dataset of approx 100,000 records containing 25 diagnosis variables (character), and compare these between exposed and non-exposed individuals. This would allow me to determine, for example, X% of exposed people have F10 diagnosis and X% of non-exposed people have F10 diagnosis (or any other diagnosis of interest).

 

I'm currently doing this by writing a line of code for each new variable (A to Z by 1 to 99) which equates to 26x99 = 2574 new ICD variables (see code except below). This method is proving to be very slow and take a lot of processing time!

 

data ICD_flags;
set ICD;
array dc{25} diagnosis ediag1-ediag20 ecode1-ecode4;

/*  flag occurrences of all ICD categories */
DO i=1 to 25;
if dc{i} in : ('A01') then A01=1;
if dc{i} in : ('A02') then A02=1;
if dc{i} in : ('A03') then A03=1;
if dc{i} in : ('A04') then A04=1;
if dc{i} in : ('A05') then A05=1;
if dc{i} in : ('A06') then A06=1;
if dc{i} in : ('A07') then A07=1;
if dc{i} in : ('A08') then A08=1;
if dc{i} in : ('A09') then A09=1;
if dc{i} in : ('A10') then A10=1;
if dc{i} in : ('A11') then A11=1;
if dc{i} in : ('A12') then A12=1;
if dc{i} in : ('A13') then A13=1;
if dc{i} in : ('A14') then A14=1;
if dc{i} in : ('A15') then A15=1;
if dc{i} in : ('A16') then A16=1;
...
...
if dc{i} in : ('Z97') then Z97=1;
if dc{i} in : ('Z98') then Z98=1;
if dc{i} in : ('Z99') then Z99=1;
END;
RUN;

Is there a more efficient way of doing this? I thought an alternative could be creating a new two dimensional array where each cell represents a diagnosis category. For example, each row represents ICD letters A to Z, and each column represents code categories 1 to 99. But I don't know how to do this.

 

I hope this makes sense, thanks in advance!

 


 

Sazed
Calcite | Level 5

Thanks @Reeza .

Your code executed fine and the format of the outputted results is what I'm after, however I get less counts compared @andreas_lds code and spot checking against my code.

I'm not sure if this syntax is scanning all diagnosis fields or just the first diagnosis field.

 

Ksharp
Super User
libname x v9 'c:\temp';

data have;
 set x.rand_icd(keep=diagnosis--ediag5);
run;

proc iml;
use have;
read all var _char_ into x[c=vname];
x=substr(x,1,3);
levels=unique(x);

yn=j(nrow(x),ncol(levels),.);
do i=1 to nrow(x);
 yn[i,]=element(levels,x[i,]);
end;

create yn from yn[c=levels];
append from yn;
close;
quit;

data want;
merge x.rand_icd yn;
run;

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 2151 views
  • 1 like
  • 4 in conversation