<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820547#M323881</link>
    <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Assuming you have some data with at least the following variables:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;ID for each individual&lt;/LI&gt;
&lt;LI&gt;25 diagnosis codes&lt;/LI&gt;
&lt;LI&gt;Indicator for Exposed, non exposed (assume variable = EXPOSED).&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Then:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Transpose to long format so that you have an ID, Exposed, DiagnosisNumber, DiagnosisCode&lt;/LI&gt;
&lt;LI&gt;Create a new Diagnosis Variable based on Diagnosis Code, that is only the first three characters - Use SUBSTR()&lt;/LI&gt;
&lt;LI&gt;Decide if you need to de-duplicate the records due to a single person/record having multiple diagnosis codes that start with the same string&lt;/LI&gt;
&lt;LI&gt;Run a proc freq to see the % differences between the exposed and diagnosisCode. I think you'll be interested in the row percentages from PROC FREQ.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Untested because no data was provided but generic idea:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc transpose data=have out=long;
by id exposed;
var diagnosis ediag1-ediag20 ecode1-ecode4;
run;

data long_codes;
set long;

diag_code = substr(Col1, 1, 3);
run;

proc sort data=long_codes out=long_unique;
by id diag_code;
run;

proc freq data=long_unique;
table diag_code*exposed /out = want outpercent;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/428481"&gt;@Sazed&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;Hi All,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I want to determine the proportion of&amp;nbsp;&lt;U&gt;each&lt;/U&gt; ICD10 category that appears in a dataset of approx 100,000 records containing 25 diagnosis variables (character), and compare these between exposed and non-exposed individuals. This would allow me to determine, for example, X% of exposed people have F10 diagnosis and X% of non-exposed people have F10 diagnosis (or any other diagnosis of interest).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I'm currently doing this by writing a line of code for each new variable (A to Z by 1 to 99) which equates to 26x99 = 2574 new ICD variables (see code except below). This method is proving to be very slow and take a lot of processing time!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=""&gt;data ICD_flags;
set ICD;
array dc{25} diagnosis ediag1-ediag20 ecode1-ecode4;

/*  flag occurrences of all ICD categories */
DO i=1 to 25;
if dc{i} in : ('A01') then A01=1;
if dc{i} in : ('A02') then A02=1;
if dc{i} in : ('A03') then A03=1;
if dc{i} in : ('A04') then A04=1;
if dc{i} in : ('A05') then A05=1;
if dc{i} in : ('A06') then A06=1;
if dc{i} in : ('A07') then A07=1;
if dc{i} in : ('A08') then A08=1;
if dc{i} in : ('A09') then A09=1;
if dc{i} in : ('A10') then A10=1;
if dc{i} in : ('A11') then A11=1;
if dc{i} in : ('A12') then A12=1;
if dc{i} in : ('A13') then A13=1;
if dc{i} in : ('A14') then A14=1;
if dc{i} in : ('A15') then A15=1;
if dc{i} in : ('A16') then A16=1;
...
...
if dc{i} in : ('Z97') then Z97=1;
if dc{i} in : ('Z98') then Z98=1;
if dc{i} in : ('Z99') then Z99=1;
END;
RUN;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Is there a more efficient way of doing this? I thought an alternative could be creating a new two dimensional array where each cell represents a diagnosis category. For example, each row represents ICD letters A to Z, and each column represents code categories 1 to 99. But I don't know how to do this.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I hope this makes sense, thanks in advance!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Mon, 27 Jun 2022 15:53:47 GMT</pubDate>
    <dc:creator>Reeza</dc:creator>
    <dc:date>2022-06-27T15:53:47Z</dc:date>
    <item>
      <title>Need efficient method for creating new variables to count how often each ICD-10 category occurs</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820457#M323836</link>
      <description>&lt;P&gt;Hi All,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I want to determine the proportion of&amp;nbsp;&lt;U&gt;each&lt;/U&gt; ICD10 category that appears in a dataset of approx 100,000 records containing 25 diagnosis variables (character), and compare these between exposed and non-exposed individuals. This would allow me to determine, for example, X% of exposed people have F10 diagnosis and X% of non-exposed people have F10 diagnosis (or any other diagnosis of interest).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm currently doing this by writing a line of code for each new variable (A to Z by 1 to 99) which equates to 26x99 = 2574 new ICD variables (see code except below). This method is proving to be very slow and take a lot of processing time!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=""&gt;data ICD_flags;
set ICD;
array dc{25} diagnosis ediag1-ediag20 ecode1-ecode4;

/*  flag occurrences of all ICD categories */
DO i=1 to 25;
if dc{i} in : ('A01') then A01=1;
if dc{i} in : ('A02') then A02=1;
if dc{i} in : ('A03') then A03=1;
if dc{i} in : ('A04') then A04=1;
if dc{i} in : ('A05') then A05=1;
if dc{i} in : ('A06') then A06=1;
if dc{i} in : ('A07') then A07=1;
if dc{i} in : ('A08') then A08=1;
if dc{i} in : ('A09') then A09=1;
if dc{i} in : ('A10') then A10=1;
if dc{i} in : ('A11') then A11=1;
if dc{i} in : ('A12') then A12=1;
if dc{i} in : ('A13') then A13=1;
if dc{i} in : ('A14') then A14=1;
if dc{i} in : ('A15') then A15=1;
if dc{i} in : ('A16') then A16=1;
...
...
if dc{i} in : ('Z97') then Z97=1;
if dc{i} in : ('Z98') then Z98=1;
if dc{i} in : ('Z99') then Z99=1;
END;
RUN;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Is there a more efficient way of doing this? I thought an alternative could be creating a new two dimensional array where each cell represents a diagnosis category. For example, each row represents ICD letters A to Z, and each column represents code categories 1 to 99. But I don't know how to do this.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I hope this makes sense, thanks in advance!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 27 Jun 2022 04:48:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820457#M323836</guid>
      <dc:creator>Sazed</dc:creator>
      <dc:date>2022-06-27T04:48:53Z</dc:date>
    </item>
    <item>
      <title>Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820462#M323839</link>
      <description>&lt;P&gt;Creating 2574 seems to be a bad idea, but maybe this is what you need for the next steps. I doubt it. Creating only one category variable and multiple observations could even reduce the code in subsequent steps. In this case, all you would need is something like&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;category = substr(dc[i], 1, 3);
output;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;inside the loop.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you have to create those variables (code is untested):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data ICD_flags;
   set ICD;
   array dc diagnosis ediag1-ediag20 ecode1-ecode4;
   
   length _code_list $ 11000 _code $ 3;
   
   retain _code_list;
   drop _code_list _code i j;
   
   /* fill _code_list, t */
   if _n_ = 1 then do;
      do i = 65 to 65+26; 
         do j = 1 to 99;
            _code = cats(byte(i), put(j, z2.));
            _code_list = catx(' ', _code_list, _code);
         end;
      end;
   end;

   array flags A01-A99 B01-B99 C01-C99 /* !! Won't work without adding all ranges here ... Z01-Z99 */;
   
   do i = 1 to dim(dc);
      if not missing(dc[i]) then do;
         j = findw(_code_list, substr(dc[i], 1, 3), ' ', 'e');         
         if j &amp;gt; 0 then do;
            flags[j] = 1;
         end;
      end;
   end;

RUN&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Idea: The _code_list contains the variable names in the same order they have in the array flags, the function findw with option "e" returns the position of the first three chars of the icd in _code_list which is the index of the flag variable in the array.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 27 Jun 2022 06:09:01 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820462#M323839</guid>
      <dc:creator>andreas_lds</dc:creator>
      <dc:date>2022-06-27T06:09:01Z</dc:date>
    </item>
    <item>
      <title>Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820475#M323842</link>
      <description>Thanks very much for your informative suggestions. I will test out the syntax and report back.&lt;BR /&gt;Cheers!</description>
      <pubDate>Mon, 27 Jun 2022 08:00:15 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820475#M323842</guid>
      <dc:creator>Sazed</dc:creator>
      <dc:date>2022-06-27T08:00:15Z</dc:date>
    </item>
    <item>
      <title>Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820483#M323849</link>
      <description>Could you provide some dummy data ?</description>
      <pubDate>Mon, 27 Jun 2022 09:17:45 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820483#M323849</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2022-06-27T09:17:45Z</dc:date>
    </item>
    <item>
      <title>Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820547#M323881</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Assuming you have some data with at least the following variables:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;ID for each individual&lt;/LI&gt;
&lt;LI&gt;25 diagnosis codes&lt;/LI&gt;
&lt;LI&gt;Indicator for Exposed, non exposed (assume variable = EXPOSED).&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Then:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Transpose to long format so that you have an ID, Exposed, DiagnosisNumber, DiagnosisCode&lt;/LI&gt;
&lt;LI&gt;Create a new Diagnosis Variable based on Diagnosis Code, that is only the first three characters - Use SUBSTR()&lt;/LI&gt;
&lt;LI&gt;Decide if you need to de-duplicate the records due to a single person/record having multiple diagnosis codes that start with the same string&lt;/LI&gt;
&lt;LI&gt;Run a proc freq to see the % differences between the exposed and diagnosisCode. I think you'll be interested in the row percentages from PROC FREQ.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Untested because no data was provided but generic idea:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc transpose data=have out=long;
by id exposed;
var diagnosis ediag1-ediag20 ecode1-ecode4;
run;

data long_codes;
set long;

diag_code = substr(Col1, 1, 3);
run;

proc sort data=long_codes out=long_unique;
by id diag_code;
run;

proc freq data=long_unique;
table diag_code*exposed /out = want outpercent;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/428481"&gt;@Sazed&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;Hi All,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I want to determine the proportion of&amp;nbsp;&lt;U&gt;each&lt;/U&gt; ICD10 category that appears in a dataset of approx 100,000 records containing 25 diagnosis variables (character), and compare these between exposed and non-exposed individuals. This would allow me to determine, for example, X% of exposed people have F10 diagnosis and X% of non-exposed people have F10 diagnosis (or any other diagnosis of interest).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I'm currently doing this by writing a line of code for each new variable (A to Z by 1 to 99) which equates to 26x99 = 2574 new ICD variables (see code except below). This method is proving to be very slow and take a lot of processing time!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=""&gt;data ICD_flags;
set ICD;
array dc{25} diagnosis ediag1-ediag20 ecode1-ecode4;

/*  flag occurrences of all ICD categories */
DO i=1 to 25;
if dc{i} in : ('A01') then A01=1;
if dc{i} in : ('A02') then A02=1;
if dc{i} in : ('A03') then A03=1;
if dc{i} in : ('A04') then A04=1;
if dc{i} in : ('A05') then A05=1;
if dc{i} in : ('A06') then A06=1;
if dc{i} in : ('A07') then A07=1;
if dc{i} in : ('A08') then A08=1;
if dc{i} in : ('A09') then A09=1;
if dc{i} in : ('A10') then A10=1;
if dc{i} in : ('A11') then A11=1;
if dc{i} in : ('A12') then A12=1;
if dc{i} in : ('A13') then A13=1;
if dc{i} in : ('A14') then A14=1;
if dc{i} in : ('A15') then A15=1;
if dc{i} in : ('A16') then A16=1;
...
...
if dc{i} in : ('Z97') then Z97=1;
if dc{i} in : ('Z98') then Z98=1;
if dc{i} in : ('Z99') then Z99=1;
END;
RUN;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Is there a more efficient way of doing this? I thought an alternative could be creating a new two dimensional array where each cell represents a diagnosis category. For example, each row represents ICD letters A to Z, and each column represents code categories 1 to 99. But I don't know how to do this.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I hope this makes sense, thanks in advance!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 27 Jun 2022 15:53:47 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820547#M323881</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2022-06-27T15:53:47Z</dc:date>
    </item>
    <item>
      <title>Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820842#M324024</link>
      <description>Thanks Andreas. The ICD_flags data step works perfectly and executed in about 5 sec!&lt;BR /&gt;I then just ran a proc tab by exposure to compare counts.&lt;BR /&gt;&lt;BR /&gt;One question if you don't mind...where does the '65' in the DO loop come from?&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 29 Jun 2022 02:35:56 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820842#M324024</guid>
      <dc:creator>Sazed</dc:creator>
      <dc:date>2022-06-29T02:35:56Z</dc:date>
    </item>
    <item>
      <title>Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820844#M324025</link>
      <description>Ok I think its from 65 because the BYTE function maps the values to A-Z in the ASCII collating sequence.</description>
      <pubDate>Wed, 29 Jun 2022 02:55:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820844#M324025</guid>
      <dc:creator>Sazed</dc:creator>
      <dc:date>2022-06-29T02:55:18Z</dc:date>
    </item>
    <item>
      <title>Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820853#M324028</link>
      <description>&lt;P&gt;Here is some dummy data containing ID, DIAGNOSIS, EDIAG1-5, and EXPOSURE for 1000 obs.&lt;/P&gt;&lt;P&gt;Cheers&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Jun 2022 04:28:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820853#M324028</guid>
      <dc:creator>Sazed</dc:creator>
      <dc:date>2022-06-29T04:28:06Z</dc:date>
    </item>
    <item>
      <title>Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820857#M324029</link>
      <description>&lt;P&gt;Thanks&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13879"&gt;@Reeza&lt;/a&gt;&amp;nbsp;.&lt;/P&gt;&lt;P&gt;Your code executed fine and the format of the outputted results is what I'm after, however I get less counts compared&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/15475"&gt;@andreas_lds&lt;/a&gt;&amp;nbsp;code and spot checking against my code.&lt;/P&gt;&lt;P&gt;I'm not sure if this syntax is scanning all diagnosis fields or just the first diagnosis field.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Jun 2022 04:34:16 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820857#M324029</guid>
      <dc:creator>Sazed</dc:creator>
      <dc:date>2022-06-29T04:34:16Z</dc:date>
    </item>
    <item>
      <title>Re: Need efficient method for creating new variables to count how often each ICD-10 category occurs</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820861#M324033</link>
      <description>&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;libname x v9 'c:\temp';

data have;
 set x.rand_icd(keep=diagnosis--ediag5);
run;

proc iml;
use have;
read all var _char_ into x[c=vname];
x=substr(x,1,3);
levels=unique(x);

yn=j(nrow(x),ncol(levels),.);
do i=1 to nrow(x);
 yn[i,]=element(levels,x[i,]);
end;

create yn from yn[c=levels];
append from yn;
close;
quit;

data want;
merge x.rand_icd yn;
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Wed, 29 Jun 2022 05:15:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-efficient-method-for-creating-new-variables-to-count-how/m-p/820861#M324033</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2022-06-29T05:15:54Z</dc:date>
    </item>
  </channel>
</rss>

