<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Hash table search for multiple columns in SAS in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/562770#M157692</link>
    <description>&lt;P&gt;Hash tables only use direct key equality matches, not fuzzy matches like a regular expression.&lt;/P&gt;
&lt;P&gt;I am unsure that iterating would be faster than iterating through an array.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I suspect your run time is CPU bound. Is it? Can you show the log report of the data step run?&lt;/P&gt;
&lt;P&gt;Regular expressions are very costly in CPU resources, and if the execution time is limited by the CPU, you should try to apply the regex only after a simpler test.&lt;/P&gt;
&lt;P&gt;Something like&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;if index(cols[I],'F') then
  if prxmatch('/F[0-9].*[123]/', cols[I]) then
   output;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Alter the test to match your data.&lt;/P&gt;
&lt;P&gt;You can also trim the string to speed up the match if you expect spaces at the end.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Fri, 31 May 2019 05:01:37 GMT</pubDate>
    <dc:creator>ChrisNZ</dc:creator>
    <dc:date>2019-05-31T05:01:37Z</dc:date>
    <item>
      <title>Hash table search for multiple columns in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/562765#M157688</link>
      <description>&lt;P&gt;I have a data with about 40M rows. There are 50 columns that I would like to extract strings from them. I have used normal data step with array to perform the task, but it took over 2 hours to finish the extraction.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I know how to use hash table in SAS for simple join or subsetting by specifying a lookup table first. However, I prefer to use regular expression to do the extraction here. The current extraction is using the codes something like below.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;How can I conduct hash table search in those 50 columns in SAS without an lookup table?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;data want;
   set have;
   array cols {*} $ col1 - col50;

   do i = 1 to dim(cols)
      if prxmatch('/F[0-9].*[123]/', cols[i])
         then output;
   end;
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 31 May 2019 04:36:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/562765#M157688</guid>
      <dc:creator>windlove</dc:creator>
      <dc:date>2019-05-31T04:36:43Z</dc:date>
    </item>
    <item>
      <title>Re: Hash table search for multiple columns in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/562770#M157692</link>
      <description>&lt;P&gt;Hash tables only use direct key equality matches, not fuzzy matches like a regular expression.&lt;/P&gt;
&lt;P&gt;I am unsure that iterating would be faster than iterating through an array.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I suspect your run time is CPU bound. Is it? Can you show the log report of the data step run?&lt;/P&gt;
&lt;P&gt;Regular expressions are very costly in CPU resources, and if the execution time is limited by the CPU, you should try to apply the regex only after a simpler test.&lt;/P&gt;
&lt;P&gt;Something like&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;if index(cols[I],'F') then
  if prxmatch('/F[0-9].*[123]/', cols[I]) then
   output;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Alter the test to match your data.&lt;/P&gt;
&lt;P&gt;You can also trim the string to speed up the match if you expect spaces at the end.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 31 May 2019 05:01:37 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/562770#M157692</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2019-05-31T05:01:37Z</dc:date>
    </item>
    <item>
      <title>Re: Hash table search for multiple columns in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/562771#M157693</link>
      <description>&lt;P&gt;Thanks for the reply. The issue is it is not only the F: code I am going to extract. Without using regular expression, I have to write down about 200 codes including their variations. So regular expression would be a necessary and easiest step in terms of string expression.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I don't have the "index" line as you suggested, I will try to test it and report back later.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 31 May 2019 05:18:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/562771#M157693</guid>
      <dc:creator>windlove</dc:creator>
      <dc:date>2019-05-31T05:18:55Z</dc:date>
    </item>
    <item>
      <title>Re: Hash table search for multiple columns in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/562772#M157694</link>
      <description>&lt;BLOCKQUOTE&gt;
&lt;P&gt;Thanks for the reply. The issue is it is not only the F: code I am going to extract. Without using regular expression, I have to write down about 200 codes including their variations. So regular expression would be a necessary and easiest step in terms of string expression.&amp;nbsp;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Read my reply again, I think you missed my point.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 31 May 2019 05:28:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/562772#M157694</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2019-05-31T05:28:57Z</dc:date>
    </item>
    <item>
      <title>Re: Hash table search for multiple columns in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/562787#M157702</link>
      <description>Could you match on multiple variables?  If so, would you want to output the same observation multiple times?  If not, you could cut the time roughly in half by ending the DO loop once a match has been found.</description>
      <pubDate>Fri, 31 May 2019 07:16:14 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/562787#M157702</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2019-05-31T07:16:14Z</dc:date>
    </item>
    <item>
      <title>Re: Hash table search for multiple columns in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/562835#M157721</link>
      <description>&lt;P&gt;Using Perl&amp;nbsp;&lt;SPAN&gt;&amp;nbsp;regular expression is not efficient way.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;The most efficient way is using native SAS function:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;if cols[i] in:  ('F0' 'F1' ..... 'F9') and&lt;BR /&gt;   substr( cols[i],length(cols[i])-1 ) in ('1' '2' '3')
         then output;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 31 May 2019 13:40:58 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/562835#M157721</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2019-05-31T13:40:58Z</dc:date>
    </item>
    <item>
      <title>Re: Hash table search for multiple columns in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/562863#M157734</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/133203"&gt;@windlove&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I agree with&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/18408"&gt;@Ksharp&lt;/a&gt;&amp;nbsp;that simpler SAS functions perform better than PRXMATCH in this case. However, my interpretation of your regular expression suggested a somewhat more involved algorithm. Still, it was about 3.6 times faster (17 s vs. 61 s) than the regex approach on my test dataset with &lt;STRONG&gt;4&lt;/STRONG&gt; million observations (see log under the spoiler). Note that the run time will depend on how many times (i.e. 0, ..., 50) the pattern is found per observation. In my test dataset this number was about 5.4 on average. If this was comparable to your dataset (which does not seem to be the case, given your run time), your run times should be about 10 times higher.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Create test data for demonstration (only 4 million obs.) */

data have;
call streaminit(27182818);
array col[50] $;
do id=1 to 4e6;
  do _n_=1 to dim(col);
    col[_n_]=put(rand('uniform',2**32),hex8.);
  end;
  output;
end;
run;

/* Original approach using PRXMATCH */

data want0;
set have;
array cols[*] $ col1-col50;
do i=1 to dim(cols);
  if prxmatch('/F[0-9].*[123]/o', cols[i]) then output;
end;
run;

/* Alternative approach using FINDC function */

data want(drop=_:);
set have;
array cols[*] $ col1-col50;
do i=1 to dim(cols);
  _pos=findc(cols[i],'F');
  do while(0&amp;lt;_pos&amp;lt;7);
    _pos+1;
    _c=char(cols[i],_pos);
    if _c='F' then continue;
    else if '0'&amp;lt;=_c&amp;lt;='9' then do;
      if findc(cols[i],'123',_pos+1) then output;
      leave;
    end;
    else do;
      _pos=findc(cols[i],'F',_pos+1);
      continue;
    end;
  end;
end;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Of course, WANT and WANT0 are identical.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Edit: My test dataset contains strings of length &lt;STRONG&gt;8&lt;/STRONG&gt; in col1-col50. If your variables are longer (or shorter), the upper bound in the WHILE condition &lt;FONT face="courier new,courier"&gt;0&amp;lt;_pos&amp;lt;7&lt;/FONT&gt; should be the length minus 1 (instead of 7) or, to be more flexible, use the actual length minus 1:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;do while(0&amp;lt;_pos&amp;lt;length(cols[i])-1);&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Log:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-SPOILER&gt;
&lt;PRE&gt;1    /* Create test data for demonstration (only 4 million obs.) */
2
3    data have;
4    call streaminit(27182818);
5    array col[50] $;
6    do id=1 to 4e6;
7      do _n_=1 to dim(col);
8        col[_n_]=put(rand('uniform',2**32),hex8.);
9      end;
10     output;
11   end;
12   run;

NOTE: The data set WORK.HAVE has 4000000 observations and 51 variables.
NOTE: DATA statement used (Total process time):
      real time           14.70 seconds
      cpu time            14.68 seconds


13
14   /* Original approach using PRXMATCH */
15
16   data want0;
17   set have;
18   array cols[*] $ col1-col50;
19   do i=1 to dim(cols);
20     if prxmatch('/F[0-9].*[123]/o', cols[i]) then output;
21   end;
22   run;

NOTE: There were 4000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT0 has 21774150 observations and 52 variables.
NOTE: DATA statement used (Total process time):
      real time           1:00.98
      cpu time            1:00.96


23
24   /* Alternative approach using FINDC function */
25
26   data want(drop=_:);
27   set have;
28   array cols[*] $ col1-col50;
29   do i=1 to dim(cols);
30     _pos=findc(cols[i],'F');
31     do while(0&amp;lt;_pos&amp;lt;7);
32       _pos+1;
33       _c=char(cols[i],_pos);
34       if _c='F' then continue;
35       else if '0'&amp;lt;=_c&amp;lt;='9' then do;
36         if findc(cols[i],'123',_pos+1) then output;
37         leave;
38       end;
39       else do;
40         _pos=findc(cols[i],'F',_pos+1);
41         continue;
42       end;
43     end;
44   end;
45   run;

NOTE: There were 4000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 21774150 observations and 52 variables.
NOTE: DATA statement used (Total process time):
      real time           17.06 seconds
      cpu time            17.06 seconds&lt;/PRE&gt;
&lt;/LI-SPOILER&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 31 May 2019 14:49:31 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/562863#M157734</guid>
      <dc:creator>FreelanceReinh</dc:creator>
      <dc:date>2019-05-31T14:49:31Z</dc:date>
    </item>
    <item>
      <title>Re: Hash table search for multiple columns in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/562894#M157746</link>
      <description>&lt;P&gt;Why do you have 50 columns to check?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If they have the same type of information why not restructure those 50 variables into 50 (up to 50) observations instead?&lt;/P&gt;
&lt;P&gt;Depending on other features of you data you might actually find that the vertical structure takes less space.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Do really want to make replicates of the whole observation when the pattern is matched in multiple variables?&amp;nbsp; You could end up with output that is 50 times the size of the input.&lt;/P&gt;</description>
      <pubDate>Fri, 31 May 2019 16:41:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/562894#M157746</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2019-05-31T16:41:25Z</dc:date>
    </item>
    <item>
      <title>Re: Hash table search for multiple columns in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563082#M157804</link>
      <description>&lt;P&gt;Thanks for all the responses. It's not only the F codes I want to search, but the ones below. If without using the regular expression, it would be quite messy. And for all 50 columns, I want all the columns returned as long as the column contains any of the codes, rather than return at first match.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;F.*|G30.*[0189]|G47.[01289]|O99.*[3]|&lt;BR /&gt;R44.*[0-38]|R45.*[014]|R48.*[0128]|&lt;BR /&gt;Z00.*4|Z03.*2|Z04.*6|Z09.*3|Z13.*3|&lt;BR /&gt;Z50.*[23]|Z54.*3|Z61.*9|Z63.*[189]|&lt;BR /&gt;Z65.*[89]|Z71.*[45]|Z76.*0|&lt;BR /&gt;Q90.*[0-29]|Q91.*[0-7]|&lt;BR /&gt;Q93.*|Q99.*2|P04.*3|Q86.*[0128]|&lt;BR /&gt;Q87.*[0-358]|Q89.*8|&lt;BR /&gt;X[67][0-9].*|X8[0-4].*|R45.*81|&lt;BR /&gt;K70.*|Y9[01].*|T51.*|X45.*|Y15.*|&lt;BR /&gt;R78.*0|Z72.*1|T40.*7|T43.*6[0129]&lt;/P&gt;</description>
      <pubDate>Sat, 01 Jun 2019 11:25:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563082#M157804</guid>
      <dc:creator>windlove</dc:creator>
      <dc:date>2019-06-01T11:25:39Z</dc:date>
    </item>
    <item>
      <title>Re: Hash table search for multiple columns in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563083#M157805</link>
      <description>Are these the expressions you use? &lt;BR /&gt;</description>
      <pubDate>Sat, 01 Jun 2019 11:36:58 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563083#M157805</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2019-06-01T11:36:58Z</dc:date>
    </item>
    <item>
      <title>Re: Hash table search for multiple columns in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563087#M157806</link>
      <description>&lt;P&gt;The very first component expression, &lt;FONT face="courier new,courier"&gt;F.*&lt;/FONT&gt;, and several others end with &lt;FONT face="courier new,courier"&gt;.*&lt;/FONT&gt;&amp;nbsp;-- this is redundant. I suspect that you mean "codes that &lt;EM&gt;start with&lt;/EM&gt;&amp;nbsp;F" and not "codes that&amp;nbsp;&lt;EM&gt;contain&lt;/EM&gt; an F". These criteria might be equivalent for your codes in terms of search results (e.g. if all codes are single letters followed by a sequence of digits), but not performance wise. The appropriate regex would be &lt;FONT face="courier new,courier"&gt;&lt;STRONG&gt;^&lt;/STRONG&gt;F&lt;/FONT&gt;, but a traditional SAS criterion such as &lt;FONT face="courier new,courier"&gt;cols[i]&lt;STRONG&gt;=:'F'&lt;/STRONG&gt;&lt;/FONT&gt; would still perform much better.&lt;/P&gt;</description>
      <pubDate>Sat, 01 Jun 2019 12:58:49 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563087#M157806</guid>
      <dc:creator>FreelanceReinh</dc:creator>
      <dc:date>2019-06-01T12:58:49Z</dc:date>
    </item>
    <item>
      <title>Re: Hash table search for multiple columns in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563090#M157807</link>
      <description>&lt;P&gt;When you output an observation, you are outputting all the columns (both those that match and those that don't match).&amp;nbsp; If you find a second match, and output the observation again, you are outputting all the same columns a second time.&amp;nbsp; Did you want to change that result?&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 01 Jun 2019 13:59:48 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563090#M157807</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2019-06-01T13:59:48Z</dc:date>
    </item>
    <item>
      <title>Re: Hash table search for multiple columns in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563097#M157809</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/133203"&gt;@windlove&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The following program should give you improved performance while keeping the coding very manageable. This is a compromise and more performance improvements would be possible but I believe would potentially require a lot of extra coding.&lt;/P&gt;
&lt;P&gt;I've made the assumption that your patterns always must match starting from source string position 1.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have;
  col2='F09324823O'; OUTPUT;
  col2='G30XXXXXXX'; OUTPUT;
  col2='G30XXXX9XX'; OUTPUT;
  col2='O99XXX4XX5'; OUTPUT;
  col2='O99XXX4X3X'; OUTPUT;
run;

data prxPattern;
  infile datalines truncover dlm=',';
  input prxKey :$1. prxPattern :$40.;
  datalines4;
F,/^F/
G,/^G30.*[0189]/
G,/^G47.[01289]/
O,/^O99.*[3]/
R,/^R44.*[0-38]/
R,/^R45.*[014]/
R,/^R48.*[0128]/
;;;;

/* &amp;amp;max_len: max length of variables used in array */
%let max_len=;
proc sql noprint;
  select max(length) into :max_len trimmed
  from dictionary.columns
  where libname='WORK' and memname='HAVE'
        and upcase(name) like 'COL%'
  ;
quit;
%put &amp;amp;=max_len;


data want(keep=id prxPattern var_name var_value compress=yes);

  set have(keep=col:);
  array cols {*} $ col:;
  id=_n_; /* ideally don't create ID but add an existing primary key variable to the keep statements */
  length var_name $32 ; 
  length var_value $&amp;amp;max_len; /* &amp;amp;max_len: max length of variables used in array */

  if _n_=1 then
    do;
      if 0 then set prxPattern;
      length prxID 3;
      dcl hash h1(multidata:'Y', hashexp:5);
      h1.defineKey('prxKey');
      h1.defineData('prxID','prxPattern');
      h1.defineDone();
      /* compile RegEx and load pointer (prxID) into hash */
      do while(not last);
        set prxPattern end=last;
        prxID=prxparse(prxPattern);
        h1.add();
      end;
    end;

  do _i=1 to dim(cols);
    prxKey=substrn(cols[_i],1,1);
    h1.reset_dup();
    do while(h1.do_over()=0);
      if prxmatch(prxID,trim(cols[_i])) then
        do;
          var_name=vname(cols[_i]);
          var_value=cols[_i];
          output;
          leave;
        end;
    end;
  end;

run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;N.B: The assumption that the pattern must match from position one not only allowed to define the "key" and though reduce the number of pattern tests per string, it also allowed to amend the RegEx with a ^ so that the search doesn't look for a pattern match in the middle of a string. These two measures should improve performance quite a bit as compared to your initial code.&lt;/P&gt;</description>
      <pubDate>Sun, 02 Jun 2019 02:19:56 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563097#M157809</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2019-06-02T02:19:56Z</dc:date>
    </item>
    <item>
      <title>Re: Hash table search for multiple columns in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563189#M157843</link>
      <description>&lt;P&gt;Thanks for all the responses.&amp;nbsp; First, I think I made a mistake in my original code, which should add "leave" after "output" inside "do" loop, so that it won't return duplicated rows.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Second, I have changed my strategy. I isolated those 50 columns from the main data, and transposed them into a single variable (not using proc transpose due to missing values for each row, otherwise proc transpose immediately gave insufficient memory error). Then I de-duplicated this variable and extracted the codes I want to form a look-up table.&amp;nbsp; Then I defined hash object to subset the data that contains those codes.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Now the subsetting only takes less than 40 mins with hash table. And the previous steps only need to be run once. If any code needs to be changed, I only need to extract the codes I want from that single variable, which only takes a second + hash table subsetting.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;However, within hash table, I still use "do" loop. If anyone can suggest something else to further improve the performance, that would be great. Here are the hash table subsetting below.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE class=" language-sas"&gt;&lt;CODE class="  language-sas"&gt;&lt;SPAN class="token procnames"&gt;data&lt;/SPAN&gt; want&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;

   &lt;SPAN class="token keyword"&gt;if&lt;/SPAN&gt; &lt;SPAN class="token number"&gt;0&lt;/SPAN&gt; &lt;SPAN class="token keyword"&gt;then&lt;/SPAN&gt; &lt;SPAN class="token keyword"&gt;set&lt;/SPAN&gt; have&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;

   &lt;SPAN class="token function"&gt;length&lt;/SPAN&gt; codes &lt;SPAN class="token punctuation"&gt;$&lt;/SPAN&gt;&lt;SPAN class="token number"&gt;8&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
		
   &lt;SPAN class="token keyword"&gt;if&lt;/SPAN&gt; _n_ &lt;SPAN class="token operator"&gt;=&lt;/SPAN&gt; &lt;SPAN class="token number"&gt;1&lt;/SPAN&gt; &lt;SPAN class="token keyword"&gt;then&lt;/SPAN&gt; do&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
   &lt;SPAN class="token keyword"&gt;declare&lt;/SPAN&gt; hash mh&lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;dataset: &lt;SPAN class="token string"&gt;'mh_codes'&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
		mh&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;defineKey&lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token string"&gt;'codes'&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
		mh&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;defineDone&lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
	   call &lt;SPAN class="token function"&gt;missing&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;codes&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
   end&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;

   do until&lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;last&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
      &lt;SPAN class="token keyword"&gt;set&lt;/SPAN&gt; have end &lt;SPAN class="token operator"&gt;=&lt;/SPAN&gt; last&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
      &lt;SPAN class="token statement"&gt;array&lt;/SPAN&gt; cols &lt;SPAN class="token punctuation"&gt;{&lt;/SPAN&gt;&lt;SPAN class="token operator"&gt;*&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;}&lt;/SPAN&gt; &lt;SPAN class="token punctuation"&gt;$&lt;/SPAN&gt; col1 &lt;SPAN class="token operator"&gt;-&lt;/SPAN&gt; col50&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
      do i &lt;SPAN class="token operator"&gt;=&lt;/SPAN&gt; &lt;SPAN class="token number"&gt;1&lt;/SPAN&gt; to &lt;SPAN class="token function"&gt;dim&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;cols&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
         &lt;SPAN class="token keyword"&gt;if&lt;/SPAN&gt; &lt;SPAN class="token operator"&gt;~&lt;/SPAN&gt;mh&lt;SPAN class="token punctuation"&gt;.&lt;/SPAN&gt;&lt;SPAN class="token function"&gt;find&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;(&lt;/SPAN&gt;key: cols&lt;SPAN class="token punctuation"&gt;[&lt;/SPAN&gt;i&lt;SPAN class="token punctuation"&gt;]&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;)&lt;/SPAN&gt; &lt;SPAN class="token keyword"&gt;then&lt;/SPAN&gt; do&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
            output&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
	    leave&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
         end&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
      end&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;   end&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;

   stop&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
&lt;SPAN class="token procnames"&gt;run&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 03 Jun 2019 06:51:15 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563189#M157843</guid>
      <dc:creator>windlove</dc:creator>
      <dc:date>2019-06-03T06:51:15Z</dc:date>
    </item>
    <item>
      <title>Re: Hash table search for multiple columns in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563190#M157844</link>
      <description>&lt;P&gt;Is this faster?&lt;/P&gt;
&lt;P&gt;Also note the 4 comments:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data want;
   if 0 then set have;
   array cols {*} $ col1 - col50;
   length codes $8;           %**** 1 - No dot here, it is not a format;  
   if _n_ = 1 then do;
      declare hash mh(dataset: 'mh_codes');
      mh.defineKey('codes');
      mh.defineDone();
      call missing(codes);    %***** 2 - Unneeded;
   end;

   do until(last);
      set have end = last;
      if 0 
        %macro loop; %local i; %do i = 1 %to 50;
          | ~mh.check(key: cols[&amp;amp;i])  %**** 3 - Check() is faster than find ;
        %end; %mend; %loop&lt;BR /&gt;      
       then output;  
   end;

   stop;                 %***** 4 - Unneeded;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 03 Jun 2019 22:09:47 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563190#M157844</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2019-06-03T22:09:47Z</dc:date>
    </item>
    <item>
      <title>Re: Hash table search for multiple columns in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563209#M157856</link>
      <description>&lt;P&gt;Thanks so much. The speed is further improved. The first time run was about 28 mins, and the second time run was about 17 mins.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;However, do you mind telling me what's the purpose of the slash '|' before '~mh.check()'?&lt;/P&gt;</description>
      <pubDate>Mon, 03 Jun 2019 09:39:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563209#M157856</guid>
      <dc:creator>windlove</dc:creator>
      <dc:date>2019-06-03T09:39:19Z</dc:date>
    </item>
    <item>
      <title>Re: Hash table search for multiple columns in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563212#M157858</link>
      <description>|  is the OR operator written differently. &lt;BR /&gt;</description>
      <pubDate>Mon, 03 Jun 2019 09:51:59 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563212#M157858</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2019-06-03T09:51:59Z</dc:date>
    </item>
    <item>
      <title>Re: Hash table search for multiple columns in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563214#M157859</link>
      <description>&lt;P&gt;What I meant is , how does this code work, why OR here, why not IF?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 03 Jun 2019 09:58:02 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563214#M157859</guid>
      <dc:creator>windlove</dc:creator>
      <dc:date>2019-06-03T09:58:02Z</dc:date>
    </item>
    <item>
      <title>Re: Hash table search for multiple columns in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563400#M157941</link>
      <description>&lt;P&gt;Back at the computer now. I did a quick test and for me, the IF OR method is about twice as slow as the DO&amp;nbsp; loop.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data HAVE MH_CODES(keep=COL25 rename=(COL25=CODES));
  array COL[1:50];
  do I=1 to 1e4;
    do C=1 to 50;
      COL[C]=I*C;
      output HAVE;
      if ranuni(1)&amp;gt;.5 then output MH_CODES;
     end;
  end;
run;
      
data WANT;   *  6.28 seconds;
  array COLS [*] COL1 - COL50;
  if _N_ = 1 then do;
      declare hash MH(dataset: 'MH_CODES');
      MH.defineKey('CODES');
      MH.defineDone();
      call missing(CODES);
   end;
   do until(LAST);
      set HAVE end = LAST;
      if    %macro loop; %local i; %do i = 1 %to 50;
        ~MH.check(key: COLS[&amp;amp;i])  OR   
            %end; %mend; %loop      
        0 then output;
   end;
run;

data WANT;   *  3.69 seconds;
  array COLS [*] COL1 - COL50;
  if _N_ = 1 then do;
    declare hash MH(dataset: 'MH_CODES');
    MH.defineKey('CODES');
    MH.defineDone();
    call missing(CODES);
  end;
  do until(LAST);
    set HAVE end = LAST;
    do I = 1 to 50;
      if ~MH.check(key: COLS[I]) then do;
        output;
        leave;
      end;  
    end;  
  end;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I had and extraneous leave statement in my code, which I removed. This might explain your different results. Please check again.&lt;/P&gt;
&lt;P&gt;Using check() should definitely save time though.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 03 Jun 2019 22:08:14 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563400#M157941</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2019-06-03T22:08:14Z</dc:date>
    </item>
    <item>
      <title>Re: Hash table search for multiple columns in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563415#M157948</link>
      <description>&lt;P&gt;Hi Chris, thanks for the further help on this. I am working on a remote server, so normally during the business hours, the server could be very busy. Applying the codes on my data, the earlier codes with %macro took about 25 mins to complete the subsetting, and the new codes&amp;nbsp; with "IF" statement without %macro took about 38 mins. Offcoz it is also based on the server performance at the moment of running.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;However, the earlier codes only took about 4 mins to complete the task during off peak hours last night. It is remarkable I would say. Thanks very much for your help. I will accept your latest response as the solution.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 04 Jun 2019 01:33:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-table-search-for-multiple-columns-in-SAS/m-p/563415#M157948</guid>
      <dc:creator>windlove</dc:creator>
      <dc:date>2019-06-04T01:33:53Z</dc:date>
    </item>
  </channel>
</rss>

