topic Re: Hash table search for multiple columns in SAS in SAS Programming

Hash table search for multiple columns in SAS

windlove — Fri, 31 May 2019 04:36:43 GMT

I have a data with about 40M rows. There are 50 columns that I would like to extract strings from them. I have used normal data step with array to perform the task, but it took over 2 hours to finish the extraction.

I know how to use hash table in SAS for simple join or subsetting by specifying a lookup table first. However, I prefer to use regular expression to do the extraction here. The current extraction is using the codes something like below.

How can I conduct hash table search in those 50 columns in SAS without an lookup table?

data want;
   set have;
   array cols {*} $ col1 - col50;

   do i = 1 to dim(cols)
      if prxmatch('/F[0-9].*[123]/', cols[i])
         then output;
   end;
run;

Re: Hash table search for multiple columns in SAS

ChrisNZ — Fri, 31 May 2019 05:01:37 GMT

Hash tables only use direct key equality matches, not fuzzy matches like a regular expression.

I am unsure that iterating would be faster than iterating through an array.

I suspect your run time is CPU bound. Is it? Can you show the log report of the data step run?

Regular expressions are very costly in CPU resources, and if the execution time is limited by the CPU, you should try to apply the regex only after a simpler test.

Something like

if index(cols[I],'F') then
  if prxmatch('/F[0-9].*[123]/', cols[I]) then
   output;

Alter the test to match your data.

You can also trim the string to speed up the match if you expect spaces at the end.

Re: Hash table search for multiple columns in SAS

windlove — Fri, 31 May 2019 05:18:55 GMT

Thanks for the reply. The issue is it is not only the F: code I am going to extract. Without using regular expression, I have to write down about 200 codes including their variations. So regular expression would be a necessary and easiest step in terms of string expression.

I don't have the "index" line as you suggested, I will try to test it and report back later.

Re: Hash table search for multiple columns in SAS

ChrisNZ — Fri, 31 May 2019 05:28:57 GMT

Thanks for the reply. The issue is it is not only the F: code I am going to extract. Without using regular expression, I have to write down about 200 codes including their variations. So regular expression would be a necessary and easiest step in terms of string expression.

Read my reply again, I think you missed my point.

Re: Hash table search for multiple columns in SAS

Astounding — Fri, 31 May 2019 07:16:14 GMT

Could you match on multiple variables? If so, would you want to output the same observation multiple times? If not, you could cut the time roughly in half by ending the DO loop once a match has been found.

Re: Hash table search for multiple columns in SAS

Ksharp — Fri, 31 May 2019 13:40:58 GMT

Using Perl regular expression is not efficient way.

The most efficient way is using native SAS function:

if cols[i] in:  ('F0' 'F1' ..... 'F9') and
   substr( cols[i],length(cols[i])-1 ) in ('1' '2' '3')
         then output;

Re: Hash table search for multiple columns in SAS

FreelanceReinh — Fri, 31 May 2019 14:49:31 GMT

Hi @windlove,

I agree with @Ksharp that simpler SAS functions perform better than PRXMATCH in this case. However, my interpretation of your regular expression suggested a somewhat more involved algorithm. Still, it was about 3.6 times faster (17 s vs. 61 s) than the regex approach on my test dataset with 4 million observations (see log under the spoiler). Note that the run time will depend on how many times (i.e. 0, ..., 50) the pattern is found per observation. In my test dataset this number was about 5.4 on average. If this was comparable to your dataset (which does not seem to be the case, given your run time), your run times should be about 10 times higher.

/* Create test data for demonstration (only 4 million obs.) */

data have;
call streaminit(27182818);
array col[50] $;
do id=1 to 4e6;
  do _n_=1 to dim(col);
    col[_n_]=put(rand('uniform',2**32),hex8.);
  end;
  output;
end;
run;

/* Original approach using PRXMATCH */

data want0;
set have;
array cols[*] $ col1-col50;
do i=1 to dim(cols);
  if prxmatch('/F[0-9].*[123]/o', cols[i]) then output;
end;
run;

/* Alternative approach using FINDC function */

data want(drop=_:);
set have;
array cols[*] $ col1-col50;
do i=1 to dim(cols);
  _pos=findc(cols[i],'F');
  do while(0<_pos<7);
    _pos+1;
    _c=char(cols[i],_pos);
    if _c='F' then continue;
    else if '0'<=_c<='9' then do;
      if findc(cols[i],'123',_pos+1) then output;
      leave;
    end;
    else do;
      _pos=findc(cols[i],'F',_pos+1);
      continue;
    end;
  end;
end;
run;

Of course, WANT and WANT0 are identical.

Edit: My test dataset contains strings of length 8 in col1-col50. If your variables are longer (or shorter), the upper bound in the WHILE condition 0<_pos<7 should be the length minus 1 (instead of 7) or, to be more flexible, use the actual length minus 1:

do while(0<_pos<length(cols[i])-1);

Log:

1    /* Create test data for demonstration (only 4 million obs.) */
2
3    data have;
4    call streaminit(27182818);
5    array col[50] $;
6    do id=1 to 4e6;
7      do _n_=1 to dim(col);
8        col[_n_]=put(rand('uniform',2**32),hex8.);
9      end;
10     output;
11   end;
12   run;

NOTE: The data set WORK.HAVE has 4000000 observations and 51 variables.
NOTE: DATA statement used (Total process time):
      real time           14.70 seconds
      cpu time            14.68 seconds


13
14   /* Original approach using PRXMATCH */
15
16   data want0;
17   set have;
18   array cols[*] $ col1-col50;
19   do i=1 to dim(cols);
20     if prxmatch('/F[0-9].*[123]/o', cols[i]) then output;
21   end;
22   run;

NOTE: There were 4000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT0 has 21774150 observations and 52 variables.
NOTE: DATA statement used (Total process time):
      real time           1:00.98
      cpu time            1:00.96


23
24   /* Alternative approach using FINDC function */
25
26   data want(drop=_:);
27   set have;
28   array cols[*] $ col1-col50;
29   do i=1 to dim(cols);
30     _pos=findc(cols[i],'F');
31     do while(0<_pos<7);
32       _pos+1;
33       _c=char(cols[i],_pos);
34       if _c='F' then continue;
35       else if '0'<=_c<='9' then do;
36         if findc(cols[i],'123',_pos+1) then output;
37         leave;
38       end;
39       else do;
40         _pos=findc(cols[i],'F',_pos+1);
41         continue;
42       end;
43     end;
44   end;
45   run;

NOTE: There were 4000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 21774150 observations and 52 variables.
NOTE: DATA statement used (Total process time):
      real time           17.06 seconds
      cpu time            17.06 seconds

Re: Hash table search for multiple columns in SAS

Tom — Fri, 31 May 2019 16:41:25 GMT

Why do you have 50 columns to check?

If they have the same type of information why not restructure those 50 variables into 50 (up to 50) observations instead?

Depending on other features of you data you might actually find that the vertical structure takes less space.

Do really want to make replicates of the whole observation when the pattern is matched in multiple variables? You could end up with output that is 50 times the size of the input.

Re: Hash table search for multiple columns in SAS

windlove — Sat, 01 Jun 2019 11:25:39 GMT

Thanks for all the responses. It's not only the F codes I want to search, but the ones below. If without using the regular expression, it would be quite messy. And for all 50 columns, I want all the columns returned as long as the column contains any of the codes, rather than return at first match.

F.*|G30.*[0189]|G47.[01289]|O99.*[3]|
R44.*[0-38]|R45.*[014]|R48.*[0128]|
Z00.*4|Z03.*2|Z04.*6|Z09.*3|Z13.*3|
Z50.*[23]|Z54.*3|Z61.*9|Z63.*[189]|
Z65.*[89]|Z71.*[45]|Z76.*0|
Q90.*[0-29]|Q91.*[0-7]|
Q93.*|Q99.*2|P04.*3|Q86.*[0128]|
Q87.*[0-358]|Q89.*8|
X[67][0-9].*|X8[0-4].*|R45.*81|
K70.*|Y9[01].*|T51.*|X45.*|Y15.*|
R78.*0|Z72.*1|T40.*7|T43.*6[0129]

Re: Hash table search for multiple columns in SAS

ChrisNZ — Sat, 01 Jun 2019 11:36:58 GMT

Are these the expressions you use?

Re: Hash table search for multiple columns in SAS

FreelanceReinh — Sat, 01 Jun 2019 12:58:49 GMT

The very first component expression, F.*, and several others end with .* -- this is redundant. I suspect that you mean "codes that start with F" and not "codes that contain an F". These criteria might be equivalent for your codes in terms of search results (e.g. if all codes are single letters followed by a sequence of digits), but not performance wise. The appropriate regex would be ^F, but a traditional SAS criterion such as cols[i]=:'F' would still perform much better.

Re: Hash table search for multiple columns in SAS

Astounding — Sat, 01 Jun 2019 13:59:48 GMT

When you output an observation, you are outputting all the columns (both those that match and those that don't match). If you find a second match, and output the observation again, you are outputting all the same columns a second time. Did you want to change that result?

Re: Hash table search for multiple columns in SAS

Patrick — Sun, 02 Jun 2019 02:19:56 GMT

@windlove

The following program should give you improved performance while keeping the coding very manageable. This is a compromise and more performance improvements would be possible but I believe would potentially require a lot of extra coding.

I've made the assumption that your patterns always must match starting from source string position 1.

data have;
  col2='F09324823O'; OUTPUT;
  col2='G30XXXXXXX'; OUTPUT;
  col2='G30XXXX9XX'; OUTPUT;
  col2='O99XXX4XX5'; OUTPUT;
  col2='O99XXX4X3X'; OUTPUT;
run;

data prxPattern;
  infile datalines truncover dlm=',';
  input prxKey :$1. prxPattern :$40.;
  datalines4;
F,/^F/
G,/^G30.*[0189]/
G,/^G47.[01289]/
O,/^O99.*[3]/
R,/^R44.*[0-38]/
R,/^R45.*[014]/
R,/^R48.*[0128]/
;;;;

/* &max_len: max length of variables used in array */
%let max_len=;
proc sql noprint;
  select max(length) into :max_len trimmed
  from dictionary.columns
  where libname='WORK' and memname='HAVE'
        and upcase(name) like 'COL%'
  ;
quit;
%put &=max_len;


data want(keep=id prxPattern var_name var_value compress=yes);

  set have(keep=col:);
  array cols {*} $ col:;
  id=_n_; /* ideally don't create ID but add an existing primary key variable to the keep statements */
  length var_name $32 ; 
  length var_value $&max_len; /* &max_len: max length of variables used in array */

  if _n_=1 then
    do;
      if 0 then set prxPattern;
      length prxID 3;
      dcl hash h1(multidata:'Y', hashexp:5);
      h1.defineKey('prxKey');
      h1.defineData('prxID','prxPattern');
      h1.defineDone();
      /* compile RegEx and load pointer (prxID) into hash */
      do while(not last);
        set prxPattern end=last;
        prxID=prxparse(prxPattern);
        h1.add();
      end;
    end;

  do _i=1 to dim(cols);
    prxKey=substrn(cols[_i],1,1);
    h1.reset_dup();
    do while(h1.do_over()=0);
      if prxmatch(prxID,trim(cols[_i])) then
        do;
          var_name=vname(cols[_i]);
          var_value=cols[_i];
          output;
          leave;
        end;
    end;
  end;

run;

N.B: The assumption that the pattern must match from position one not only allowed to define the "key" and though reduce the number of pattern tests per string, it also allowed to amend the RegEx with a ^ so that the search doesn't look for a pattern match in the middle of a string. These two measures should improve performance quite a bit as compared to your initial code.

Re: Hash table search for multiple columns in SAS

windlove — Mon, 03 Jun 2019 06:51:15 GMT

Thanks for all the responses. First, I think I made a mistake in my original code, which should add "leave" after "output" inside "do" loop, so that it won't return duplicated rows.

Second, I have changed my strategy. I isolated those 50 columns from the main data, and transposed them into a single variable (not using proc transpose due to missing values for each row, otherwise proc transpose immediately gave insufficient memory error). Then I de-duplicated this variable and extracted the codes I want to form a look-up table. Then I defined hash object to subset the data that contains those codes.

Now the subsetting only takes less than 40 mins with hash table. And the previous steps only need to be run once. If any code needs to be changed, I only need to extract the codes I want from that single variable, which only takes a second + hash table subsetting.

However, within hash table, I still use "do" loop. If anyone can suggest something else to further improve the performance, that would be great. Here are the hash table subsetting below.

data want;

   if 0 then set have;

   length codes $8.;
		
   if _n_ = 1 then do;
   declare hash mh(dataset: 'mh_codes');
		mh.defineKey('codes');
		mh.defineDone();
	   call missing(codes);
   end;

   do until(last);
      set have end = last;
      array cols {*} $ col1 - col50;
      do i = 1 to dim(cols);
         if ~mh.find(key: cols[i]) then do;
            output;
	    leave;
         end;
      end;   end;

   stop;
run;

Re: Hash table search for multiple columns in SAS

ChrisNZ — Mon, 03 Jun 2019 22:09:47 GMT

Is this faster?

Also note the 4 comments:

data want;
   if 0 then set have;
   array cols {*} $ col1 - col50;
   length codes $8;           %**** 1 - No dot here, it is not a format;  
   if _n_ = 1 then do;
      declare hash mh(dataset: 'mh_codes');
      mh.defineKey('codes');
      mh.defineDone();
      call missing(codes);    %***** 2 - Unneeded;
   end;

   do until(last);
      set have end = last;
      if 0 
        %macro loop; %local i; %do i = 1 %to 50;
          | ~mh.check(key: cols[&i])  %**** 3 - Check() is faster than find ;
        %end; %mend; %loop
      
       then output;  
   end;

   stop;                 %***** 4 - Unneeded;
run;

Re: Hash table search for multiple columns in SAS

windlove — Mon, 03 Jun 2019 09:39:19 GMT

Thanks so much. The speed is further improved. The first time run was about 28 mins, and the second time run was about 17 mins.

However, do you mind telling me what's the purpose of the slash '|' before '~mh.check()'?

Re: Hash table search for multiple columns in SAS

ChrisNZ — Mon, 03 Jun 2019 09:51:59 GMT

| is the OR operator written differently.

Re: Hash table search for multiple columns in SAS

windlove — Mon, 03 Jun 2019 09:58:02 GMT

What I meant is , how does this code work, why OR here, why not IF?

Re: Hash table search for multiple columns in SAS

ChrisNZ — Mon, 03 Jun 2019 22:08:14 GMT

Back at the computer now. I did a quick test and for me, the IF OR method is about twice as slow as the DO loop.

data HAVE MH_CODES(keep=COL25 rename=(COL25=CODES));
  array COL[1:50];
  do I=1 to 1e4;
    do C=1 to 50;
      COL[C]=I*C;
      output HAVE;
      if ranuni(1)>.5 then output MH_CODES;
     end;
  end;
run;
      
data WANT;   *  6.28 seconds;
  array COLS [*] COL1 - COL50;
  if _N_ = 1 then do;
      declare hash MH(dataset: 'MH_CODES');
      MH.defineKey('CODES');
      MH.defineDone();
      call missing(CODES);
   end;
   do until(LAST);
      set HAVE end = LAST;
      if    %macro loop; %local i; %do i = 1 %to 50;
        ~MH.check(key: COLS[&i])  OR   
            %end; %mend; %loop      
        0 then output;
   end;
run;

data WANT;   *  3.69 seconds;
  array COLS [*] COL1 - COL50;
  if _N_ = 1 then do;
    declare hash MH(dataset: 'MH_CODES');
    MH.defineKey('CODES');
    MH.defineDone();
    call missing(CODES);
  end;
  do until(LAST);
    set HAVE end = LAST;
    do I = 1 to 50;
      if ~MH.check(key: COLS[I]) then do;
        output;
        leave;
      end;  
    end;  
  end;
run;

I had and extraneous leave statement in my code, which I removed. This might explain your different results. Please check again.

Using check() should definitely save time though.

Re: Hash table search for multiple columns in SAS

windlove — Tue, 04 Jun 2019 01:33:53 GMT

Hi Chris, thanks for the further help on this. I am working on a remote server, so normally during the business hours, the server could be very busy. Applying the codes on my data, the earlier codes with %macro took about 25 mins to complete the subsetting, and the new codes with "IF" statement without %macro took about 38 mins. Offcoz it is also based on the server performance at the moment of running.

However, the earlier codes only took about 4 mins to complete the task during off peak hours last night. It is remarkable I would say. Thanks very much for your help. I will accept your latest response as the solution.