windlove
Fluorite | Level 6

I have a dataset with about 40M rows. There are 50 columns from which I would like to extract strings. I have used a normal data step with an array to perform the task, but it took over 2 hours to finish the extraction.

 

I know how to use a hash table in SAS for a simple join or for subsetting by specifying a lookup table first. However, I prefer to use a regular expression to do the extraction here. The current extraction uses code something like the example below.

 

How can I conduct a hash table search on those 50 columns in SAS without a lookup table?

 

data want;
   set have;
   array cols {*} $ col1 - col50;

   do i = 1 to dim(cols);
      if prxmatch('/F[0-9].*[123]/', cols[i])
         then output;
   end;
run;
1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

Back at the computer now. I did a quick test, and for me the IF OR method is about twice as slow as the DO loop.

 

data HAVE MH_CODES(keep=COL25 rename=(COL25=CODES));
  array COL[1:50];
  do I=1 to 1e4;
    do C=1 to 50;
      COL[C]=I*C;
      output HAVE;
      if ranuni(1)>.5 then output MH_CODES;
     end;
  end;
run;
      
data WANT;   *  6.28 seconds;
  array COLS [*] COL1 - COL50;
  if _N_ = 1 then do;
    declare hash MH(dataset: 'MH_CODES');
    MH.defineKey('CODES');
    MH.defineDone();
    call missing(CODES);
  end;
  do until(LAST);
    set HAVE end = LAST;
    if %macro loop; %local i; %do i = 1 %to 50;
         ~MH.check(key: COLS[&i])  OR
       %end; %mend; %loop
       0 then output;
  end;
run;

data WANT;   *  3.69 seconds;
  array COLS [*] COL1 - COL50;
  if _N_ = 1 then do;
    declare hash MH(dataset: 'MH_CODES');
    MH.defineKey('CODES');
    MH.defineDone();
    call missing(CODES);
  end;
  do until(LAST);
    set HAVE end = LAST;
    do I = 1 to 50;
      if ~MH.check(key: COLS[I]) then do;
        output;
        leave;
      end;  
    end;  
  end;
run;

 

 

I had an extraneous LEAVE statement in my code, which I removed. This might explain your different results. Please check again.

Using check() should definitely save time though.

 

 

 


20 REPLIES
ChrisNZ
Tourmaline | Level 20

Hash tables only use direct key equality matches, not fuzzy matches like a regular expression.

I am unsure that a hash-based search would be faster than iterating through an array.

 

I suspect your run time is CPU bound. Is it? Can you show the log report of the data step run?

Regular expressions are very costly in CPU resources, and if the execution time is limited by the CPU, you should try to apply the regex only after a simpler test.

Something like 

if index(cols[I],'F') then
  if prxmatch('/F[0-9].*[123]/', cols[I]) then
   output;

Alter the test to match your data.

You can also trim the string to speed up the match if you expect spaces at the end.
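For instance, a rough sketch combining the cheap INDEX pre-test with a trimmed argument (the /o modifier, which reuses the compiled pattern, is my addition; it also appears in a later reply):

if index(cols[i],'F') then
   if prxmatch('/F[0-9].*[123]/o', trim(cols[i])) then
      output;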

 

 

 

 

 

windlove
Fluorite | Level 6

Thanks for the reply. The issue is that it's not only the F codes I am going to extract. Without a regular expression I would have to write out about 200 codes including their variations, so a regular expression is the necessary and easiest way to express the strings.

 

I don't have the "index" line you suggested in my code yet; I will test it and report back later.

ChrisNZ
Tourmaline | Level 20

Thanks for the reply. The issue is that it's not only the F codes I am going to extract. Without a regular expression I would have to write out about 200 codes including their variations, so a regular expression is the necessary and easiest way to express the strings.

Read my reply again; I think you missed my point.

 

Astounding
PROC Star
Could you match on multiple variables? If so, would you want to output the same observation multiple times? If not, you could cut the time roughly in half by ending the DO loop once a match has been found.
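A minimal sketch of that idea applied to the original step (the regex is the one from the question; LEAVE stops the scan at the first matching column):

data want;
   set have;
   array cols {*} $ col1 - col50;
   do i = 1 to dim(cols);
      if prxmatch('/F[0-9].*[123]/', cols[i]) then do;
         output;
         leave;   /* do not scan the remaining columns once one matches */
      end;
   end;
run;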
Ksharp
Super User

Using a Perl regular expression is not an efficient way.

The most efficient way is to use native SAS functions:

 

if cols[i] in:  ('F0' 'F1' ..... 'F9') and
substr( cols[i],length(cols[i])-1 ) in ('1' '2' '3') then output;
FreelanceReinh
Jade | Level 19

Hi @windlove,

 

I agree with @Ksharp that simpler SAS functions perform better than PRXMATCH in this case. However, my interpretation of your regular expression suggested a somewhat more involved algorithm. Still, it was about 3.6 times faster (17 s vs. 61 s) than the regex approach on my test dataset with 4 million observations (see the log under the spoiler). Note that the run time will depend on how many times (i.e. 0, ..., 50) the pattern is found per observation. In my test dataset this number was about 5.4 on average. If this is comparable to your dataset (which does not seem to be the case, given your run time), your run times should be about 10 times higher than mine, since you have 40 million observations rather than 4 million.

 

/* Create test data for demonstration (only 4 million obs.) */

data have;
call streaminit(27182818);
array col[50] $;
do id=1 to 4e6;
  do _n_=1 to dim(col);
    col[_n_]=put(rand('uniform',2**32),hex8.);
  end;
  output;
end;
run;

/* Original approach using PRXMATCH */

data want0;
set have;
array cols[*] $ col1-col50;
do i=1 to dim(cols);
  if prxmatch('/F[0-9].*[123]/o', cols[i]) then output;
end;
run;

/* Alternative approach using FINDC function */

data want(drop=_:);
set have;
array cols[*] $ col1-col50;
do i=1 to dim(cols);
  _pos=findc(cols[i],'F');                /* position of the first 'F' (0 if none) */
  do while(0<_pos<7);                     /* an 'F' early enough that a digit plus a 1/2/3 can still follow */
    _pos+1;
    _c=char(cols[i],_pos);                /* character right after that 'F' */
    if _c='F' then continue;              /* another 'F': it may itself start the pattern */
    else if '0'<=_c<='9' then do;         /* 'F' followed by a digit */
      if findc(cols[i],'123',_pos+1) then output;   /* any 1, 2 or 3 further to the right? */
      leave;
    end;
    else do;                              /* 'F' not followed by a digit: jump to the next 'F' */
      _pos=findc(cols[i],'F',_pos+1);
      continue;
    end;
  end;
end;
run;

Of course, WANT and WANT0 are identical.

 

Edit: My test dataset contains strings of length 8 in col1-col50. If your variables are longer (or shorter), the upper bound in the WHILE condition 0<_pos<7 should be that length minus 1 (instead of 7) or, to be more flexible, use the actual string length minus 1:

do while(0<_pos<length(cols[i])-1);

Log:

 

Spoiler
1    /* Create test data for demonstration (only 4 million obs.) */
2
3    data have;
4    call streaminit(27182818);
5    array col[50] $;
6    do id=1 to 4e6;
7      do _n_=1 to dim(col);
8        col[_n_]=put(rand('uniform',2**32),hex8.);
9      end;
10     output;
11   end;
12   run;

NOTE: The data set WORK.HAVE has 4000000 observations and 51 variables.
NOTE: DATA statement used (Total process time):
      real time           14.70 seconds
      cpu time            14.68 seconds


13
14   /* Original approach using PRXMATCH */
15
16   data want0;
17   set have;
18   array cols[*] $ col1-col50;
19   do i=1 to dim(cols);
20     if prxmatch('/F[0-9].*[123]/o', cols[i]) then output;
21   end;
22   run;

NOTE: There were 4000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT0 has 21774150 observations and 52 variables.
NOTE: DATA statement used (Total process time):
      real time           1:00.98
      cpu time            1:00.96


23
24   /* Alternative approach using FINDC function */
25
26   data want(drop=_:);
27   set have;
28   array cols[*] $ col1-col50;
29   do i=1 to dim(cols);
30     _pos=findc(cols[i],'F');
31     do while(0<_pos<7);
32       _pos+1;
33       _c=char(cols[i],_pos);
34       if _c='F' then continue;
35       else if '0'<=_c<='9' then do;
36         if findc(cols[i],'123',_pos+1) then output;
37         leave;
38       end;
39       else do;
40         _pos=findc(cols[i],'F',_pos+1);
41         continue;
42       end;
43     end;
44   end;
45   run;

NOTE: There were 4000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 21774150 observations and 52 variables.
NOTE: DATA statement used (Total process time):
      real time           17.06 seconds
      cpu time            17.06 seconds

 

windlove
Fluorite | Level 6

Thanks for all the responses. It's not only the F codes I want to search for, but all of the ones below. Without a regular expression it would be quite messy. And across all 50 columns, I want the observation output for every column that contains any of the codes, rather than stopping at the first match.

 

F.*|G30.*[0189]|G47.[01289]|O99.*[3]|
R44.*[0-38]|R45.*[014]|R48.*[0128]|
Z00.*4|Z03.*2|Z04.*6|Z09.*3|Z13.*3|
Z50.*[23]|Z54.*3|Z61.*9|Z63.*[189]|
Z65.*[89]|Z71.*[45]|Z76.*0|
Q90.*[0-29]|Q91.*[0-7]|
Q93.*|Q99.*2|P04.*3|Q86.*[0128]|
Q87.*[0-358]|Q89.*8|
X[67][0-9].*|X8[0-4].*|R45.*81|
K70.*|Y9[01].*|T51.*|X45.*|Y15.*|
R78.*0|Z72.*1|T40.*7|T43.*6[0129]

FreelanceReinh
Jade | Level 19

The very first component expression, F.*, and several others end with .* -- this is redundant. I suspect that you mean "codes that start with F" and not "codes that contain an F". These criteria might be equivalent for your codes in terms of search results (e.g. if all codes are single letters followed by a sequence of digits), but not performance wise. The appropriate regex would be ^F, but a traditional SAS criterion such as cols[i]=:'F' would still perform much better.
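For illustration, a hedged sketch of the two variants mentioned, using the array reference from the original code (either test on its own, not both):

if prxmatch('/^F/o', cols[i]) then output;   /* anchored regex: match only at the start of the string */
if cols[i] =: 'F' then output;               /* truncated comparison: same prefix test, usually cheaper */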

Astounding
PROC Star

When you output an observation, you are outputting all the columns (both those that match and those that don't match).  If you find a second match, and output the observation again, you are outputting all the same columns a second time.  Did you want to change that result? 

Patrick
Opal | Level 21

@windlove 

The following program should give you improved performance while keeping the coding very manageable. This is a compromise; more performance improvements would be possible, but I believe they would require a lot of extra coding.

I've made the assumption that your patterns always must match starting from source string position 1.

data have;
  col2='F09324823O'; OUTPUT;
  col2='G30XXXXXXX'; OUTPUT;
  col2='G30XXXX9XX'; OUTPUT;
  col2='O99XXX4XX5'; OUTPUT;
  col2='O99XXX4X3X'; OUTPUT;
run;

data prxPattern;
  infile datalines truncover dlm=',';
  input prxKey :$1. prxPattern :$40.;
  datalines4;
F,/^F/
G,/^G30.*[0189]/
G,/^G47.[01289]/
O,/^O99.*[3]/
R,/^R44.*[0-38]/
R,/^R45.*[014]/
R,/^R48.*[0128]/
;;;;

/* &max_len: max length of variables used in array */
%let max_len=;
proc sql noprint;
  select max(length) into :max_len trimmed
  from dictionary.columns
  where libname='WORK' and memname='HAVE'
        and upcase(name) like 'COL%'
  ;
quit;
%put &=max_len;


data want(keep=id prxPattern var_name var_value compress=yes);

  set have(keep=col:);
  array cols {*} $ col:;
  id=_n_; /* ideally don't create ID but add an existing primary key variable to the keep statements */
  length var_name $32 ; 
  length var_value $&max_len; /* &max_len: max length of variables used in array */

  if _n_=1 then
    do;
      if 0 then set prxPattern;
      length prxID 3;
      dcl hash h1(multidata:'Y', hashexp:5);
      h1.defineKey('prxKey');
      h1.defineData('prxID','prxPattern');
      h1.defineDone();
      /* compile RegEx and load pointer (prxID) into hash */
      do while(not last);
        set prxPattern end=last;
        prxID=prxparse(prxPattern);
        h1.add();
      end;
    end;

  do _i=1 to dim(cols);
    prxKey=substrn(cols[_i],1,1);
    h1.reset_dup();
    do while(h1.do_over()=0);
      if prxmatch(prxID,trim(cols[_i])) then
        do;
          var_name=vname(cols[_i]);
          var_value=cols[_i];
          output;
          leave;
        end;
    end;
  end;

run;

N.B.: The assumption that the pattern must always match from position one not only allowed defining a "key" and thus reducing the number of pattern tests per string, it also allowed amending each RegEx with a ^ so that the search doesn't look for a pattern match in the middle of a string. These two measures should improve performance quite a bit compared to your initial code.

Tom
Super User

Why do you have 50 columns to check? 

If they have the same type of information, why not restructure those 50 variables into (up to) 50 observations instead?

Depending on other features of your data, you might actually find that the vertical structure takes less space.
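A rough sketch of that vertical restructuring, assuming HAVE has an identifier variable ID (the names here are illustrative, not from the original post):

data have_long(keep=id col_name code);
   set have;
   array cols {*} $ col1 - col50;
   length col_name $32 code $8;
   do i = 1 to dim(cols);
      if not missing(cols[i]) then do;   /* skip empty cells so the long table stays small */
         col_name = vname(cols[i]);      /* remember which of the 50 columns the value came from */
         code     = cols[i];
         output;
      end;
   end;
run;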

 

Do you really want to make replicates of the whole observation when the pattern is matched in multiple variables?  You could end up with output that is 50 times the size of the input.

windlove
Fluorite | Level 6

Thanks for all the responses.  First, I think I made a mistake in my original code: there should be a LEAVE statement after the OUTPUT inside the DO loop, so that it doesn't return duplicated rows.

 

Second, I have changed my strategy. I isolated those 50 columns from the main data and transposed them into a single variable (not using PROC TRANSPOSE, because with the missing values in each row it immediately gave an insufficient-memory error). Then I de-duplicated this variable and extracted the codes I want to form a lookup table. Then I defined a hash object to subset the data that contains those codes.
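For readers who want to reproduce this, a hedged sketch of the lookup-table build described above (dataset and variable names follow the hash code further down; the single regex stands in for the full pattern list):

/* 1. Stack the 50 columns into one CODES variable */
data all_codes(keep=codes);
   set have(keep=col1-col50);
   array cols {*} $ col1 - col50;
   length codes $8;
   do i = 1 to dim(cols);
      if not missing(cols[i]) then do;
         codes = cols[i];
         output;
      end;
   end;
run;

/* 2. De-duplicate the single variable */
proc sort data=all_codes nodupkey;
   by codes;
run;

/* 3. Keep only the codes of interest to form the lookup table */
data mh_codes;
   set all_codes;
   if prxmatch('/F[0-9].*[123]/o', codes);
run;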

 

Now the subsetting takes less than 40 minutes with the hash table, and the previous steps only need to be run once. If any code needs to change, I only need to re-extract the codes I want from that single variable, which takes a second, plus the hash table subsetting.

 

However, within the hash table step I still use a DO loop. If anyone can suggest something to further improve the performance, that would be great. The hash table subsetting is below.

 

 

data want;

   if 0 then set have;

   length codes $8.;

   if _n_ = 1 then do;
      declare hash mh(dataset: 'mh_codes');
      mh.defineKey('codes');
      mh.defineDone();
      call missing(codes);
   end;

   do until(last);
      set have end = last;
      array cols {*} $ col1 - col50;
      do i = 1 to dim(cols);
         if ~mh.find(key: cols[i]) then do;
            output;
            leave;
         end;
      end;
   end;

   stop;
run;

 

ChrisNZ
Tourmaline | Level 20

Is this faster?

Also note the 4 comments:

data want;
   if 0 then set have;
   array cols {*} $ col1 - col50;
   length codes $8;           %**** 1 - No dot here, it is not a format;  
   if _n_ = 1 then do;
      declare hash mh(dataset: 'mh_codes');
      mh.defineKey('codes');
      mh.defineDone();
      call missing(codes);    %***** 2 - Unneeded;
   end;

   do until(last);
      set have end = last;
      if 0 
        %macro loop; %local i; %do i = 1 %to 50;
          | ~mh.check(key: cols[&i])  %**** 3 - Check() is faster than find ;
        %end; %mend; %loop
      then output;
   end;
   stop;                     %***** 4 - Unneeded;
run;

 

 

 

