Hello,
I'm processing a large amount of data that requires checking whether a string is in a list of strings. This operation needs to be performed many times per observation for large 1,000,000+ observation datasets, so any slight improvement in efficiency could save a lot of time.
The string to search is about 40 'words' long, so I'm wondering whether it is faster to use an IN statement or a hash object (or some other way that is faster would also be great [maybe format?]). Any information about how these methods scale with the length of the string list and the length of the individual strings within the list would be very helpful.
Some initial testing seems to favor the hash merge slightly, but the results were not conclusive.
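For reference, the format-based idea I have in mind would look roughly like this (an untested sketch; the format name, data set, and variable names are just placeholders):
proc format;
  value $targetfmt
    "ABC", "DEF", "GHI" = "1"
    other               = "0";
run;

data want_format_method;
  set have;
  length flag 3;
  /* map the string through the format, then convert "1"/"0" back to numeric */
  flag = input(put(var1, $targetfmt.), 1.);
run;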
Hello,
Your list of strings consists of columns/variables in your dataset, I assume.
Put the character variables in an array and loop through the array for every observation.
Use the FIND function!
You can use the FIND function in SAS to find the position of the first occurrence of some substring within a string.
data new_data;
  set original_data;
  /* position of the first occurrence of "string" within variable_name (0 if not found) */
  first_occurrence = find(variable_name, "string", "i");
run;
The "i" argument tells SAS to ignore case when searching for the substring.
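Combining both suggestions, a rough sketch (untested; the data set and variable names are placeholders) that loops over all character variables and flags any observation containing the search string:
data new_data;
  set original_data;
  array charvars {*} _character_;   /* every character variable in the data set */
  length flag 3;
  flag = 0;
  do i = 1 to dim(charvars);
    if find(charvars{i}, "string", "i") then flag = 1;
  end;
  drop i;
run;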
Koen
Thanks for the response! This could definitely achieve the desired results, but my question was mainly about the efficiency of different techniques. Do you believe this would be significantly faster than using a hash object and/or an IN statement?
To expand on my initial post, which of the below methods would be fastest? Also, if you could direct me to any literature as to why, that would be amazing.
For context, the data set read with SET will have well over 1,000,000 variables (perhaps 10,000,000), and there are probably 50 variables that need to be checked in a way similar to var1. The list to be searched is also much longer than just ("ABC" "DEF" "GHI") and contains at least 30 strings.
Examples
data want_in_method;
  set have;
  length flag 3;
  if var1 in ("ABC" "DEF" "GHI") then flag = 1;
  else flag = 0;
run;

data want_hash_method;
  set have;
  length flag 3;
  if _n_ = 1 then do;
    /* string_list must contain a character variable named var1 holding the search strings */
    declare hash h(dataset:"string_list");
    h.DefineKey("var1");
    h.DefineDone();
  end;
  if h.find() = 0 then flag = 1;
  else flag = 0;
run;

data want_find_method;
  set have;
  length flag 3;
  /* "t" trims trailing blanks from var1; note that FIND also matches partial substrings */
  if find("ABC DEF GHI", var1, "it") then flag = 1;
  else flag = 0;
run;
It's hard to predict which will be fastest (at least hard for me). It depends a bit on your system (hardware, networking, etc.) and also your data (e.g. how far into the list of values will IN and FIND have to look to find a match for each record?).
Generally the slowest part of a DATA step is the actual I/O (reading and writing of records). So for biggish data like this (1M rows isn't that big, unless you have an insane number of variables), if you want to optimize, think about where the data is being read from and where it's being written to.
All that said, since you have 3 proposed methods, why not run them all on your actual data and time them with OPTIONS FULLSTIMER? Or if that would take a long time, you can use (obs=100000) or whatever to test on a subset. Of course, if you really want to play with this, you could simulate some data, run a test script multiple times, analyze the results, and write up a user group paper on efficient lookups. Or search lexjansen.com to see what papers are already out there.
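For example, something along these lines (just your IN-operator step run against a subset; adjust obs= as needed):
options fullstimer;

data want_in_method;
  set have (obs=100000);   /* test on the first 100,000 observations */
  length flag 3;
  if var1 in ("ABC" "DEF" "GHI") then flag = 1;
  else flag = 0;
run;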
Reading through your posts I'm a bit confused. Can you please confirm the following: do you really have a data set with well over 1,000,000 (perhaps 10,000,000) variables?
If that is true then please note that a single observation must fit into memory, and you might not have enough memory for 10M variables.
I'm not sure if you would get into an out-of-memory condition or just get really bad performance because of paging. When I tried it on my laptop, the code below ran for so long that I cancelled the process. I didn't check in detail, but it looked to me like paging occurred.
options msglevel=i fullstimer;

/* show the current memory-related system options */
proc options group=memory;
run;

/* attempt to create a single observation with 10,000,000 numeric variables of length 3 */
data test;
  array vars{10000000} 3;
run;
I'm not sure which of your 3 approaches will execute the fastest. It's just a guess but I'd opt for the IN operator.
If it's about performance then the first thing to do is reconsider your data structure. Do you really need that many variables or could you organise your data differently?
I assume that your SAS file is really huge so likely the bottleneck is I/O.
If it's I/O then the first thing to do is reduce the file size. With that many variables I'd assume a lot of them are indicator/flag variables with values 0 and 1. If so then ideally create these variables as character with a length of 1.
If these variables need to be numeric then use a length of 3 and COMPRESS=BINARY, roughly as sketched below.
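A minimal sketch of that (data set and variable names are made up; COMPRESS=BINARY is used as a data set option here):
data want (compress=binary);
  length flag 3;        /* shortest valid length for a numeric variable        */
  length flag_c $ 1;    /* alternatively, store the flag as a 1-byte character */
  set have;
  flag   = (var1 in ("ABC" "DEF" "GHI"));
  flag_c = put(flag, 1.);
run;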
For processing, consider running your code in parallel (SAS/CONNECT - RSUBMIT) and having each child process work on one portion of your source data - set <table>(firstobs=... obs=...) - roughly as in the sketch below.
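A rough sketch of that pattern, assuming SAS/CONNECT is licensed; the task names, observation split, and the PWORK alias for the parent WORK library are all made up:
options autosignon sascmd="!sascmd";

rsubmit task1 wait=no inheritlib=(work=pwork);
  /* first half of the observations */
  data pwork.flags1;
    length flag 3;
    set pwork.have (firstobs=1 obs=5000000 keep=var1-var50);
    flag = (var1 in ("ABC" "DEF" "GHI"));
  run;
endrsubmit;

rsubmit task2 wait=no inheritlib=(work=pwork);
  /* second half of the observations */
  data pwork.flags2;
    length flag 3;
    set pwork.have (firstobs=5000001 keep=var1-var50);
    flag = (var1 in ("ABC" "DEF" "GHI"));
  run;
endrsubmit;

waitfor _all_ task1 task2;
signoff _all_;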
When @Quentin speaks, people should listen. With 10,000,000 variables to read in, the speed of the string search you are trying to implement will be your last worry, regardless of whether you use the IN operator or the hash object. In fact, I doubt that with so many variables you will be able to process such a data set in any meaningful way.
That said, out of plain curiosity I've spent a few minutes emulating your case and then testing the IN against the hash:
/* Create a 1,000,000-row test data set DATA with 50 character variables (VAR1-VAR50, $16) */
/* and a data set STRINGS holding 30 search strings; roughly 0.5% of the generated values  */
/* match one of the search strings. The quoted list is also stored in macro variable       */
/* STRINGS for use with the IN operator.                                                   */
data data (keep=var:) strings (keep=str) ;
  array var $16 VAR1-VAR50 ;
  array nn [30] _temporary_ ;
  do _i_ = 1 to dim (nn) ;
    nn [_i_] + ceil (ranuni(1) * 1e6 * 2 ** _i_) ;
    str = put (nn[_i_], 16.-L) ;
    length STRINGS $ 1024 ;
    strings = catx (" ", strings, quote (trim (str))) ;
    output strings ;
  end ;
  do _n_ = 1 to 1e6 ;
    do over var ;
      nvar = nn [ceil (ranuni(2) * dim (nn))] ;
      if ranuni (3) > 0.005 then nvar = nvar * 2 ;
      var = put (nvar, 16.-L) ;
    end ;
    output data ;
  end ;
  call symputx ("strings", strings) ;
  stop ;
run ;
%put &=strings ;
data _null_ /* flag_inop */ ;
  set data ;
  array var var: ;
  do over var ;
    if var in (&strings) then leave ;
  end ;
  FLAG = _i_ <= dim (var) ;
run ;
data _null_ /* flag_hash */ ;
  if _n_ = 1 then do ;
    if 0 then set strings ;
    dcl hash h (dataset:"strings") ;
    h.definekey ("str") ;
    h.definedone () ;
  end ;
  set data ;
  array var var: ;
  do over var ;
    if h.check (key:var) = 0 then leave ;
  end ;
  FLAG = _i_ <= dim (var) ;
run ;
In my environment (a Linux 64 SAS server), the IN operator ran in 2.39 seconds and the hash in 3.6. Both methods ultimately use a binary search to find or reject the key, which is why the results don't differ much. With them being so close, it does not really matter which one you opt for.
Note that I did not include any extra variables in DATA. If you really are going to process a data set with 10 million variables, I would do either of the above FLAG-computing exercises by first keeping only VAR1-VAR50 on the SET input and keeping only FLAG in the output. Then you can create a view merging the latter back with DATA and thereafter use that view as input. For example, if opting for the IN operator:
data flag_inop (keep=flag) ;
  set data (keep=var1-var50) ;
  array var var: ;
  do over var ;
    if var in (&strings) then leave ;
  end ;
  FLAG = _i_ <= dim (var) ;
run ;

data data_flag / view = data_flag ;
  merge data flag_inop ;
run ;
HTH
Paul Dorfman
Sorry everyone. This was my first post, so there is clearly a lot to learn about how to ask questions. I meant that there are ~10,000,000 observations (not 10 mil variables). There are only 100-200 variables.
I did a little bit of testing (not comprehensive) and the find method seems to be the fastest if the searched list is short, but the hash and in methods (which are comparable in performance) beat out the find method as the string length increases.
From reading your responses, it seems like the best way to improve performance is not to search the strings faster but to reduce the length of the flag variables anyway.
Understood. Well, in this case you already have your answers and code samples.
SAS string operator performance falls off a cliff if the strings are too long, so keeping the variables short will help.
What do you mean by a 40 word string? If the average word is even 5 characters that will be a string longer than 200 bytes.
Why do you keep searching the same large dataset over and over? What is changing? Is the actual data in the dataset changing over time? Are you searching for different "words" each time?
Since you have data you can use, you can run your own speed tests. You could just check the times listed in the SAS log. Or you could use your own stopwatch: save the current time (datetime, actually) before you start a method and then calculate the difference afterwards.
%let start=%sysfunc(datetime());
*... code to try one of the methods ;
%let runtime=%sysevalf(%sysfunc(datetime()) - &start);
This will get you the run time in seconds (with fractions of a second included).
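For example, wrapped around the IN-operator step from earlier in the thread (a sketch; swap in whichever method you are timing):
%let start=%sysfunc(datetime());

data want_in_method;
  set have;
  length flag 3;
  if var1 in ("ABC" "DEF" "GHI") then flag = 1;
  else flag = 0;
run;

%let runtime=%sysevalf(%sysfunc(datetime()) - &start);
%put The IN method took &runtime seconds;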