Watch this Ask the Expert session to learn more tips and tricks for cleaning data using SAS programming.
Watch the webinar
You will learn how to use SAS character functions to:
Extract and count substrings.
Search for strings.
Concatenate strings.
Remove characters from strings.
If you missed earlier sessions in this series, watch part 1 and part 2 on demand.
The questions from the Q&A segment held at the end of the webinar are listed below and the slides from the webinar are attached.
Q&A
Survey systems such as REDCAP frequently assign a length of $500 to incoming character variables. Every pull from the data base (sometimes weekly) results in hundreds of length $500 variables. Is there any way to check on the fly if that length is truly needed short of recreating the length / format / input statements?
My memory of REDCAP is that it can output both Excel files and SAS files. One suggestion could be to output to an Excel file instead of a SAS file, then read in the Excel file with PROC IMPORT. You want to specify DBMS=XLSX on the PROC IMPORT statement and specify an additional statement of GUESSINGROWS=max. The latter tells SAS to read through all the data to determine the proper variable length. If this makes your program run too slowly, you could specify a smaller number of rows to scan. The SAS Excel import is pretty good at figuring out the max length needed for text variables when using GUESSINGROWS.
If there is nothing you can do to change your existing data pull, there may still be value in creating another dataset with appropriate lengths. In this case, you would need to manually specify the lengths of variables using a LENGTH statement.
You could figure out the lengths to set by using this code extended to your number of variables:
proc sql;
select max(length(textvar1)),
max(length(textvar2)),
max(length(textvar3))
from dataname;
quit;
Doesn’t index do the same thing as FIND?
The FIND, FINDC, and FINDW functions do the same thing as the INDEX, INDEXC, and INDEXW functions, respectively, but have some additional functionality. For that reason, I covered only the FIND functions in the webinar.
Check out the documentation for the FIND and INDEX functions. Both have required arguments of the string to search and the substring to find, but the FIND function additionally allows specification of the start position and modifiers.
Is there a way to search right to left using FIND, but where it will start from the end of the string, not the absolute value of the start?
Yes. To do this, specify a start value that is greater than the maximum length of your string, say -9999 or something like that.
I don't understand FINDW. It does not catch the substring in the first example, while it works in the 2nd example:
data _null_;
some2="some ";
some1="some another";
findit=findw(some1, some2);
putlog "WARNING: " findit=;
some1="some ";
findit=findw(some1, some2);
putlog "WARNING: " findit=;
run;
The FINDW function searches for words within a string, where words are separated by delimiters. SAS does not consider the delimiters as parts of the words in a string. So, for example, “some another” is the word “some”, then a delimiter of a space, then the word “another”. The words “some “ (space after “some”) or “ another” (space before “another”) would not be found, since the one space is the delimiter separating words.
A word can contain delimiters, though, and repeated delimiters can be considered part of a word on either side of them.
For example, “some another” (two spaces in between “some” and “another”) could be the words “some " (space after “some”) and “another” or the words “some” and “ another” (space before “another”).
In order to find your string SOME2 (“some” followed by one space), you would need to search within a string that has two consecutive spaces in a row or has a space followed by another delimiter.
Would -1 and -99 do the same thing? Start at the end of a string?
In general, no. SAS starts at the absolute value of the start value. Then, if the start value is positive SAS reads left to right, and if negative, SAS reads right to left. Using a start value of -1 tells SAS to start at the first character and move left. In the example, the string is shorter than 99 characters, so using a start value of -99 tells SAS to start at the end of the string and move left.
When cleaning addresses, is it more efficient to use the char functions you presented today or the regular expressions that you presented in part 2?
Great question. The perl regular expressions from part 2 will allow you to find much more complicated patterns than the character functions in this part 3. In my opinion, you sacrifice some readability, though. I recommend you look at this paper that compares perl regular expressions to several character functions we covered in part 3:
Express Yourself! Regular Expressions vs SAS Text String Functions
Is there a function that counts the words in a string?
Yes. The COUNTW function counts the number of words in a string. This is a function I would have loved to cover if we had more time.
Recommended Resources
SAS® Programming 2: Data Manipulation Techniques
SAS Functions by Example on RedShelf
SAS Functions by Example on Amazon
Ron Cody SAS Author Page
SAS OnDemand for Academics Software for Students, Educators, and Independent Learners
Please see additional resources in the attached slide deck.
Want more tips? Be sure to subscribe to the Ask the Expert board to receive follow up Q&A, slides and recordings from other SAS Ask the Expert webinars.
... View more