Ask the Expert

SAS expertise delivered to your desktop -- on-demand and free!
BookmarkSubscribeRSS Feed

How Do I Clean My Data Using SAS Programming? Part 3 Q&A, Slides, and On-Demand Recording

Started ‎03-25-2025 by
Modified ‎03-26-2025 by
Views 968

Watch this Ask the Expert session to learn more tips and tricks for cleaning data using SAS programming. 

 

Watch the webinar

 

You will learn how to use SAS character functions to:

  • Extract and count substrings.
  • Search for strings.
  • Concatenate strings.
  • Remove characters from strings.

If you missed earlier sessions in this series, watch part 1 and part 2 on demand.

 

The questions from the Q&A segment held at the end of the webinar are listed below and the slides from the webinar are attached.

 

Q&A

 

Survey systems such as REDCAP frequently assign a length of $500 to incoming character variables. Every pull from the data base (sometimes weekly) results in hundreds of length $500 variables. Is there any way to check on the fly if that length is truly needed short of recreating the length / format / input statements?

 

My memory of REDCAP is that it can output both Excel files and SAS files. One suggestion could be to output to an Excel file instead of a SAS file, then read in the Excel file with PROC IMPORT. You want to specify DBMS=XLSX on the PROC IMPORT statement and specify an additional statement of GUESSINGROWS=max. The latter tells SAS to read through all the data to determine the proper variable length. If this makes your program run too slowly, you could specify a smaller number of rows to scan. The SAS Excel import is pretty good at figuring out the max length needed for text variables when using GUESSINGROWS.

 

If there is nothing you can do to change your existing data pull, there may still be value in creating another dataset with appropriate lengths. In this case, you would need to manually specify the lengths of variables using a LENGTH statement.

 

You could figure out the lengths to set by using this code extended to your number of variables:

 

proc sql;
select max(length(textvar1)),
       max(length(textvar2)),
       max(length(textvar3))
    from dataname;
quit;

 

Doesn’t index do the same thing as FIND?

 

The FIND, FINDC, and FINDW functions do the same thing as the INDEX, INDEXC, and INDEXW functions, respectively, but have some additional functionality. For that reason, I covered only the FIND functions in the webinar.

 

Check out the documentation for the FIND and INDEX functions. Both have required arguments of the string to search and the substring to find, but the FIND function additionally allows specification of the start position and modifiers.

 

Is there a way to search right to left using FIND, but where it will start from the end of the string, not the absolute value of the start?

 

Yes. To do this, specify a start value that is greater than the maximum length of your string, say -9999 or something like that.

 

I don't understand FINDW. It does not catch the substring in the first example, while it works in the 2nd example:

 

data _null_;
some2="some ";

some1="some another";
findit=findw(some1, some2);
putlog "WARNING: " findit=;

some1="some       ";
findit=findw(some1, some2);
putlog "WARNING: " findit=;

run;

 

The FINDW function searches for words within a string, where words are separated by delimiters. SAS does not consider the delimiters as parts of the words in a string. So, for example, “some another” is the word “some”, then a delimiter of a space, then the word “another”. The words “some “ (space after “some”) or “ another” (space before “another”) would not be found, since the one space is the delimiter separating words.

 

A word can contain delimiters, though, and repeated delimiters can be considered part of a word on either side of them.

 

For example, “some  another” (two spaces in between “some” and “another”) could be the words “some " (space after “some”) and “another” or the words “some” and “ another” (space before “another”).

 

In order to find your string SOME2 (“some” followed by one space), you would need to search within a string that has two consecutive spaces in a row or has a space followed by another delimiter.

 

Would -1 and -99 do the same thing? Start at the end of a string?

In general, no. SAS starts at the absolute value of the start value. Then, if the start value is positive SAS reads left to right, and if negative, SAS reads right to left. Using a start value of -1 tells SAS to start at the first character and move left. In the example, the string is shorter than 99 characters, so using a start value of -99 tells SAS to start at the end of the string and move left.

 

When cleaning addresses, is it more efficient to use the char functions you presented today or the regular expressions that you presented in part 2?

 

Great question. The perl regular expressions from part 2 will allow you to find much more complicated patterns than the character functions in this part 3. In my opinion, you sacrifice some readability, though. I recommend you look at this paper that compares perl regular expressions to several character functions we covered in part 3:

 

Express Yourself! Regular Expressions vs SAS Text String Functions

 

Is there a function that counts the words in a string?

 

Yes. The COUNTW function counts the number of words in a string. This is a function I would have loved to cover if we had more time.

 

Recommended Resources

 

SAS® Programming 2: Data Manipulation Techniques

SAS Functions by Example on RedShelf

SAS Functions by Example on Amazon

Ron Cody SAS Author Page

SAS OnDemand for Academics Software for Students, Educators, and Independent Learners

 

Please see additional resources in the attached slide deck.

 

Want more tips? Be sure to subscribe to the Ask the Expert board to receive follow up Q&A, slides and recordings from other SAS Ask the Expert webinars.

Comments

Are you able to break this down further and explain in further detail why the output is "AB"?
SUBSTRN("ABCDE", -1, 4), equivalent of SUBSTR("ABCDE", 1, 2).

Hello @JLF5806! The syntax of the both the SUBSTR and SUBSTRN functions is:

 

SUBSTR (string, start <,length>)

In both cases, the function reads the string left to right, starting at the position specified by start and returning a string of the specified length. The two functions differ in how they handle zero or negative start and length values.

 

Your second example of SUBSTR("ABCDE", 1, 2) specifies a starting position of 1 and a length of 2, so the function starts at the letter "A" and returns two characters, resulting in the string "AB".

Similarly, your first example of SUBSTRN("ABCDE", -1, 4) specifies a starting position of -1 and a length of 4. When the starting position is negative or zero, SUBSTRN returns a string that starts at the first position (here the letter "A") and adjusts the length of the string returned to account for the start value. If the starting position is 0, SAS subtracts 1 off the length. If the starting position is -1, SAS subtracts 2 off the length, etc. For a starting position of -1 and length of 4, think of it, for example, that you count -1, 0, 1, 2, but SAS only returns the characters starting at position 1, resulting in the string "AB".


Check out Example 1- Manipulating Strings with the SUBSTRN Function- in the SUBSTRN documentation 
for another example.

@JackieJ_SAS Thank you for the explanation! 

Version history
Last update:
‎03-26-2025 09:41 AM
Updated by:
Contributors

sas-innovate-white.png

Our biggest data and AI event of the year.

Don’t miss the livestream kicking off May 7. It’s free. It’s easy. And it’s the best seat in the house.

Join us virtually with our complimentary SAS Innovate Digital Pass. Watch live or on-demand in multiple languages, with translations available to help you get the most out of every session.

 

Register now!

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Article Labels
Article Tags