Re: SUBSTR function not working! Please help me to solve the followin...

Roger · Posted 10-12-2012 05:45 PM

1. I need to delete 10 digits from right including @ sign on email account and leave only login name for Account ID column (i.e. mjohn)

2. You may also notice – if the names are repeated, email account is attached in parenthesis with their names. I will also need to remove the email account including the parenthesis from Full Name column if the names have their email account attached (i.e. Tom Lee (tlee2@gmail.com) = Tom Lee)

Thanks,

Roger

Input data:

Full_Name Account_ID

------------ --------------

Mary Johnson mjohn@gmail.com

Mary Jonson (mjohn2@gmail.com) mjohn2@gmail.com

Robin King rking@gmail.com

Tom Lee tlee@gmail.com

Tom Lee (tlee2@gmail.com) tlee2@gmail.com

Need the following output:

Full_Name Account_ID

------------ --------------

Mary Johnson mjohn

Mary Jonson mjohn2

Robin King rking

Tom Lee tlee

Tom Lee tlee2

Reeza · Posted 10-12-2012 06:07 PM

Use the scan function with the "@" as a delimiter.

scan(account_id, 1, "@")

You can use compress function to remove brackets.

Roger · Posted 10-16-2012 06:13 PM

Hi Reeza, Thanks for your help!

I'm still having an issue with the login ID that comes in parantheses. I was able to use compress function to remove the parantheses but login ID still shows up after the the last name, e.g. Mary Jonson mjohn2.

And some people have a middle initial, e.g. Mary E. Jonson mjohn2.

Is there a way to remove the login ID at the end of their names? (in some cases it's a 4th word and in some cases it's a 3rd word, depending on whether they have a middle initial or not)

SASKiwi · Posted 10-16-2012 07:06 PM

One way to do this is like so - FINDC finds the first blank searching from right to left with any trailing blanks removed first, then SUBSTR extracts all characters up to that position:

data test;

string = 'Mary E. Jonson mjohn2 ';

result = substr(string, 1, findc(trim(string), ' ', 'B') - 1);

put _all_;

run;

RichardinOz · Posted 10-16-2012 07:02 PM

To identify the account id try this:

I can't tell from your data whether the envelope symbol is encapsulated in your data or is an artifact of the HTML posting. If the latter, using scan(account_id, -2, ") @") will pick out the second to last "word" in account_is, where the word boundary is delimited by "@", space or right parenthesis. If the symbol is in your data and you can identify it by its byte code you may be able to add that into the list of scan delimiters, otherwise a subsequent account_id = substr (account_id, 2, 99) should do the trick. NB don't compress out the parentheses for this to work.

From your second posting though it appears you have a problem splitting a single input line. If that is the case, first find the account_id by applying the scan function to the input data, then use the INDEX function to locate its first occurrence in the input data

pos = index (rawdata, account_id) ;

Then

full_name = substr (rawdata, 1, pos - 2) ;