Hello,
I need to divide a string variable into several component parts. All new variables will be string variables.
Sample data provided below:
OriginalVar Component1 Component2 Component3
1702 17 02
1627M 16 27 M
1628SD 16 28 SD
18003 18 003
1730S3 17 30 S3
1704/1 17 04 1
I used the following syntax:
data want;
data have;
Component1 = substr(originalvar,1,2); /*this variable is good*/
Rest=substr(originalvar,3); /*this variable contains everything from position 3 to end of OriginalVar*/
restAlpha=anyalpha(rest); /*identify position of 1st alpha in "REST"*/
restPunct=anypunct(rest); /*identify position of 1st punctuation in "REST"*/
restDigit=anydigit(rest); /*identify position of 1st digit in "REST"*/
if restAlpha = 0 and restPunct = 0 then Component2 = Rest;/*this syntax correctly creates Component2*/
else if restAlpha gt 0 and restPunct eq 0 then Component2 = substr(Rest,restDigit,restAlpha-restDigit);
else if restAlpha eq 0 and restPunct gt 0 then Component2 = substr(Rest,restDigit,restPunct-restDigit);
if restAlpha = 0 and restPunct = 0 thenComponent3 = ' '; /*Component3 is not properly generated*/
else if restAlpha gt 0 and restPunct eq 0 then Component3 = substr(Rest,restAlpha,3);
else if restAlpha eq 0 and restPunct gt 0 then Component3 = substr(Rest,restPunct,3);
end;
This syntax generates the following:
OriginalVar Component1 Component2 Component3
1702 17 02
1627M 16 27 M
1628SD 16 28 S /*this is wrong - Component3 should = 'SD'*/
18003 18 003
1730S3 17 30 S /*this is wrong - Component3 should = 'S3'*/
1704/1 17 04 / /*this is wrong - Component3 should = '1'*/
Any suggestions for how to fix my syntax?
Thanks so much!
Given that you insert the space where Reeza suggested, there are a couple of additional issues to attend to.
First, COMPONENT3 is defined as one character long. There are a couple of easy ways to fix this, one being to add a second space inside the quotes:
if restAlpha = 0 and restPunct = 0 thenComponent3 = ' ';
Second, when you find the punctuation, you actually want to begin locating COMPONENT3 with the next character, not where the punctuation is found. So:
else if restAlpha eq 0 and restPunct gt 0 then Component3 = substr(Rest,restPunct+1,3);
The log generates a lot of errors. Did you fix those first?
Or is the code shown not your actual code?
Mistakes:
1. No SET but two DATA statements (line 2)
2. missing space I used an underscore to illustrate where the issue is.
if restAlpha = 0 and restPunct = 0 then_Component3 = ' '; /*Component3 is not properly generated*/
Results are correct after these fixes.
They were all noted in the log, so please read your log in the future and include it your original post.
@SM1 wrote:
Hello,
I need to divide a string variable into several component parts. All new variables will be string variables.
Sample data provided below:
OriginalVar Component1 Component2 Component3
1702 17 02
1627M 16 27 M
1628SD 16 28 SD
18003 18 003
1730S3 17 30 S3
1704/1 17 04 1
I used the following syntax:
data want;
data have;
Component1 = substr(originalvar,1,2); /*this variable is good*/
Rest=substr(originalvar,3); /*this variable contains everything from position 3 to end of OriginalVar*/
restAlpha=anyalpha(rest); /*identify position of 1st alpha in "REST"*/
restPunct=anypunct(rest); /*identify position of 1st punctuation in "REST"*/
restDigit=anydigit(rest); /*identify position of 1st digit in "REST"*/
if restAlpha = 0 and restPunct = 0 then Component2 = Rest;/*this syntax correctly creates Component2*/
else if restAlpha gt 0 and restPunct eq 0 then Component2 = substr(Rest,restDigit,restAlpha-restDigit);
else if restAlpha eq 0 and restPunct gt 0 then Component2 = substr(Rest,restDigit,restPunct-restDigit);
if restAlpha = 0 and restPunct = 0 thenComponent3 = ' '; /*Component3 is not properly generated*/
else if restAlpha gt 0 and restPunct eq 0 then Component3 = substr(Rest,restAlpha,3);
else if restAlpha eq 0 and restPunct gt 0 then Component3 = substr(Rest,restPunct,3);
end;
This syntax generates the following:
OriginalVar Component1 Component2 Component3
1702 17 02
1627M 16 27 M
1628SD 16 28 S /*this is wrong - Component3 should = 'SD'*/
18003 18 003
1730S3 17 30 S /*this is wrong - Component3 should = 'S3'*/
1704/1 17 04 / /*this is wrong - Component3 should = '1'*/
Any suggestions for how to fix my syntax?
Thanks so much!
Given that you insert the space where Reeza suggested, there are a couple of additional issues to attend to.
First, COMPONENT3 is defined as one character long. There are a couple of easy ways to fix this, one being to add a second space inside the quotes:
if restAlpha = 0 and restPunct = 0 thenComponent3 = ' ';
Second, when you find the punctuation, you actually want to begin locating COMPONENT3 with the next character, not where the punctuation is found. So:
else if restAlpha eq 0 and restPunct gt 0 then Component3 = substr(Rest,restPunct+1,3);
Component3 exists in the original data with the original length, it's not a newly created variable.
If @Reeza is correct about this, and COMPONENT3 already exists before the DATA step begins, you might still have the same problem. COMPONENT3 might be already defined as having a length of $1. You can easily change that by inserting this statement BEFORE the SET statement:
length component3 $ 2;
Thanks for your comments.
Yes, I tried to simplify my syntax and typed it (rather than cutting and pasting). The variable names are not descriptive and quickly become confusing.
My actual syntax (with additional comments) follows:
data rr_rptsernum;
set rr_srutype;
length rptsernumA rptsernum1 $15.; /*added length statements as suggested*/
length rptsernum1a rptsernum1b rptsernum2 $10.;
length rptsernum3a rptsernum3b RRnum3 $5.;
rptsernum = compress(rptsernum,'-)','C');
rptsernumA = tranwrd(rptsernum,' (','/');/*replace ' (' with '/'*/
rptsernum1 = tranwrd(rptsernumA,'XX','999');
rptsernum1a = substr(rptsernum1,1,2); /*retains only initial 2 digits = year*/
rptsernum1b = substr(rptsernum1,3); /*retains everything else (rpt number and extra text)*/
rptsernum2 = scan(rptsernum1b,1,' '); /*deletes everything that follows a space*/
rptsernum2a = anyalpha(rptsernum2);
rptsernum2b = anypunct(rptsernum2);
rptsernum2c = anydigit(rptsernum2);
if rptsernum2a = 0 and rptsernum2b = 0 then rptsernum3a = rptsernum2;
else if rptsernum2a gt 0 and rptsernum2b eq 0 then rptsernum3a = substr(rptsernum2,rptsernum2c,rptsernum2a-rptsernum2c);
else if rptsernum2a eq 0 and rptsernum2b gt 0 then rptsernum3a = substr(rptsernum2,rptsernum2c,rptsernum2b-rptsernum2c);
else if rptsernum2a gt 0 and rptsernum2b gt 0 then rptsernum3a = substr(rptsernum2,rptsernum2c,rptsernum2b-rptsernum2c);
if rptsernum2a = 0 and rptsernum2b = 0 then rptsernum3b = ' ';
else if rptsernum2a gt 0 and rptsernum2b eq 0 then rptsernum3b = substr(rptsernum2,rptsernum2a,3);
else if rptsernum2a eq 0 and rptsernum2b gt 0 then rptsernum3b = substr(rptsernum2,rptsernum2b+1,3); /*added '+1' as suggested*/
else if rptsernum2a gt 0 and rptsernum2b gt 0 then rptsernum3b = substr(rptsernum2,rptsernum2b+1,3); /*added '+1' as suggested*/
RRnum1 = input(rptsernum1a, 4.);
RRnum2 = input(rptsernum3a, 4.);
RRnum3 = rptsernum3b;
run;
Including the length statements and adding '+1' to account for the punctuation were both required.
The syntax now properly divides the original variable into the required component parts.
Thanks so much for your assistance!
SM1
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.