Hello,
I need to divide a string variable into several component parts. All new variables will be string variables.
Sample data provided below:
OriginalVar Component1 Component2 Component3
1702 17 02
1627M 16 27 M
1628SD 16 28 SD
18003 18 003
1730S3 17 30 S3
1704/1 17 04 1
I used the following syntax:
data want;
data have;
Component1 = substr(originalvar,1,2); /*this variable is good*/
Rest=substr(originalvar,3); /*this variable contains everything from position 3 to end of OriginalVar*/
restAlpha=anyalpha(rest); /*identify position of 1st alpha in "REST"*/
restPunct=anypunct(rest); /*identify position of 1st punctuation in "REST"*/
restDigit=anydigit(rest); /*identify position of 1st digit in "REST"*/
if restAlpha = 0 and restPunct = 0 then Component2 = Rest;/*this syntax correctly creates Component2*/
else if restAlpha gt 0 and restPunct eq 0 then Component2 = substr(Rest,restDigit,restAlpha-restDigit);
else if restAlpha eq 0 and restPunct gt 0 then Component2 = substr(Rest,restDigit,restPunct-restDigit);
if restAlpha = 0 and restPunct = 0 thenComponent3 = ' '; /*Component3 is not properly generated*/
else if restAlpha gt 0 and restPunct eq 0 then Component3 = substr(Rest,restAlpha,3);
else if restAlpha eq 0 and restPunct gt 0 then Component3 = substr(Rest,restPunct,3);
end;
This syntax generates the following:
OriginalVar Component1 Component2 Component3
1702 17 02
1627M 16 27 M
1628SD 16 28 S /*this is wrong - Component3 should = 'SD'*/
18003 18 003
1730S3 17 30 S /*this is wrong - Component3 should = 'S3'*/
1704/1 17 04 / /*this is wrong - Component3 should = '1'*/
Any suggestions for how to fix my syntax?
Thanks so much!
Given that you insert the space where Reeza suggested, there are a couple of additional issues to attend to.
First, COMPONENT3 is defined as one character long. There are a couple of easy ways to fix this, one being to add a second space inside the quotes:
if restAlpha = 0 and restPunct = 0 thenComponent3 = ' ';
Second, when you find the punctuation, you actually want to begin locating COMPONENT3 with the next character, not where the punctuation is found. So:
else if restAlpha eq 0 and restPunct gt 0 then Component3 = substr(Rest,restPunct+1,3);
The log generates a lot of errors. Did you fix those first?
Or is the code shown not your actual code?
Mistakes:
1. No SET but two DATA statements (line 2)
2. missing space I used an underscore to illustrate where the issue is.
if restAlpha = 0 and restPunct = 0 then_Component3 = ' '; /*Component3 is not properly generated*/
Results are correct after these fixes.
They were all noted in the log, so please read your log in the future and include it your original post.
@SM1 wrote:
Hello,
I need to divide a string variable into several component parts. All new variables will be string variables.
Sample data provided below:
OriginalVar Component1 Component2 Component3
1702 17 02
1627M 16 27 M
1628SD 16 28 SD
18003 18 003
1730S3 17 30 S3
1704/1 17 04 1
I used the following syntax:
data want;
data have;
Component1 = substr(originalvar,1,2); /*this variable is good*/
Rest=substr(originalvar,3); /*this variable contains everything from position 3 to end of OriginalVar*/
restAlpha=anyalpha(rest); /*identify position of 1st alpha in "REST"*/
restPunct=anypunct(rest); /*identify position of 1st punctuation in "REST"*/
restDigit=anydigit(rest); /*identify position of 1st digit in "REST"*/
if restAlpha = 0 and restPunct = 0 then Component2 = Rest;/*this syntax correctly creates Component2*/
else if restAlpha gt 0 and restPunct eq 0 then Component2 = substr(Rest,restDigit,restAlpha-restDigit);
else if restAlpha eq 0 and restPunct gt 0 then Component2 = substr(Rest,restDigit,restPunct-restDigit);
if restAlpha = 0 and restPunct = 0 thenComponent3 = ' '; /*Component3 is not properly generated*/
else if restAlpha gt 0 and restPunct eq 0 then Component3 = substr(Rest,restAlpha,3);
else if restAlpha eq 0 and restPunct gt 0 then Component3 = substr(Rest,restPunct,3);
end;
This syntax generates the following:
OriginalVar Component1 Component2 Component3
1702 17 02
1627M 16 27 M
1628SD 16 28 S /*this is wrong - Component3 should = 'SD'*/
18003 18 003
1730S3 17 30 S /*this is wrong - Component3 should = 'S3'*/
1704/1 17 04 / /*this is wrong - Component3 should = '1'*/
Any suggestions for how to fix my syntax?
Thanks so much!
Given that you insert the space where Reeza suggested, there are a couple of additional issues to attend to.
First, COMPONENT3 is defined as one character long. There are a couple of easy ways to fix this, one being to add a second space inside the quotes:
if restAlpha = 0 and restPunct = 0 thenComponent3 = ' ';
Second, when you find the punctuation, you actually want to begin locating COMPONENT3 with the next character, not where the punctuation is found. So:
else if restAlpha eq 0 and restPunct gt 0 then Component3 = substr(Rest,restPunct+1,3);
Component3 exists in the original data with the original length, it's not a newly created variable.
If @Reeza is correct about this, and COMPONENT3 already exists before the DATA step begins, you might still have the same problem. COMPONENT3 might be already defined as having a length of $1. You can easily change that by inserting this statement BEFORE the SET statement:
length component3 $ 2;
Thanks for your comments.
Yes, I tried to simplify my syntax and typed it (rather than cutting and pasting). The variable names are not descriptive and quickly become confusing.
My actual syntax (with additional comments) follows:
data rr_rptsernum;
set rr_srutype;
length rptsernumA rptsernum1 $15.; /*added length statements as suggested*/
length rptsernum1a rptsernum1b rptsernum2 $10.;
length rptsernum3a rptsernum3b RRnum3 $5.;
rptsernum = compress(rptsernum,'-)','C');
rptsernumA = tranwrd(rptsernum,' (','/');/*replace ' (' with '/'*/
rptsernum1 = tranwrd(rptsernumA,'XX','999');
rptsernum1a = substr(rptsernum1,1,2); /*retains only initial 2 digits = year*/
rptsernum1b = substr(rptsernum1,3); /*retains everything else (rpt number and extra text)*/
rptsernum2 = scan(rptsernum1b,1,' '); /*deletes everything that follows a space*/
rptsernum2a = anyalpha(rptsernum2);
rptsernum2b = anypunct(rptsernum2);
rptsernum2c = anydigit(rptsernum2);
if rptsernum2a = 0 and rptsernum2b = 0 then rptsernum3a = rptsernum2;
else if rptsernum2a gt 0 and rptsernum2b eq 0 then rptsernum3a = substr(rptsernum2,rptsernum2c,rptsernum2a-rptsernum2c);
else if rptsernum2a eq 0 and rptsernum2b gt 0 then rptsernum3a = substr(rptsernum2,rptsernum2c,rptsernum2b-rptsernum2c);
else if rptsernum2a gt 0 and rptsernum2b gt 0 then rptsernum3a = substr(rptsernum2,rptsernum2c,rptsernum2b-rptsernum2c);
if rptsernum2a = 0 and rptsernum2b = 0 then rptsernum3b = ' ';
else if rptsernum2a gt 0 and rptsernum2b eq 0 then rptsernum3b = substr(rptsernum2,rptsernum2a,3);
else if rptsernum2a eq 0 and rptsernum2b gt 0 then rptsernum3b = substr(rptsernum2,rptsernum2b+1,3); /*added '+1' as suggested*/
else if rptsernum2a gt 0 and rptsernum2b gt 0 then rptsernum3b = substr(rptsernum2,rptsernum2b+1,3); /*added '+1' as suggested*/
RRnum1 = input(rptsernum1a, 4.);
RRnum2 = input(rptsernum3a, 4.);
RRnum3 = rptsernum3b;
run;
Including the length statements and adding '+1' to account for the punctuation were both required.
The syntax now properly divides the original variable into the required component parts.
Thanks so much for your assistance!
SM1
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.