Hi ,
I want to split the below string in to different variables and when there are two consecutive characters then I want to insert 1. I tried putting characters and numbers in different variables but that's not giving what I want. Please help me
String
28,16OB4N7L8U4L
28,12O8B20O
46,12C
30,14CB12C4N2C
Output:
var1 Var2 var3 var4 var5 var6 var7 var8 var9 var10 var11 var12 var13 var14
28 , 16 O 1 B 4 N 7 L 8 U 4 L
28 , 12 O 8 B 20 O
46 , 12 C
30 , 14 C 1 B 12 C 4 N 2 C
Thank You
There is probably an easier way to do this, but it's late and I'm tired:
data have;
informat string $50.;
input String;
cards;
28,16OB4N7L8U4L
28,12O8B20O
46,12C
30,14CB12C4N2C
;
data want (drop=string i);
set have;
string=prxchange('s/([0-9])([,A-Za-z])/$1_$2/', -1, STRING);
string=prxchange('s/(,)([0-9A-Za-z])/$1_$2/', -1, STRING);
string=prxchange('s/([0-9A-Za-z])(,)/$1_$2/', -1, STRING);
string=prxchange('s/([A-Za-z])([0-9])/$1_$2/', -1, STRING);
string=prxchange('s/([A-Za-z])([A-Za-z])/$1_1_$2/', -1, STRING);
array var(14) $;
i=1;
do while(scan(string,i,'_') ne '');
var(i)=scan(string,i,'_');
i+1;
end;
run;
data have;
input string:$100.;
cards;
28,16OB4N7L8U4L
28,12O8B20O
46,12C
30,14CB12C4N2C
;
data want;
set have;
var1=substr(string,1,2);
var2=substr(string,3,1);
var3=substr(string,4,2);
array vas(*) $ var4-var14;
do i = 1 to dim(vas);
vas(i)=substr(compress(string),i+5,1);
end;
run;
Thanks,
Jag
There is probably an easier way to do this, but it's late and I'm tired:
data have;
informat string $50.;
input String;
cards;
28,16OB4N7L8U4L
28,12O8B20O
46,12C
30,14CB12C4N2C
;
data want (drop=string i);
set have;
string=prxchange('s/([0-9])([,A-Za-z])/$1_$2/', -1, STRING);
string=prxchange('s/(,)([0-9A-Za-z])/$1_$2/', -1, STRING);
string=prxchange('s/([0-9A-Za-z])(,)/$1_$2/', -1, STRING);
string=prxchange('s/([A-Za-z])([0-9])/$1_$2/', -1, STRING);
string=prxchange('s/([A-Za-z])([A-Za-z])/$1_1_$2/', -1, STRING);
array var(14) $;
i=1;
do while(scan(string,i,'_') ne '');
var(i)=scan(string,i,'_');
i+1;
end;
run;
Thank you your code worked correct for me, but if there is no number in-front of character then I need 1.
example:After O theres no number, so in this case I need 1 then B.
For the first string : 28,16OB4N7L8U4L
output:
var1 Var2 var3 var4 var5 var6 var7 var8 var9 var10 var11 var12 var13 var14
28 , 16 O 1 B 4 N 7 L 8 U 4 L
Hope I'm clear.
Thank you
Something like:
data have;
informat string $50.;
input String;
cards;
28,16OB4N7L8U4L
28,12O8B20O
46,12C
30,14CB12C4N2C
;
run;
data want (keep=old_string final_string);
set have;
length id1-id3 old_string final_string $20;
old_string=string;
array var{40} $20.;
array id{3} $20.;
/* Address the first parts */
id1=substr(string,1,2);
id2=substr(string,3,1);
id3=substr(string,4,2);
string=substr(string,6);
/* Now we have the data left to process loop through each character */
tmp_count=1;
do i=1 to 40 by 2;
var{i}=substr(string,tmp_count,1);
tmp_count=tmp_count+1;
end;
/* Additional step to insert 1's where necessary */
do i=1 to 40 by 2;
if anyalpha(var{i}) and anyalpha(var{i+2}) then var{i+1}="1";
end;
/* Create final complete string */
final_string=cats(of id{*})||cats(of var{*});
run;
Are you sure you ran my code? I think that is exactly what the code did!
Thank You Arthur, your code worked fine.
But I just realized that when string has no number after comma it is not giving value
example :
35,NB15O4L4NB7O
Desired Output:
var1 Var2 var3 var4 var5 var6 var7 var8 var9 var10 var11 var12 var13 var14 var15 var16
35 , 1 N 1 B 15 O 4 L 4 N 1 B 7 O
Thank you
data have;
input string $40.;
cards;
28,16OB4N7L8U4L
28,12O8B20O
46,12C
30,14CB12C4N2C
;
run;
data temp;
 set have;
 string=prxchange('s/([A-Z])([A-Z])/\11\2/i',-1,string);
 pid=prxparse('/\d+|\W+|[a-z]/i');
 n+1;start=1;stop=length(string);
 call prxnext(pid,start,stop,string,position,length);
 do while(position gt 0);
  found=substr(string,position,length);output;
   call prxnext(pid,start,stop,string,position,length);
 end;
 drop pid start stop position length string ;
run;
proc transpose data=temp out=want(drop=_name_);
 by n;
 var found;
run;
Xia Keshan
Xia,
FWIW, your code appears to do the same thing that my code does, but requires slightly more cpu and real time than the code I had proposed.
The test I ran to compare the two was:
data have (drop=i);
input string $40.;
do i=1to 100000;
output;
end;
cards;
28,16OB4N7L8U4L
28,12O8B20O
46,12C
30,14CB12C4N2C
;
run;
data temp;
set have;
string=prxchange('s/([A-Z])([A-Z])/\11\2/i',-1,string);
pid=prxparse('/\d+|\W+|[a-z]/i');
n+1;start=1;stop=length(string);
call prxnext(pid,start,stop,string,position,length);
do while(position gt 0);
found=substr(string,position,length);output;
call prxnext(pid,start,stop,string,position,length);
end;
drop pid start stop position length string ;
run;
proc transpose data=temp out=want_xia(drop=_name_ n);
by n;
var found;
run;
data want_art (drop=string i);
set have;
string=prxchange('s/([0-9])([,A-Za-z])/$1_$2/', -1, STRING);
string=prxchange('s/(,)([0-9A-Za-z])/$1_$2/', -1, STRING);
string=prxchange('s/([0-9A-Za-z])(,)/$1_$2/', -1, STRING);
string=prxchange('s/([A-Za-z])([0-9])/$1_$2/', -1, STRING);
string=prxchange('s/([A-Za-z])([A-Za-z])/$1_1_$2/', -1, STRING);
array var(14) $;
i=1;
do while(scan(string,i,'_') ne '');
var(i)=scan(string,i,'_');
i+1;
end;
run;
Yeah. You are right ,Arthur.T . I believe that is because you pre-define the number of variables . How do you know that would have to be fourteen ?
Best
Xia Keshan
Xia: Agreed! I hard coded it based on my interpretation of OP's problem statement.
data foo;
length foobar $ 32;
input foobar;
array v[14] $ 2;
v1 = scan(foobar, 1, ',');
v2 = ',';
bar = scan(foobar, 2, ',');
bar = prxchange('s/([a-z])([a-z])/\11\2/io', -1, bar);
bar = prxchange('s/((?:[a-z]+)|(?:[0-9]+))(?!\s)/\1_/io', -1, bar);
do i=3 to dim(v) by 1 while(1);
v=scan(bar, i-2, '_');
if missing(v) then leave;
end;
keep v:;
cards;
28,16OB4N7L8U4L
28,12O8B20O
46,12C
30,14CB12C4N2C
;
run;
Matt: Not more efficient! Took the same time as my code and will only work, as is, if the first two fields always contain a number followed by a comma, and none of the numbers require more than 2 characters.
Obviously those would be easy to change, but no improvement in performance.
Okay Art, I will change it slightly to make it more efficient by breaking apart second prx into two, and why not, lets write it in ds2 also:
2083 data x.have (drop=i);
2084 input string $40.;
2085 do i=1to 1e6;
2086 output;
2087 end;
2088 cards;
NOTE: The data set X.HAVE has 4000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.14 seconds
cpu time 0.14 seconds
2093 ;
2094 run;
2095
2096 data want_fe;
2097 set x.have;
2098 array v[14] $ 2;
2099 v1 = scan(string, 1, ',');
2100 v2 = ',';
2101 bar = scan(string, 2, ',');
2102 bar = prxchange('s/([a-z])([a-z])/\11\2/io', -1, bar);
2103 bar = prxchange('s/([a-z]+)(?!\s)/\1_/io', -1, bar);
2104 bar = prxchange('s/([0-9]+)/\1_/o', -1, bar);
2105 do i=3 to dim(v) by 1 while(1);
2106 v=scan(bar, i-2, '_');
2107 if missing(v) then leave;
2108 end;
2109 keep v:;
2110 run;
NOTE: There were 4000000 observations read from the data set X.HAVE.
NOTE: The data set WORK.WANT_FE has 4000000 observations and 14 variables.
NOTE: DATA statement used (Total process time):
real time 15.04 seconds
cpu time 15.06 seconds
2111
2112 data want_art (drop=string i);
2113 set x.have;
2114 string=prxchange('s/([0-9])([,A-Za-z])/$1_$2/', -1, STRING);
2115 string=prxchange('s/(,)([0-9A-Za-z])/$1_$2/', -1, STRING);
2116 string=prxchange('s/([0-9A-Za-z])(,)/$1_$2/', -1, STRING);
2117 string=prxchange('s/([A-Za-z])([0-9])/$1_$2/', -1, STRING);
2118 string=prxchange('s/([A-Za-z])([A-Za-z])/$1_1_$2/', -1, STRING);
2119 array var(14) $;
NOTE: The array var has the same name as a SAS-supplied or user-defined function. Parentheses
following this name are treated as array references and not function references.
2120 i=1;
2121 do while(scan(string,i,'_') ne '');
2122 var(i)=scan(string,i,'_');
2123 i+1;
2124 end;
2125 run;
NOTE: There were 4000000 observations read from the data set X.HAVE.
NOTE: The data set WORK.WANT_ART has 4000000 observations and 14 variables.
NOTE: DATA statement used (Total process time):
real time 20.46 seconds
cpu time 20.39 seconds
2126
2127 proc ds2;
2128
2129 thread break_string / overwrite=yes;
2130 vararray nchar(2) v[14];
2131 keep v:;
2132 method run();
2133 declare nchar(32) foo;
2134 declare int i;
2135 set x.have;
2136 v[1]=scan(string, 1, ',');
2137 v[2]=',';
2138 foo=scan(string, 2, ',');
2139 foo=prxchange('s/([a-z])([a-z])/\11\2/io', -1, foo);
2140 foo=prxchange('s/([a-z]+)(?!\s)/\1_/io', -1, foo);
2141 foo=prxchange('s/([0-9]+)/\1_/o', -1, foo);
2142 do i=3 to dim(v);
2143 v=scan(foo, i-2, '_');
2144 if missing(v) then leave;
2145 end;
2146 end;
2147 endthread;
2148 run;
NOTE: Created thread break_string in data set work.break_string.
NOTE: Execution succeeded. No rows affected.
2149
2150 data want_feds2 (overwrite=yes);
2151 declare thread break_string bsthread;
2152 method run();
2153 set from bsthread threads=4;
2154 end;
2155 enddata;
2156 run;
NOTE: BASE driver, creation of a NCHAR column has been requested, the table encoding is set to
UTF-8.
NOTE: Execution succeeded. 4000000 rows affected.
2157
2158 quit;
NOTE: PROCEDURE DS2 used (Total process time):
real time 5.11 seconds
cpu time 20.57 seconds
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
