Hello,
i have a variable separated by ;
Examples :
D01;D02;D03;D04;D05; should become D01-D05
D01;D02;D03;D05; should become D01-D03;D05
D10;D11;D12;D13;D17;D20;D21;D22 should become D10-D13;D17;D20-D22;
Any help to get me on my way?
Thanks!
Here is one version that can be used for more general conditions, such as undefined length of substrings, undefined number of character or digits within the substrings.
data have;
input var $ 60.;
datalines4;
D01;D02;D03;D04;D05;AA101;AA102;AA103;
D01;D02;D03;D05;E01;F01;F02;
D10;D11;D12;D13;D17;D20;D21;D22;
;;;;
data want (drop=_:);
set have;
length _var VarFinal $100.;
do _i=1 by 1 until (missing (scan(var,_i,';')));
if _i=1 then _sv=scan(var,_i,';');
_mv1=scan(var,_i,';');
_mc1=compress(_mv1,,'ka');
_mn1=compress(_mv1,,'kd');
_mv2=scan(var,_i+1,';');
_mc2=compress(_mv2,,'ka');
_mn2=compress(_mv2,,'kd');
if (_mc1 ne _mc2 or _mn2-_mn1>1) then
do;
if _sv = _mv1 then _var=_sv;
else _var=catx('-',_sv,_mv1);
_sv=_mv2;
end;
if not missing(_var) then varFinal=catx(';',varFinal,_var);
call missing(_var);
end;
run;
proc print;run;
Following code can also applied to more general condition, such as: D01;E01;
Limitation: Can only work on 3-character substring like this: D01; while it can be tweaked to accomodate more complex scenarios.
data have;
input var $40.;
datalines4;
D01;D02;D03;D04;D05;
D01;D02;D03;D05;E01;F01;F02;
D10;D11;D12;D13;D17;D20;D21;D22;
;;;;
data want (drop=_:);
set have;
length _var VarFinal $40.;
do _i=1 by 1 until (missing (scan(var,_i,';')));
if _i=1 then _sv=scan(var,_i,';');
_mv1=scan(var,_i,';');
_mv2=scan(var,_i+1,';');
if (first(_mv1) ne first(_mv2) or substr(_mv2,2,2)-substr(_mv1,2,2)>1) then
do;
if _sv = _mv1 then _var=_sv;
else _var=catx('-',_sv,_mv1);
_sv=_mv2;
end;
if not missing(_var) then varFinal=catx(';',varFinal,_var);
call missing(_var);
end;
run;
proc print;run;
Regards,
Haikuo
Here is one version that can be used for more general conditions, such as undefined length of substrings, undefined number of character or digits within the substrings.
data have;
input var $ 60.;
datalines4;
D01;D02;D03;D04;D05;AA101;AA102;AA103;
D01;D02;D03;D05;E01;F01;F02;
D10;D11;D12;D13;D17;D20;D21;D22;
;;;;
data want (drop=_:);
set have;
length _var VarFinal $100.;
do _i=1 by 1 until (missing (scan(var,_i,';')));
if _i=1 then _sv=scan(var,_i,';');
_mv1=scan(var,_i,';');
_mc1=compress(_mv1,,'ka');
_mn1=compress(_mv1,,'kd');
_mv2=scan(var,_i+1,';');
_mc2=compress(_mv2,,'ka');
_mn2=compress(_mv2,,'kd');
if (_mc1 ne _mc2 or _mn2-_mn1>1) then
do;
if _sv = _mv1 then _var=_sv;
else _var=catx('-',_sv,_mv1);
_sv=_mv2;
end;
if not missing(_var) then varFinal=catx(';',varFinal,_var);
call missing(_var);
end;
run;
proc print;run;
Going further, here is another more general approach which works on unsorted substrings with the same variable:
data have;
input var $ 60.;
datalines4;
AA103;D01;D05;D03;D04;D02;AA101;AA102;
F01;D02;D03;D05;E01;D01;F02;
D1000;D11;D22;D13;D17;D20;D21;D12;
;;;;
proc sql;
select max(countw(var)) into :num from have;quit;
filename new_v "h:\temp\new_v.txt";
data _null_ ;
set have;
file new_v;
array v $10 _v1-_v%sysevalf(&num+1);
do _i=1 by 1 until (missing (scan(var,_i,';')));
v(_i)=scan(var,_i,';');
end;
call sortc(of v(*));
put v(*);
run;
data want (drop=_:) ;
infile new_v;
length _var VarFinal $100.;
input var : & $300.;
do _i=1 by 1 until (missing (scan(var,_i)));
if _i=1 then _sv=scan(var,_i);
_mv1=scan(var,_i);
_mc1=compress(_mv1,,'ka');
_mn1=compress(_mv1,,'kd');
_mv2=scan(var,_i+1);
_mc2=compress(_mv2,,'ka');
_mn2=compress(_mv2,,'kd');
if (_mc1 ne _mc2 or abs(_mn2-_mn1)>1 ) then
do;
if _sv = _mv1 then _var=_sv;
else _var=catx('-',_sv,_mv1);
_sv=_mv2;
end;
if not missing(_var) then varFinal=catx(';',varFinal,_var);
call missing(_var);
end;
run;
proc print;run;
thanks for all your answers
I have to calculate a second variable now:
I have the same strings:
D01;D02;D03;D04;D05; should become DDDDD
D01;D02;D03;D05; should become DDD-D
D10;D11;D12;D13;D17;D20;D21;D22 should become ---------DDDD---D--DDD;
Question:
The '-' in between groups, is it always just one? or quantiy of '-'s reflects the size of the gap?
the - reflects the size of the gap, so for each number that is not in the row there is a -
I see you already have an answer. Anyway, if pursuing the same logic that I have applied on the previous ones, here is one solution:
This only applied to the same character scenario, as I don't know your rules about transits between , say D02 to FF03, if there are such things in real life.
data have;
input var $40.;
datalines4;
D01;D02;D03;D04;D05;
D01;D02;D03;D05;
D10;D11;D12;D13;D17;D20;D21;D22;
;;;;
data want (drop=_:);
set have;
length VarFinal $40.;
do _i=1 by 1 until (missing (scan(var,_i,';')));
if _i=1 then
do;
_sv=scan(var,_i,';');
_svc=substr(_sv,1,1);
VarFinal=_svc;
end;
_mv1=scan(var,_i,';');
_mv2=scan(var,_i+1,';');
_m21=substr(_mv2,2,2)-substr(_mv1,2,2);
if _m21=1 then Varfinal=cats(Varfinal,_svc);
else if _m21>1 then
do;
do _k=1 by 1 to (_m21-1);
VarFinal=cats(VarFinal,'-');
end;VarFinal=cats(VarFinal,_svc);
end;
end;
run;
proc print;run;
This one is actually easier than the original problem.
data have;
input var $ 60.;
datalines4;
D01;D02;D03;D04;D05;
D01;D02;D03;D05;
A10;A11;A12;A13;A17;A20;A21;A22;
;;;;
data want (drop=_:);
set have;
length newvar $100 _subvar $5 _gapchar $100;
do _i = 1 by 1 until ( missing( _subvar ) );
_subvar = scan( var, _i, ';' );
_prefix = compress( _subvar,, 'ka' );
_suffix = compress( _subvar,, 'kd' );
_lagsuffix = lag( _suffix );
_gapchar = '';
* check for gap;
_gap = _suffix - coalesce(_lagsuffix,0) - 1;
if _gap > 0 then _gapchar = repeat( '-', _gap-1 );
newvar = cats( newvar, _gapchar, _prefix );
end;
run;
Here is another variation using the LAG function:
data want (drop=_:);
set have;
length newvar $100 _subvar $5;
_isincrement = 0;
do _i = 1 by 1 until ( missing( _subvar ) );
_subvar = scan( var, _i, ';' );
_prefix = compress( _subvar,, 'ka' );
_lagprefix = lag( _prefix );
_suffix = compress( _subvar,, 'kd' );
_lagsuffix = lag( _suffix );
_lagsubvar = lag( _subvar );
if _prefix = _lagprefix and _suffix = _lagsuffix + 1 then do;
_isincrement = 1;
end;
else do;
if _isincrement then newvar = catx( '-', newvar, _lagsubvar );
newvar = catx( ';', newvar, _subvar );
_isincrement = 0;
end;
end;
run;
I prefer to Linlin's solution.
data have; input var $40.; datalines4; D01;D02;D03;D04;D05; D01;D02;D03;D05; D10;D11;D12;D13;D17;D20;D21;D22; ;;;; run; data temp1(keep=group d num); set have; group+1; i=1; d=substr(scan(var,i,';'),1,1); num=input(substr(scan(var,i,';'),2),best8.); do while(not missing(d)); output; i+1; d=substr(scan(var,i,';'),1,1); num=input(substr(scan(var,i,';'),2),best8.); end; run; proc sort data=temp1;by group num;run; data temp2(keep=group num); set temp1(rename=(num=_num)); by group; if last.group then do; do num=1 to _num; output; end; end; run; data x; merge temp1 temp2; by group num; if missing(d) then d='-'; run; data want(keep=have want); set x; by group; length have want $ 400; retain have want; if first.group then call missing(want,have); if d ne '-' then have=catx(';',have,cats(d,num)); want=cats(want,d); if last.group then output; run;
Ksharp
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.