Solved: Re: Split h1-h10 into h1,h2,h3,h4,.... h10

Richardvan_tHoff · Posted 09-02-2013 07:59 AM

I want to translate the contents of a column.

For example, the content includes the following value (this is the content of 1 column in 1 row):

G1, G2, G4, X1 -X3, H51-H55, T10

I want to translate that to different rows.

The result should look like this:

G1

G2

G4

X1

X2

X3

H51

H52

H53

H54

H55

T10

I have not problems making rows of the single ones (I mean G1 G2 G4 and T10). The function scan works perfectly.

The problem is how I translate X1-X3 into X1, X2, X3 and the same for H51-H55.

Haikuo · Posted 09-02-2013 01:21 PM

Yes, DN, can never go under your radar -:). ANYFIRST should have a shot, while at this stage it seems to me using PRX is more straightforward.

data test;

input var $ 80.;

length new_var $ 20;

do _j=1 to countw(var,',');

_var=scan(var,_j,',');

if find(_var,'-') then do;

_v1=scan(_var,1,'- ');

_v2=scan(_var,2,'- ');

do _i=prxchange('s/(\w*\D+)(\d+)/$2/',-1,_v1) to prxchange('s/(\w*\D+)(\d+)/$2/',-1,_v2);

new_var=cats(prxchange('s/(\w*\D+)(\d+)/$1/',-1,_v1),_i);

output;

end;

else do;new_var=left(_var);output;end;

end;

drop _:;

cards;

G1, G2, G4, X1 -X3, H51-H55, T10,NA3ME1-NA3ME3, 34WE100-34WE103

;

Haikuo

View solution in original post

Haikuo · Posted 09-02-2013 09:30 AM

With a combination of bunch of Char functions, we can have something like this:

data test;

input var $ 50.;

length new_var $ 8;

do _j=1 to countw(var,',');

_var=scan(var,_j,',');

if find(_var,'-') then

do _i=compress(scan(_var,1,'-'),,'kd') to compress(scan(_var,2,'-'),,'kd');

new_var=cats(compress(scan(_var,1,'-'),,'d'),_i);

output;

end;

else do;new_var=left(_var);output;end;

end;

drop _:;

cards;

G1, G2, G4, X1 -X3, H51-H55, T10

;

Haikuo

data_null__ · Posted 09-02-2013 09:43 AM

What if a word has an imbedded digit? NA3ME1-NA3ME1

Consider using the ANYFIRST function with negative start.

Haikuo · Posted 09-02-2013 01:21 PM

Yes, DN, can never go under your radar -:). ANYFIRST should have a shot, while at this stage it seems to me using PRX is more straightforward.

data test;

input var $ 80.;

length new_var $ 20;

do _j=1 to countw(var,',');

_var=scan(var,_j,',');

if find(_var,'-') then do;

_v1=scan(_var,1,'- ');

_v2=scan(_var,2,'- ');

do _i=prxchange('s/(\w*\D+)(\d+)/$2/',-1,_v1) to prxchange('s/(\w*\D+)(\d+)/$2/',-1,_v2);

new_var=cats(prxchange('s/(\w*\D+)(\d+)/$1/',-1,_v1),_i);

output;

end;

else do;new_var=left(_var);output;end;

end;

drop _:;

cards;

G1, G2, G4, X1 -X3, H51-H55, T10,NA3ME1-NA3ME3, 34WE100-34WE103

;

Haikuo

Richardvan_tHoff · Posted 09-11-2013 08:17 AM

Thanks all. I am using the code already from hai.kuo. Only made a minor change, because the variable could also contain only numbers.

data test;

input var $ 80.;

length new_var $ 20;

do _j=1 to countw(var,',');

_var=scan(var,_j,',');

if find(_var,'-') then do;

_v1=scan(_var,1,'- ');

_v2=scan(_var,2,'- ');

do _i=prxchange('s/(\w*\D+)(\d+)/$2/',-1,_v1) to prxchange('s/(\w*\D+)(\d+)/$2/',-1,_v2);

IF(prxmatch("/\D+/",strip(_v1))) THEN

new_var=cats(prxchange('s/(\w*\D+)(\d+)/$1/',-1,_v1),_i);

ELSE

new_var=_i;

output;

end;

else do;new_var=left(_var);output;end;

end;

drop _:;

cards;

G1, G2, G4, X1 -X3, H51-H55, T10,NA3ME1-NA3ME3, 34WE100-34WE103, 1-10

;

The only problem I am trying to solve is when you have H01-H10. The above code would split it in the below rows

H1

H2

H3

H4

...

H10

While it should be

H01

H02

H03

H04

...

H10

Vince28_Statcan · Posted 09-11-2013 09:51 AM

That's actually fairly tricky to add within a single data step since you can't get to enforce a zn.w format. I see 2 alternatives

1: Use PRX to capture number of leading 0s from your lower boundary and then control the number of those that are output based on your _i counter (like remove one 0 from the captured lead for each increment in log10(_i)). It would resemble something like this (untested)

length lead $20.; /*just so that prxchange output does not get converted to numeric*/

...

do _i=prxchange('s/(\w*\D+)(\d+)/$2/',-1,_v1) to prxchange('s/(\w*\D+)(\d+)/$2/',-1,_v2);

IF(prxmatch("/\D+/",strip(_v1))) THEN

lead=prxchange('s/(\w*\D+)(0*)(\d+)/$2/',-1,_v1); /*you could change \d+ to [1-9]\d+ or something to clearly identify the purpose of 0* but since * is greedy its only a minor efficiency issue */

new_var=cats(prxchange('s/(\w*\D+)(\d+)/$1/',-1,_v1), length), substrn(left(lead), 1, length(trim(left(lead)))-(floor(log10(_i))-floor(log10(prxchange('s/(\w*\D+)(\d+)/$2/',-1,_v1)))) ) ,_i);

...

drop lead;

2: Another way would build a series of case based on the length of the fully captured digit as a string for the lower bound and use put& zn.w formats e.g.

do _i=prxchange('s/(\w*\D+)(\d+)/$2/',-1,_v1) to prxchange('s/(\w*\D+)(\d+)/$2/',-1,_v2);

IF(prxmatch("/\D+/",strip(_v1))) THEN

zlen = length(prxchange('s/(\w*\D+)(\d+)/$2/',-1,_v1));

if zlen=1 then new_var=cats(prxchange('s/(\w*\D+)(\d+)/$1/',-1,_v1),_i);

else if zlen=2 then new_var=cats(prxchange('s/(\w*\D+)(\d+)/$1/',-1,_v1),put(_i, z2.);

...

else if zlen=8 then new_var=cats(prxchange('s/(\w*\D+)(\d+)/$1/',-1,_v1),put(_i, z8.));

art297 · Posted 09-11-2013 09:54 AM

: The method I suggested will retain the leading zeros if present in the variable list.

data_null__ · Posted 09-11-2013 10:26 AM

Arthur Tabachneck wrote:

Richard van't Hoff: The method I suggested will retain the leading zeros if present in the variable list.

I think there will be a problem defining variables with name like (at least in an input statement).

34WE100-34WE103, 1-10

I could not get that to work without making the words name literials. Of course it wouldn't be too hard to add some code to convert those words to name literials.

data_null__ · Posted 09-11-2013 10:41 AM

Maybe this.

options validvarname=any;
%let list = G1, G2, G4, X01 -X03, H51-H55, T10,NA3ME1-NA3ME3, 34WE100-34WE103 1-10 009-005;

data _null_; 
   length list $256 w $32 newlist $256; 
   list = transtrn(translate(symget('LIST'),' ',','),'-',' - ');
   do i = 1 by 1; 
      w = scan(list,i,' ');
      if missing(w) then leave;
      if w ne '-' then w = nliteral(w);
      newlist = catx(' ',newlist,w);
      end; 
   call symputx('LIST',newlist);
   put _all_; 
   run; 

data names;
   input &list;
   stop; 
   cards; 
;;;;
   run; 

proc transpose data=names out=names2;
   var _all_;
   run; 
proc print; 
   run;

art297 · Posted 09-11-2013 01:17 PM

I didn't even notice the 1-10. Hopefully, that was just a typo by the OP. I was simply addressing the case like T01-T10.

Richardvan_tHoff · Posted 09-16-2013 09:40 AM

Thanks all.

I finally got it working for everything (even for my last problem).

My solution

data work.testje;

input var $ 100.;

length new_var $ 20;

do _j=1 to countw(var,',');

_var=scan(var,_j,',');

if find(_var,'-') then do;

_v1=scan(_var,1,'- ');

_v2=scan(_var,2,'- ');

do _i=prxchange('s/(\w*\D+)(\d+)/$2/',-1,_v1) to prxchange('s/(\w*\D+)(\d+)/$2/',-1,_v2);

_len_new=LENGTH(STRIP(_i));

_len_old=LENGTH(SUBSTR(STRIP(_v1),(LENGTH(STRIP(_v1))-ANYALPHA(STRIP(REVERSE(_v1)))) +2));

_new_i=REPEAT('0',_len_old - _len_new - 1)||STRIP(_i);

_turned=REVERSE(STRIP(_v1));

SUBSTR(_turned,1,_len_old)=REVERSE(STRIP(_new_i));

new_var=REVERSE(STRIP(_turned));

output;

end;

else do;new_var=left(_var);output;end;

end;

cards;

G1, G2, G4, X1 -X3, H051-H055, T10,NA03ME1-NA03ME3, 34WE100-34WE103,001-008, 1-10, H0001- H0010

data_null__ · Posted 09-02-2013 09:45 AM

These look like variable lists. If they are you might be able to use that to your advantage.

art297 · Posted 09-02-2013 12:04 PM

I agree with datanull. If they are variable lists, you could use something like the following to convert them into a macro variable and then scan the macro variable. e.g.:

data have;

informat contents $80.;

input contents &;

cards;

G1, G2, G4, X1 -X3, H51-H55, T10

;

data _null_;

set have;

contents=translate(contents,' ',',');

call symput('varlist',contents);

run;

data test;

input &varlist.;

cards;

;

run;

proc sql noprint;

select name

into :vars separated by " "

from dictionary.columns

where libname="WORK" and

memname="TEST"

;

quit;

%put &vars.;

Astounding · Posted 09-02-2013 01:46 PM

Art,

Got this from Tom, who got it from DN, who got it elsewhere ...

An alternative ending to your solution maintains the order of the names, and produces output as a single column. Instead of reading from DICTIONARY.COLUMNS, try:

proc transpose data=test (obs=0) out=want;

var &varlist;

run;

art297 · Posted 09-02-2013 02:40 PM

: Yes, I was familiar with the method, but not sure of its origin either. Tom, Randy Herbison, FriedEgg and I have a nifty use of it in a paper we're presenting at the MWSUG meeting later this month. Take a look at: http://www.sascommunity.org/mwiki/images/c/c5/S1-14-2013.pdf

However, I usually like the SQL method more often as one can order the variables based on any variable that is contained in dictionary.columns, including name and column number.

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away