Re: Split out Delimiter Character String and count

Doyleuk1983 · Posted 11-09-2020 09:30 AM

Good afternoon All.

I have a table with a singular column which contains a variety of character strings delimited by a colon.

Example:

Data NewData;

length ReasonCode $255.;

Input REASONCODE $;

datalines;

aaa:bbb:ccc:ddd:eee

eee:aaa:fff:mmm

ddd:rrr:yyy:zzz

xxx:ggg

;

run;

No what I want to do is split out the string by the delimiter which I believe I had as per below. This creates Var1, 2, 3 etc from the string and those are populated with aaa, bbb, ccc etc

data NextStep ;

set NewData;

array var(40) $45.;

i=1;

do until (scan(ReasonCode,i,":") eq "");

var(i)=scan(ReasonCode,i,":");

i+1;

end;

run;

Thus creating the following dataset:

REASONCODE          var1 var2 var3 var4 var5
aaa:bbb:ccc:ddd:eee aaa   bbb   ccc   ddd   eee
eee:aaa:fff:mmm     eee   aaa   fff   mmm
ddd:rrr:yyy:zzz     ddd   rrr   yyy   zzz
xxx:ggg             xxx   ggg

Now where I'm getting stuck on logic/coding is that I want to take every iteration of new variable such as aaa, bbb, ccc and make it a Column Heading and count the number of times this appears in the data?

REASONCODE          aaa bbb ccc ddd eee ggg fff mmm rrr xxx yyy zzz
aaa:bbb:ccc:ddd:eee   1   1   1   1   1   0   0   0   0   0   0   0
eee:aaa:fff:mmm       1   0   0   0   1   0   1   1   0   0   0   0
ddd:rrr:yyy:zzz       0   0   0   1   0   0   0   0   1   0   1   1
xxx:ggg               0   0   0   0   0   1   0   0   0   1   0   0

Appreciate any help that you can provide.

novinosrin · Posted 11-09-2020 09:37 AM


Data NewData;

length ReasonCode $255.;

Input REASONCODE $;

datalines;
aaa:bbb:ccc:ddd:eee
eee:aaa:fff:mmm
ddd:rrr:yyy:zzz
xxx:ggg
;

run;

data temp;
 set newdata;
 do _n_=1 to countw(ReasonCode,':');
  temp=scan(ReasonCode,_n_,':');
  output;
 end;
run;
proc freq data=temp ;
 tables ReasonCode*temp/noprint sparse out=temp2(drop=percent);;
run;

proc transpose data=temp2 out=want(drop=_:);
 by ReasonCode;
 id temp;
 var count;
run;

Doyleuk1983 · Posted 11-09-2020 11:27 AM

I added in more data and not sure the solution is summing up correctly;

Data NewData;
length RSAP_ReasonCode $255.;
Input RSAP_REASONCODE $;
datalines;
aaa:bbb:ccc:ddd:eee
eee:aaa:fff:mmm
ddd:rrr:yyy:zzz
xxx:ggg
aaa:bbb:ccc:ddd:eee:fff
aaa:bbb:ccc:ddd:eee:ggg
aaa:bbb:ccc:ddd:eee:hhh
ddd:rrr:yyy:zzz
aaa:bbb:ccc:ddd:eee
eee:aaa:fff:mmm
aaa:bbb:ccc:ddd:eee:ggg
aaa:bbb:ccc:ddd:eee:ggg
;
run;

PeterClemmensen · Posted 11-09-2020 09:49 AM

If the variable order is not important

Data NewData;
length ReasonCode $255.;
Input REASONCODE $;
datalines;
aaa:bbb:ccc:ddd:eee
eee:aaa:fff:mmm
ddd:rrr:yyy:zzz
xxx:ggg
;
run;

data temp;
   set NewData;
   retain c 1;
   do i = 1 to countw(ReasonCode, ':');
      w = scan(ReasonCode, i, ':');
      output;
   end;
run;

proc transpose data=temp out=want(drop=_:);
   by ReasonCode notsorted;
   id w;
   var c;
run;

The DATA to DATA Step Macro
Blog: SASnrd

Doyleuk1983 · Posted 11-09-2020 11:29 AM

If I were to add in further data as follows:
Data NewData;
length RSAP_ReasonCode $255.;
Input RSAP_REASONCODE $;
datalines;
aaa:bbb:ccc:ddd:eee
eee:aaa:fff:mmm
ddd:rrr:yyy:zzz
xxx:ggg
aaa:bbb:ccc:ddd:eee:fff
aaa:bbb:ccc:ddd:eee:ggg
aaa:bbb:ccc:ddd:eee:hhh
ddd:rrr:yyy:zzz
aaa:bbb:ccc:ddd:eee
eee:aaa:fff:mmm
aaa:bbb:ccc:ddd:eee:ggg
aaa:bbb:ccc:ddd:eee:ggg
;
run;

the solution doesn't work and produces:

ERROR: The ID value "aaa" occurs twice in the same BY group.
ERROR: The ID value "bbb" occurs twice in the same BY group.
ERROR: The ID value "ccc" occurs twice in the same BY group.
ERROR: The ID value "ddd" occurs twice in the same BY group.
ERROR: The ID value "eee" occurs twice in the same BY group.
ERROR: The ID value "ggg" occurs twice in the same BY group.

ballardw · Posted 11-09-2020 11:50 AM

@Doyleuk1983 wrote:
If I were to add in further data as follows:
Data NewData;
length RSAP_ReasonCode $255.;
Input RSAP_REASONCODE $;
datalines;
aaa:bbb:ccc:ddd:eee
eee:aaa:fff:mmm
ddd:rrr:yyy:zzz
xxx:ggg
aaa:bbb:ccc:ddd:eee:fff
aaa:bbb:ccc:ddd:eee:ggg
aaa:bbb:ccc:ddd:eee:hhh
ddd:rrr:yyy:zzz
aaa:bbb:ccc:ddd:eee
eee:aaa:fff:mmm
aaa:bbb:ccc:ddd:eee:ggg
aaa:bbb:ccc:ddd:eee:ggg
;
run;

the solution doesn't work and produces:

ERROR: The ID value "aaa" occurs twice in the same BY group.
ERROR: The ID value "bbb" occurs twice in the same BY group.
ERROR: The ID value "ccc" occurs twice in the same BY group.
ERROR: The ID value "ddd" occurs twice in the same BY group.
ERROR: The ID value "eee" occurs twice in the same BY group.
ERROR: The ID value "ggg" occurs twice in the same BY group.

You cannot have duplicate ID for BY values unless you use the LET option on the procedure.

Not going to say that will fix everything but the errors will go away and you can examine the outputl

Split out Delimiter Character String and count