Hello community!
I would like some help with this. I somehow did it manually, but I would want to learn how to do it with a more sophisticated code.
I need to split this grid into columns (up to 12 columns), with the following conditions:
- If it is a "C" or a "-" it must go in a single column.
- If it is a number, example "03" "09" "23", it must go in a single column.
Example:
Obs | GRID | Column1 | Column2 | Column3 | Column4 | … | Column 12 |
1 | 13131313130706050403--01CCCCCCCC | 13 | 13 | 13 | 13 | ||
2 | CCCCCCCCCCCCCCCCCCCCCC | C | C | C | C | ||
3 | 13130234 | 13 | 13 | 02 | 34 | ||
4 | 13131354 | 13 | 13 | 13 | 54 | ||
5 | CC020101CCC01CC01CC01CC-- | C | C | 02 | 01 | ||
6 | --C20133 | - | - | C | 20 | ||
7 | C0201CC | C | 02 | 01 | C | ||
8 | C030201C | C | 03 | 02 | 01 |
Thanks in advance!!
Here's how I did it with part of your data:
data blah;
length x $200.;
x='13131313130706050403--01CCCCCCCC';output;
x='CCCCCCCCCCCCCCCCCCCCCC'; output;
x='13130234';output;
run;
data blah2;
set blah end=last;
n_c=count(x,'C');
n_dash=count(x,'-');
if ^missing(compress(x,'-C')) then n_num=length(compress(x,'-C'))/2;
else n_num=0;
retain n_columns;
if sum(n_c,n_dash,n_num)>n_columns then n_columns=sum(n_c,n_dash,n_num);
if last then call symputx('ncol',n_columns);
run;
data blah3;
set blah2;
array column {&ncol} $2.;
col=0;
indx=1;
do until (indx>length(x));
if substr(x,indx,1) in('C' '-') then do;
col=col+1;
column(col)=substr(x,indx,1);
indx=indx+1;
end;
else do;
col=col+1;
column(col)=substr(x,indx,2);
indx=indx+2;
end;
end;
drop col indx n_c n_dash n_num;
run;
The first parts are figuring out how large the array needs to be to hold all of the columns. The last part is filling in the array by incrementing a substring function through and increasing the index based on the character it comes across. This assumes all numbers are two digits.
Are all numbers two digits?
Hi!
Yes, as the ones between 1 and 9 have leading 0s.
Here's how I did it with part of your data:
data blah;
length x $200.;
x='13131313130706050403--01CCCCCCCC';output;
x='CCCCCCCCCCCCCCCCCCCCCC'; output;
x='13130234';output;
run;
data blah2;
set blah end=last;
n_c=count(x,'C');
n_dash=count(x,'-');
if ^missing(compress(x,'-C')) then n_num=length(compress(x,'-C'))/2;
else n_num=0;
retain n_columns;
if sum(n_c,n_dash,n_num)>n_columns then n_columns=sum(n_c,n_dash,n_num);
if last then call symputx('ncol',n_columns);
run;
data blah3;
set blah2;
array column {&ncol} $2.;
col=0;
indx=1;
do until (indx>length(x));
if substr(x,indx,1) in('C' '-') then do;
col=col+1;
column(col)=substr(x,indx,1);
indx=indx+1;
end;
else do;
col=col+1;
column(col)=substr(x,indx,2);
indx=indx+2;
end;
end;
drop col indx n_c n_dash n_num;
run;
The first parts are figuring out how large the array needs to be to hold all of the columns. The last part is filling in the array by incrementing a substring function through and increasing the index based on the character it comes across. This assumes all numbers are two digits.
Thank you so much!
Just to learn it properly. I don't understand this part of the code. What is it for?
if ^missing(compress(x,'-C')) then n_num=length(compress(x,'-C'))/2;
Thanks again!
@JeffMeyers > take the length of a missing character it still returns 1.
Use the lengthn() function.
I think that you need to explicitly show us what the entire result for row 6 might be.
Sorry, I manually modified that one in excel (leading 0 in the last "3" was missing). The correct grid for row 6 is:
6 | --C201303 | - | - | C | 20 | 13 | 03 |
Like this?
data WANT;
set HAVE;
array COL [12] $2;
POS=1;
do COLNO = 1 to 12 while (char(GRID,POS) ne ' ');
if '0' <= char(GRID,POS) <= '9' then do;
COL[COLNO] = substr(GRID,POS,2);
POS+2;
end;
else do;
COL[COLNO] = char(GRID,POS);
POS+1;
end;
end;
run;
GRID | COL1 | COL2 | COL3 | COL4 | COL5 | COL6 | COL7 | COL8 | COL9 | COL10 | COL11 | COL12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
13131313130706050403--01CCCCCCCC | 13 | 13 | 13 | 13 | 13 | 07 | 06 | 05 | 04 | 03 | - | - |
CCCCCCCCCCCCCCCCCCCCCC | C | C | C | C | C | C | C | C | C | C | C | C |
13130234 | 13 | 13 | 02 | 34 | ||||||||
--C0234 | - | - | C | 02 | 34 |
It worked perfectly fine! Thanks!
Regular expression solution :
data have;
input grid : $50. ;
cards;
3131313130706050403--01CCCCCCCC
CCCCCCCCCCCCCCCCCCCCCC
13130234
13131354
CC020101CCC01CC01CC01CC--
--C20133
C0201CC
C030201C
;
run;
data want(rename=(grid_=grid));
length grid_ $50;
set have;
/*REGEX pattern explained*/
/*/(\d{2}|\w{1}|\W{1})/io*/
/*1st Capturing Group (\d{2}|\w{1}|\W{1})*/
/*1st Alternative \d{2}*/
/*\d{2} matches a digit (equal to [0-9])*/
/*{2} Quantifier — Matches exactly 2 times*/
/*2nd Alternative \w{1}*/
/*\w{1} matches any word character (equal to [a-zA-Z0-9_])*/
/*{1} Quantifier — Matches exactly one time (meaningless quantifier)*/
/*3rd Alternative \W{1}*/
/*\W{1} matches any non-word character (equal to [^a-zA-Z0-9_])*/
/*{1} Quantifier — Matches exactly one time */
patternID = prxparse('/\d{2}|\w{1}|\W{1}/io');
array col[*] $ col1-col12;
grid_=grid;
do i=1 to length(grid);
call prxsubstr(patternID,grid,pos,len);
col[i]=substr(grid,pos,len);
grid=substr(grid,pos+len);
if i eq 12 then leave;
end;
drop grid patternID pos len i;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.