Hi there,
I need your kind help to split a string available in one column of my data into multiple rows based on different bullets such A, B, 1, 2, A., B., C. , A), B), A:, B:, A- and B-. Finally I want to assign an identifier to each row.
DATA have;
text= "1) POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
2) POLYP AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS.
3. ADENOMA AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS.
4. ADENOMA FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
A- ADENOMATOUS TISSUE AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
B- ADENOMATOUS TISSUE AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS.
A: POLYP FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
B: POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."
;
RUN;
DATA want;
length text $100 ;
id=1; text="1) POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."; output;
id=2; text="2) POLYP AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS." ; output;
id=3; text="3. ADENOMA AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS." ; output;
id=4; text="4. ADENOMA FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND."; output;
id=5; text="A- ADENOMATOUS TISSUE AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."; output;
id=6; text="B- ADENOMATOUS TISSUE AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS." ; output;
id=7; text="A: POLYP FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND."; output;
id=8; text="B: POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."; output;
;
RUN;Thank you in advance for your kind reply.
And this question: Is it possible to have A. A- or A: in the text as well as the first character set delimiter?
If the answer to @Reeza's question about A. A- A: is yes, Here's a question that might mitigate the problem
Do all desired text items end with a period?
Then you need to provide a more robust example data set that has cases like that to ensure any solution works for your actual data.
I don't know that there will be a robust way to solve this given everything you've said though, except for manually and even then it's hard because the rules are ambiguous.
DATA have;
text= "1) POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
2) POLYP AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS.
3. ADENOMA AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS.
4. ADENOMA FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
A- ADENOMATOUS TISSUE AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
B- ADENOMATOUS TISSUE AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS.
A: POLYP FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
B: POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."
;
RUN;
data temp;
set have;
n+1;
pid=prxparse('/\b\w\W/');
s=1;e=length(text);
call prxnext(pid,s,e,text,p,l);
do id=1 by 1 while(p>0);
want=substr(text,p,l);output;
call prxnext(pid,s,e,text,p,l);
end;
keep n p text;
run;
data want;
merge temp temp(firstobs=2 keep=n p rename=(n=_n p=_p));
if n=_n then want=substr(text,p,_p-p);
else want=substr(text,p);
if first.n then id=0;
id+1;
keep n id want;
run;
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.