Hi there,
I need your kind help to split a string available in one column of my data into multiple rows based on different bullets such A, B, 1, 2, A., B., C. , A), B), A:, B:, A- and B-. Finally I want to assign an identifier to each row.
DATA have;
text= "1) POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
2) POLYP AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS.
3. ADENOMA AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS.
4. ADENOMA FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
A- ADENOMATOUS TISSUE AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
B- ADENOMATOUS TISSUE AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS.
A: POLYP FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
B: POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."
;
RUN;
DATA want;
length text $100 ;
id=1; text="1) POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."; output;
id=2; text="2) POLYP AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS." ; output;
id=3; text="3. ADENOMA AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS." ; output;
id=4; text="4. ADENOMA FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND."; output;
id=5; text="A- ADENOMATOUS TISSUE AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."; output;
id=6; text="B- ADENOMATOUS TISSUE AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS." ; output;
id=7; text="A: POLYP FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND."; output;
id=8; text="B: POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."; output;
;
RUN;
Thank you in advance for your kind reply.
And this question: Is it possible to have A. A- or A: in the text as well as the first character set delimiter?
If the answer to @Reeza's question about A. A- A: is yes, Here's a question that might mitigate the problem
Do all desired text items end with a period?
Then you need to provide a more robust example data set that has cases like that to ensure any solution works for your actual data.
I don't know that there will be a robust way to solve this given everything you've said though, except for manually and even then it's hard because the rules are ambiguous.
DATA have; text= "1) POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED. 2) POLYP AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 3. ADENOMA AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 4. ADENOMA FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND. A- ADENOMATOUS TISSUE AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED. B- ADENOMATOUS TISSUE AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. A: POLYP FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND. B: POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED." ; RUN; data temp; set have; n+1; pid=prxparse('/\b\w\W/'); s=1;e=length(text); call prxnext(pid,s,e,text,p,l); do id=1 by 1 while(p>0); want=substr(text,p,l);output; call prxnext(pid,s,e,text,p,l); end; keep n p text; run; data want; merge temp temp(firstobs=2 keep=n p rename=(n=_n p=_p)); if n=_n then want=substr(text,p,_p-p); else want=substr(text,p); if first.n then id=0; id+1; keep n id want; run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.