DeepakSwain
Pyrite | Level 9

Hi there,

 

I need your kind help splitting a string stored in one column of my data into multiple rows, based on bullets such as A, B, 1, 2, A., B., C., A), B), A:, B:, A- and B-. Finally, I want to assign an identifier to each row.

DATA have;
text= "1) POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
 2) POLYP AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 3. ADENOMA AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 4. ADENOMA FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
 A- ADENOMATOUS TISSUE AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
 B- ADENOMATOUS TISSUE AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 A: POLYP FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
 B: POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."
;
RUN;


DATA want;
length text $100 ;
id=1; text="1) POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."; output;
id=2; text="2) POLYP AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS." ; output;
id=3; text="3. ADENOMA AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS." ; output;
id=4; text="4. ADENOMA FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND."; output;
id=5; text="A- ADENOMATOUS TISSUE AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."; output;
id=6; text="B- ADENOMATOUS TISSUE AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS." ; output;
id=7; text="A: POLYP FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND."; output;
id=8; text="B: POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."; output;
;
RUN;

Thank you in advance for your kind reply.

 

Swain
8 REPLIES
Reeza
Super User
Does your original text have line feeds in it to define the different rows, or are the rows identifiable only via the 1), 2), 3., 4., A- type bullets? Is it possible for A. A- or A: to appear inside the text as well as at the start of an item as the delimiter?

DeepakSwain
Pyrite | Level 9
Hi Reeza,
Thank you for your quick reply. There are no line feeds or separate rows. The items usually run together in one or more paragraphs.
Swain
Reeza
Super User

And this question: Is it possible to have A. A- or A: in the text as well as the first character set delimiter?

mkeintz
PROC Star

If the answer to @Reeza's question about A. A- A: is yes, here's a question whose answer might mitigate the problem:

 

Do all desired text items end with a period?

Reeza
Super User

Then you need to provide a more robust example data set that has cases like that to ensure any solution works for your actual data.

 

Given everything you've said, I don't know that there is a robust way to solve this other than doing it manually, and even then it's hard because the rules are ambiguous.
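One way to find the ambiguous cases for a more robust example data set is to list every bullet candidate together with the character that precedes it. The step below is only a sketch under assumptions: the data set names sample and candidates, the sample text, and the uppercase-only bullet pattern are all made up for illustration, and a candidate is treated as "clean" only when it sits at the start of the text or right after a period.

```sas
/* Hypothetical sample: "GRADE B-" sits mid-sentence and is not a bullet. */
data sample;
  length text $200;
  text = "1) POLYP AT LEFT COLON: GRADE B- TUBULOVILLOUS. 2) POLYP AT RIGHT COLON: INFLAMMATORY.";
run;

data candidates;
  set sample;
  length lastchar $1 context $30;
  /* Candidate bullet: one letter or digit, then ) . : or -, then white space. */
  pid = prxparse('/\b[A-Z0-9][).:\-]\s/');
  startpos = 1;
  stoppos  = length(text);
  call prxnext(pid, startpos, stoppos, text, pos, len);
  do while (pos > 0);
    /* Last non-blank character before the candidate;   */
    /* the start of the text is treated like a period.  */
    if pos = 1 then lastchar = '.';
    else lastchar = first(left(reverse(substr(text, 1, pos - 1))));
    ambiguous = (lastchar ne '.');
    /* A little surrounding text for eyeballing; truncated to 30 characters. */
    context = substr(text, max(1, pos - 10));
    output;
    call prxnext(pid, startpos, stoppos, text, pos, len);
  end;
  keep pos ambiguous context;
run;
```

Rows with ambiguous=1 (here, the "B-" after GRADE) are the ones worth adding to the example data before settling on any splitting rule.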

 

Ksharp
Super User
DATA have;
text= "1) POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
 2) POLYP AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 3. ADENOMA AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 4. ADENOMA FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
 A- ADENOMATOUS TISSUE AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
 B- ADENOMATOUS TISSUE AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 A: POLYP FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
 B: POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."
;
RUN;

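/* Step 1: record the start position p of every bullet candidate, */
/* i.e. a single word character followed by a non-word character. */
data temp;
 set have;
 n+1;
 pid=prxparse('/\b\w\W/');
 s=1;e=length(text);
 call prxnext(pid,s,e,text,p,l);
 do while(p>0);
  output;
  call prxnext(pid,s,e,text,p,l);
 end;
 keep n p text;
run;
/* Step 2: look ahead one row to find where the next item starts. */
data want;
 merge temp temp(firstobs=2 keep=n p rename=(n=_n p=_p));
 if n=_n then want=substr(text,p,_p-p);
  else want=substr(text,p);
 /* No BY statement is in effect, so FIRST.N is never set; */
 /* compare with the previous row's n instead.             */
 if n ne lag(n) then id=0;
 id+1;
 keep n id want;
run;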
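
If every item does end with a period, as asked earlier in the thread, the bullet test can be tightened so that a bullet counts only at the start of the text or right after a period. What follows is a minimal one-pass sketch under that assumption, not a tested solution for the real data. The names have2 and want2, the sample text, the $200 length, and the uppercase-only pattern are illustrative assumptions.

```sas
/* Hypothetical sample: "TYPE A:" appears mid-sentence and must not start a new row. */
data have2;
  length text $300;
  text = "1) POLYP AT LEFT COLON: TYPE A: TUBULOVILLOUS. NEGATIVE FOR SERRATED. "
      || "2) POLYP AT RIGHT COLON: INFLAMMATORY. "
      || "A- ADENOMATOUS TISSUE AT LEFT COLON: TUBULOVILLOUS.";
run;

data want2;
  set have2;
  length want $200;
  /* Capture group 1 is the bullet: one letter or digit, then ) . : or -, then white space. */
  /* The bullet must sit at the start of the text or follow a period and white space.       */
  pid = prxparse('/(?:^|\.\s+)([A-Z0-9][).:\-]\s)/');
  startpos = 1;
  stoppos  = length(text);
  prev = 0;                      /* start of the item currently being collected */
  id   = 0;
  call prxnext(pid, startpos, stoppos, text, pos, len);
  do while (pos > 0);
    call prxposn(pid, 1, bpos);  /* position of the bullet itself */
    if prev > 0 then do;         /* close the previous item just before this bullet */
      id + 1;
      want = strip(substr(text, prev, bpos - prev));
      output;
    end;
    prev = bpos;
    call prxnext(pid, startpos, stoppos, text, pos, len);
  end;
  if prev > 0 then do;           /* the last item runs to the end of the text */
    id + 1;
    want = strip(substr(text, prev));
    output;
  end;
  keep id want;
run;
```

On this sample the step is intended to yield three rows, with "TYPE A:" left inside the first one. It still fails when a real bullet follows text that lacks a closing period, so it is worth checking against the actual reports first.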
