DeepakSwain
Pyrite | Level 9

Hi there,

 

I need help splitting a string stored in one column of my data into multiple rows, based on bullets such as A, B, 1, 2, A., B., C., A), B), A:, B:, A- and B-. I then want to assign an identifier to each row.

DATA have;
text= "1) POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
 2) POLYP AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 3. ADENOMA AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 4. ADENOMA FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
 A- ADENOMATOUS TISSUE AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
 B- ADENOMATOUS TISSUE AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 A: POLYP FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
 B: POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."
;
RUN;


DATA want;
length text $100 ;
id=1; text="1) POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."; output;
id=2; text="2) POLYP AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS." ; output;
id=3; text="3. ADENOMA AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS." ; output;
id=4; text="4. ADENOMA FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND."; output;
id=5; text="A- ADENOMATOUS TISSUE AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."; output;
id=6; text="B- ADENOMATOUS TISSUE AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS." ; output;
id=7; text="A: POLYP FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND."; output;
id=8; text="B: POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."; output;
;
RUN;

Thank you in advance for your kind reply.

 

Swain
8 REPLIES 8
Reeza
Super User
Does your original text have line feeds in it to define the different rows, or are the rows identifiable only by the 1), 2), 3., 4., A- style bullets? Is it possible for A., A- or A: to appear inside the text itself as well as in the leading bullet?

DeepakSwain
Pyrite | Level 9
Hi Reeza,
Thank you for your quick reply. There are no line feeds or separate rows. Often all the items run together in one or more paragraphs.
Swain
Reeza
Super User

And what about this question: is it possible for A., A- or A: to appear inside the text itself as well as in the leading bullet?

mkeintz
PROC Star

If the answer to @Reeza's question about A., A- and A: is yes, here's a question that might mitigate the problem:

 

Do all desired text items end with a period?

Reeza
Super User

Then you need to provide a more robust example data set that has cases like that to ensure any solution works for your actual data.

 

Given everything you've said, though, I don't know that there will be a robust way to solve this other than doing it manually, and even then it's hard because the rules are ambiguous.

 

Ksharp
Super User
DATA have;
text= "1) POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
 2) POLYP AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 3. ADENOMA AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 4. ADENOMA FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
 A- ADENOMATOUS TISSUE AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
 B- ADENOMATOUS TISSUE AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 A: POLYP FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
 B: POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."
;
RUN;

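data temp;
 set have;
 n+1;                               /* counter for the source row */
 pid=prxparse('/\b\w\W/');          /* a one-character word followed by a non-word character, e.g. "1)" or "A-" */
 s=1;e=length(text);
 call prxnext(pid,s,e,text,p,l);    /* p = position of the first bullet, 0 if none */
 do while(p>0);
  output;                           /* one row per bullet position */
  call prxnext(pid,s,e,text,p,l);
 end;
 keep n p text;
run;
data want;
 /* look-ahead merge: _n and _p come from the following row of temp */
 merge temp temp(firstobs=2 keep=n p rename=(n=_n p=_p));
 if n=_n then want=substr(text,p,_p-p);   /* up to the next bullet of the same source row */
  else want=substr(text,p);               /* last bullet: take the rest of the text */
 /* first.n is only set when a BY statement is present, so use lag() to restart id per source row */
 if n ne lag(n) then id=0;
 id+1;
 keep n id want;
run;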
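
The pattern \b\w\W also matches any one-letter word inside the text (for example the "A " in "TYPE A ADENOMA"), which is the ambiguity raised earlier in the thread. As a rough sketch only, a tighter pattern could require the bullet to be a number or a single capital letter followed by one of . ) : - and a space. The bullet definition and the data set name want_sketch are assumptions for illustration, not something confirmed against the real data.

```sas
/* Sketch: split on bullets such as "1)", "3.", "A-", "B:" in one DATA step.  */
/* Assumption: a bullet is a number or one capital letter at a word boundary, */
/* followed by one of . ) : - and then a whitespace character.                */
data want_sketch;
 set have;
 n+1;                                       /* counter for the source row */
 pid=prxparse('/\b(?:\d+|[A-Z])[.):-]\s/');
 s=1; e=length(text);
 call prxnext(pid,s,e,text,p,l);            /* position of the first bullet, 0 if none */
 do id=1 by 1 while(p>0);
  start=p;                                  /* where the current item begins */
  call prxnext(pid,s,e,text,p,l);           /* look ahead to the next bullet */
  if p>0 then want=strip(substr(text,start,p-start));
  else want=strip(substr(text,start));      /* last item: rest of the text */
  output;
 end;
 keep n id want;
run;
```

On the sample text this should give the same eight rows, because words such as "COLON:" or "SERRATED." do not start with a one-character token. It will still split in the wrong place if something like "A: " or "2. " occurs inside an item, so it only narrows the ambiguity rather than removing it.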