Splitting one row to multiple rows and assigning separate identifier to each row

DeepakSwain · Posted 01-16-2020 01:57 PM

Hi there,

I need your help splitting a string stored in one column of my data into multiple rows, based on bullets such as A, B, 1, 2, A., B., C., A), B), A:, B:, A- and B-. Finally, I want to assign an identifier to each row.

DATA have;
text= "1) POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
 2) POLYP AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 3. ADENOMA AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 4. ADENOMA FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
 A- ADENOMATOUS TISSUE AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
 B- ADENOMATOUS TISSUE AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 A: POLYP FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
 B: POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."
;
RUN;


DATA want;
length text $100 ;
id=1; text="1) POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."; output;
id=2; text="2) POLYP AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS." ; output;
id=3; text="3. ADENOMA AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS." ; output;
id=4; text="4. ADENOMA FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND."; output;
id=5; text="A- ADENOMATOUS TISSUE AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."; output;
id=6; text="B- ADENOMATOUS TISSUE AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS." ; output;
id=7; text="A: POLYP FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND."; output;
id=8; text="B: POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."; output;
;
RUN;

Thank you in advance for your kind reply.

Swain

Reeza · Posted 01-16-2020 02:01 PM

Does your original text have line feeds in it to define the different rows, or are the rows identifiable only via the 1), 2), 3., 4., A- type markers? Is it possible to have A., A- or A: within the text itself as well as in the leading delimiter position?

DeepakSwain · Posted 01-16-2020 02:18 PM

Hi Reeza,
Thank you for your quick reply. There are no line feeds or separate rows. Often all the items run together in one or more paragraphs.

Swain

Reeza · Posted 01-16-2020 02:56 PM

And this question: Is it possible to have A. A- or A: in the text as well as the first character set delimiter?

mkeintz · Posted 01-16-2020 03:13 PM

If the answer to @Reeza's question about A. A- A: is yes, here's a question that might mitigate the problem:

Do all desired text items end with a period?

DeepakSwain · Posted 01-17-2020 08:15 AM

They do not always end with a period.

Swain

DeepakSwain · Posted 01-17-2020 08:13 AM

Yes

Swain

Reeza · Posted 01-17-2020 11:44 AM

Then you need to provide a more robust example data set that includes cases like that, so that any solution can be checked against your actual data.

Given everything you've said, though, I'm not sure there is a robust way to solve this other than manually, and even that is hard because the rules are ambiguous.

Ksharp · Posted 01-18-2020 12:00 PM

DATA have;
text= "1) POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
 2) POLYP AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 3. ADENOMA AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 4. ADENOMA FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
 A- ADENOMATOUS TISSUE AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
 B- ADENOMATOUS TISSUE AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 A: POLYP FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
 B: POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."
;
RUN;

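/* Step 1: record the start position of every bullet, i.e. a single word
   character followed by a non-word character ("1)", "3.", "A-", "B:"). */
data temp;
 set have;
 n+1;                              /* sequence number of the source row */
 pid=prxparse('/\b\w\W/');
 s=1;e=length(text);
 call prxnext(pid,s,e,text,p,l);
 do id=1 by 1 while(p>0);          /* id restarts at 1 for each source row */
  output;                          /* one row per bullet found */
  call prxnext(pid,s,e,text,p,l);
 end;
 keep n id p text;
run;

/* Step 2: look ahead one row to see where the next bullet starts. */
data want;
 merge temp temp(firstobs=2 keep=n p rename=(n=_n p=_p));
 if n=_n then want=substr(text,p,_p-p);  /* up to the next bullet of the same source row */
  else want=substr(text,p);              /* last bullet: take the rest of the text */
 keep n id want;
run;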
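
For what it's worth, the same idea can probably be done in a single DATA step, and with a stricter pattern. The sketch below is not tested against real data and rests on assumptions that are mine, not the original poster's: a bullet is one uppercase letter or digit, followed by one of `)`, `.`, `:` or `-`, followed by white space. The names `want_sketch`, `item`, `start_pos`, `stop_pos` and `cur` and the `$200` length are purely illustrative.

```sas
data want_sketch;
 set have;
 length item $200;            /* assumed maximum item length; longer items would be truncated */
 retain pid;
 /* compile the pattern once; this bullet definition is an assumption */
 if _n_=1 then pid=prxparse('/\b[A-Z0-9][).:\-]\s/');
 n+1;                         /* sequence number of the source row */
 start_pos=1;
 stop_pos=length(text);
 id=0;                        /* restart the item counter for each source row */
 call prxnext(pid,start_pos,stop_pos,text,pos,len);
 do while(pos>0);
  cur=pos;                    /* start of the current bullet */
  call prxnext(pid,start_pos,stop_pos,text,pos,len);   /* start of the next bullet, 0 if none */
  id=id+1;
  if pos>0 then item=substr(text,cur,pos-cur);
  else item=substr(text,cur); /* last bullet: take the rest of the text */
  output;
 end;
 keep n id item;
run;
```

The trade-off: because punctuation is required after the bullet character, a one-letter word followed by a space (the article "A", for instance) is no longer treated as a bullet, but bare bullets such as `A` or `1` with no punctuation are not detected either. Any text before the first bullet is dropped, and a row with no bullet produces no output. Neither pattern resolves the ambiguity raised earlier in the thread when something like `A-` or `A:` also occurs inside the text itself.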
