Splitting the text based on brackets using do loop

keen_sas · Posted 05-03-2019 09:51 AM

Hi All

data have ;
text="The subject experienced [AESER] adverse event of [AESEV] intensity, reported as '[VBA]' ([LLT]).";
output;
text="[Samsung - the first major smartphone maker to release a foldable smartphone]";
output ;
text="School districts across the [COUNTRY] are on the cusp of integrating [new technology in K-12 classrooms] by such as [SOCIAL_MEDIA] use as early as preschool.exploring unexpected [CURRICULUM] such as Facebook/Twitter use as early as preschool.";
output;
run ;

I want to split the entire text into multiple variables based on the square brackets ([ ]). Each bracketed piece, including its opening and closing brackets, should land in its own variable, as shown below. I tried a DO loop with an array, but something is missing in the loop. Can anyone suggest how to perform this split?

VAR1	VAR2	VAR3	VAR4	VAR5	VAR6	VAR7	VAR8
The subject experienced	[AESER]	adverse event of	[AESEV]	intensity, reported as	[VBA]'	([LLT])
[Samsung - the first major smartphone maker to release a foldable smartphone]
School districts across the	[COUNTRY]	are on the cusp of integrating	[new technology in K-12 classrooms]	by such as	[SOCIAL_MEDIA]	use as early as preschool.exploring unexpected	such as Facebook/Twitter use as early as preschool.

ballardw · Posted 05-03-2019 10:45 AM


You have some requirements that are moderately odd: why does [VBA]' keep the ' as part of the value, and why does ([LLT]) keep both parentheses?

Perhaps if you describe how the result will actually be used, we can make some additional suggestions.

This may help you get started, but your rules for when to include the [] as part of the value need a lot of explanation.

Note: your example data set cuts off the third example row because you did not define a length for TEXT, so the first assignment sets the length of the variable.

data have ;
length text $ 1000;
text="The subject experienced [AESER] adverse event of [AESEV] intensity, reported as '[VBA]' ([LLT]).";
output;
text="[Samsung - the first major smartphone maker to release a foldable smartphone]";
output ;
text="School districts across the [COUNTRY] are on the cusp of integrating [new technology in K-12 classrooms] by such as [SOCIAL_MEDIA] use as early as preschool.exploring unexpected [CURRICULUM] such as Facebook/Twitter use as early as preschool.";
output;
run;
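
The truncation mentioned above can be reproduced in isolation. This is only a minimal sketch; the variable names SHORT and LEN are made up for illustration and are not part of the original data.

```sas
data _null_;
  /* The first assignment fixes the length of SHORT at 3 characters. */
  short = "abc";
  /* A longer value assigned later is silently truncated to that length. */
  short = "abcdefghij";
  len = lengthc(short);
  put short= len=;   /* writes short=abc len=3 to the log */
run;
```

Declaring the length up front with a LENGTH statement, as in the data step above, avoids the problem.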

data need (keep= row phrase);
  set have;
  row = _n_;
  length phrase $ 100;

  do i= 1 to countw(text,'[]');
   phrase = scan(text,i,'[]');
   output;
  end;
run;

proc transpose data=need
     out=trans prefix=var
     ;
by row;
var phrase;
run;
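
To see what the COUNTW/SCAN loop does to a single value, a small log-only sketch may help. The sample string is a shortened piece of the first row, and N is a name introduced here for illustration. Both functions treat every [ and ] as a delimiter, so the brackets themselves never appear in PHRASE.

```sas
data _null_;
  length phrase $ 100;
  text = "reported as '[VBA]' ([LLT]).";
  /* Count the pieces separated by [ or ] */
  n = countw(text, '[]');
  put n=;
  do i = 1 to n;
    /* Pull out the i-th piece; the delimiters are consumed */
    phrase = scan(text, i, '[]');
    put i= phrase=;
  end;
run;
```

Because the delimiters are dropped, there is no way afterwards to tell a bracketed piece such as VBA from plain text between brackets.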

keen_sas · Posted 05-06-2019 12:32 AM

Thank you @ballardw @noling @andreas_lds for your quick responses. The inconsistencies in the output I displayed were typos on my part. The concept is to split the text based on [] (square brackets), and all three solutions work fine.

The @ballardw solution is intuitive, but I have one question. Since the code uses the SCAN function, phrase = scan(text,i,'[]');, the output excludes the square brackets. The phrase should include the square brackets, as shown below.

Row	Phrase	Required Output
1	The subject experienced	The subject experienced
1	AESER	[AESER]
1	adverse event of	adverse event of
1	AESEV	[AESEV]
1	intensity, reported as '	intensity, reported as '
1	VBA	[VBA]
1	' (	' (
1	LLT	[LLT]
1	).	).
2	Samsung - the first major smartphone maker to release a foldable smartphone	[Samsung - the first major smartphone maker to release a foldable smartphone]

The solutions provide a good start to go ahead.
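
For the bracket-inclusive output in the table above, one possible approach is to walk through the text with FIND instead of SCAN, so that the brackets are still available when each piece is cut out. This is a sketch under assumptions, not a tested production solution: it assumes the HAVE data set with TEXT defined as $1000, assumes brackets are never nested, and the names NEED_BRACKETS, POS, LB, RB and TEXTLEN are invented for illustration.

```sas
data need_brackets (keep=row phrase);
  set have;
  row = _n_;
  length phrase $ 1000;
  pos = 1;                           /* next unread character of TEXT */
  textlen = length(text);
  do while (pos <= textlen);
    lb = find(text, '[', pos);       /* next opening bracket at or after POS */
    if lb > 0 then rb = find(text, ']', lb);
    else rb = 0;
    if lb = 0 or rb = 0 then do;
      /* no complete bracket pair left: the rest is plain text */
      phrase = strip(substr(text, pos));
      if not missing(phrase) then output;
      leave;
    end;
    if lb > pos then do;
      /* plain text in front of the bracket */
      phrase = strip(substr(text, pos, lb - pos));
      if not missing(phrase) then output;
    end;
    /* the bracketed piece, brackets included */
    phrase = substr(text, lb, rb - lb + 1);
    output;
    pos = rb + 1;
  end;
run;
```

The result has one row per piece, so the same PROC TRANSPOSE step with BY ROW and VAR PHRASE can turn it into VAR1, VAR2 and so on.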

ballardw · Posted 05-06-2019 11:08 AM

I can sort of see a potential use for the single "word" brackets. But without a real-world description of why that last entire phrase is in [] when everything else is a single "word", I am not going to spend any time trying to parse an incomplete problem description. The earlier examples, where ([LLT]) and ' were included with the brackets, did not help.

noling · Posted 05-03-2019 10:55 AM

data have ;
length text $1000;
text="The subject experienced [AESER] adverse event of [AESEV] intensity, reported as '[VBA]' ([LLT]).";
output;
text="[Samsung - the first major smartphone maker to release a foldable smartphone]";
output ;
text="School districts across the [COUNTRY] are on the cusp of integrating [new technology in K-12 classrooms] by such as [SOCIAL_MEDIA] use as early as preschool.exploring unexpected [CURRICULUM] such as Facebook/Twitter use as early as preschool.";
output;
run ;

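data want;
	set have;
	length output_text $1000 char $1;
	array vars{100} $1000 var1-var100;
	output_text="";
	j=0; *index of the current output variable;
	k=0; *write position within output_text;
	do i = 1 to length(text); *loop across the original text;
		k + 1;
		char=substr(text,i,1);
		
		*append the current character to the piece being built;
		substr(output_text,k,1)=char;
		*an opening bracket ends the plain text collected so far;
		if char = "[" and output_text ne "[" then do;
			j+1;
			vars{j}=substr(output_text,1,length(output_text)-1);
			output_text=strip("[");
			k=1;
		end;
		*a closing bracket ends the bracketed piece;
		if char = "]" then do;
			j+1;
			k=0;
			vars{j}=strip(output_text);
			output_text="";
		end;
	end;
	*keep any text that follows the last closing bracket;
	if not missing(output_text) then do;
		j+1;
		vars{j}=strip(output_text);
	end;
run;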

Ballardw's code is probably more intuitive.


andreas_lds · Posted 05-03-2019 12:26 PM

Some of the values seem to be inconsistent:

In the first line: why is the ' after [VBA] included in var6, but not the ' before the opening square bracket?
In the third line: why is [CURRICULUM] not in a variable?

Here is another suggestion to solve the issue:

data work.intermediate;
    set have;
    
    length
        id 8
        part $ 400
        _start _stop _pos _len 8
    ;
    
    drop _: text;
    
    id = _n_;
    rx = prxparse('/(.*?)(\(?\[.+?\]\)?)/');
    _start = 1;
    _stop = length(text);
    
    put _stop=;
    
    call prxnext(rx, _start, _stop, text, _pos, _len);
    
    do while (_pos > 0);
        /* the pattern has two capture groups: text before the bracket, then the bracket itself */
        do i = 1 to 2;
            part = prxposn(rx, i, text);
            if not missing(part) then output;
        end;   
        call prxnext(rx, _start, _stop, text, _pos, _len);
    end;
    
    put _start= _stop= _len= _pos=;
    
    if _start < _stop then do;
        part = substr(text, _start);
        output;
    end;
    
run;


proc transpose data=work.intermediate out=work.want(drop=id _name_) prefix=var;
    by id;
    var part;
run;
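
To make the regular expression above easier to follow, here is a small log-only sketch that applies the same pattern to a shortened piece of the first row. The variable names BEFORE and BRACKET are introduced only for this illustration. The first capture group, (.*?), lazily collects the plain text in front of a bracket; the second group, (\(?\[.+?\]\)?), collects the bracketed piece together with an optional surrounding pair of parentheses.

```sas
data _null_;
  length before bracket $ 100;
  text = "reported as '[VBA]' ([LLT]).";
  rx = prxparse('/(.*?)(\(?\[.+?\]\)?)/');
  start = 1;
  stop = length(text);
  /* find the first match, then keep searching from where it ended */
  call prxnext(rx, start, stop, text, pos, len);
  do while (pos > 0);
    before  = prxposn(rx, 1, text);   /* text in front of the bracket */
    bracket = prxposn(rx, 2, text);   /* bracketed piece, with optional parentheses */
    put pos= len= before= bracket=;
    call prxnext(rx, start, stop, text, pos, len);
  end;
run;
```

Because the optional \(? and \)? sit inside the second group, a value such as ([LLT]) keeps its parentheses, which matches the VAR7 value in the original request rather than the later bracket-only table.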
