keen_sas
Quartz | Level 8

Hi All

 

data have ;
text="The subject experienced [AESER] adverse event of [AESEV] intensity, reported as '[VBA]' ([LLT]).";
output;
text="[Samsung - the first major smartphone maker to release a foldable smartphone]";
output ;
text="School districts across the [COUNTRY] are on the cusp of integrating [new technology in K-12 classrooms] by such as [SOCIAL_MEDIA] use as early as preschool.exploring unexpected [CURRICULUM] such as Facebook/Twitter use as early as preschool.";
output;
run ;

 

I want to split the entire text into multiple variables based on the square brackets ([ ]). Both the opening and closing brackets, along with the text between them, should be kept in each variable, as shown below. I tried a DO loop with an array, but something is missing in the loop. Can anyone suggest how to perform this split?

 

VAR1 | VAR2 | VAR3 | VAR4 | VAR5 | VAR6 | VAR7 | VAR8
The subject experienced | [AESER] | adverse event of | [AESEV] | intensity, reported as | [VBA]' | ([LLT]) |
[Samsung - the first major smartphone maker to release a foldable smartphone] | | | | | | |
School districts across the | [COUNTRY] | are on the cusp of integrating | [new technology in K-12 classrooms] | by such as | [SOCIAL_MEDIA] | use as early as preschool.exploring unexpected | such as Facebook/Twitter use as early as preschool.

5 REPLIES
ballardw
Super User

Some of your requirements are moderately odd: why does [VBA]' include the ' as part of the value, and why does ([LLT]) include both parentheses?

Perhaps if you describe how the result will actually be used, we can make some additional suggestions.

This may help you get started, but your rules for when to include the [] as part of the value need a lot more explanation.

Note that your example data set truncates the third example row: you did not define a length for TEXT, so the first assignment sets the length of the variable.

 

data have ;
length text $ 1000;
text="The subject experienced [AESER] adverse event of [AESEV] intensity, reported as '[VBA]' ([LLT]).";
output;
text="[Samsung - the first major smartphone maker to release a foldable smartphone]";
output ;
text="School districts across the [COUNTRY] are on the cusp of integrating [new technology in K-12 classrooms] by such as [SOCIAL_MEDIA] use as early as preschool.exploring unexpected [CURRICULUM] such as Facebook/Twitter use as early as preschool.";
output;
run;

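data need (keep= row phrase);
  set have;
  row = _n_;                     /* remember which input row each phrase came from */
  length phrase $ 100;

  /* treat both [ and ] as delimiters and write one observation per piece */
  do i= 1 to countw(text,'[]');
   phrase = scan(text,i,'[]');   /* SCAN drops the delimiters, so the brackets are not kept */
   output;
  end;
run;

/* one row per input record again, with the pieces in var1, var2, ... */
proc transpose data=need
     out=trans prefix=var
     ;
by row;
var phrase;
run;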
keen_sas
Quartz | Level 8

Thank you @ballardw, @noling and @andreas_lds for your quick responses. The inconsistencies in the output I displayed were typos on my part. The idea is to split the text on the square brackets ([ ]), and all three solutions work fine.

 

The @ballardw solution is intuitive, but I have one question. Because the code uses the SCAN function, phrase = scan(text,i,'[]');, the output excludes the square brackets. The phrase should include the square brackets, as shown below.

Row | Phrase | Required Output
1 | The subject experienced | The subject experienced
1 | AESER | [AESER]
1 | adverse event of | adverse event of
1 | AESEV | [AESEV]
1 | intensity, reported as ' | intensity, reported as '
1 | VBA | [VBA]
1 | ' ( | ' (
1 | LLT | [LLT]
1 | ). | ).
2 | Samsung - the first major smartphone maker to release a foldable smartphone | [Samsung - the first major smartphone maker to release a foldable smartphone]

 

The solutions provide a good starting point.
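
One possible way to keep the brackets, sketched here rather than taken from any of the replies, is to match the pieces with a Perl regular expression instead of SCAN. The pattern matches either one complete [...] group or a run of characters that contains no bracket, and CALL PRXNEXT walks through the matches from left to right. The sketch assumes the HAVE data set from this thread with TEXT defined as $ 1000, and it assumes the brackets are balanced and never nested. The names need2, rx, start, stop, pos and len, and the phrase length of 200, are illustrative choices only.

```sas
/* Sketch only: assumes balanced, non-nested square brackets in TEXT. */
data need2 (keep=row phrase);
  set have;
  length phrase $ 200;          /* assumed long enough for the sample rows */
  row = _n_;

  /* first alternative: one whole [...] group;
     second alternative: a run of text containing no bracket */
  rx = prxparse('/\[[^\]]*\]|[^\[\]]+/');

  start = 1;
  stop  = length(text);

  call prxnext(rx, start, stop, text, pos, len);
  do while (pos > 0);
    phrase = substr(text, pos, len);   /* brackets stay part of the piece */
    output;
    call prxnext(rx, start, stop, text, pos, len);
  end;
run;
```

The result has one observation per piece, so the same PROC TRANSPOSE step (by row; var phrase;) can turn it into var1, var2 and so on. A stray bracket without a partner is skipped by this pattern, so unbalanced input would need extra handling.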

ballardw
Super User

I can sort of see a potential use for the single "word" brackets. But without a real-world description of why that last entire phrase is in [] when everything else is a single "word", I am not going to spend any time trying to parse an incomplete problem description. The earlier examples, with ([LLT]) and the ' included with the brackets, did not help.

noling
SAS Employee
data have ;
length text $1000;
text="The subject experienced [AESER] adverse event of [AESEV] intensity, reported as '[VBA]' ([LLT]).";
output;
text="[Samsung - the first major smartphone maker to release a foldable smartphone]";
output ;
text="School districts across the [COUNTRY] are on the cusp of integrating [new technology in K-12 classrooms] by such as [SOCIAL_MEDIA] use as early as preschool.exploring unexpected [CURRICULUM] such as Facebook/Twitter use as early as preschool.";
output;
run ;

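data want;
	set have;
	length output_text $1000 char $1;
	array vars{100} $1000 var1-var100;
	output_text="";
	j=0; *number of output var;
	k=0; *char for output string;
	do i = 1 to length(text); *loop across original text;
		k + 1;
		char=substr(text,i,1);
		
		substr(output_text,k,1)=strip(char);
		*an opening bracket ends the plain text collected so far;
		if char = "[" and output_text ne "[" then do;
			j+1;
			vars{j}=substr(output_text,1,length(output_text)-1);
			output_text=strip("[");
			k=1;
		end;
		*a closing bracket completes a bracketed piece;
		if char = "]" then do;
			j+1;
			k=0;
			vars{j}=strip(output_text);
			output_text="";
		end;
	end;
	*keep any text that follows the last closing bracket;
	if not missing(output_text) then do;
		j+1;
		vars{j}=output_text;
	end;
	drop i j k char output_text;
run;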

Ballardw's code is probably more intuitive.



andreas_lds
Jade | Level 19

Some of the values seem to be inconsistent:

  • In the first line: why is the ' after [VBA] included in var6, but not the ' before the opening square bracket?
  • In the third line: why is [CURRICULUM] not in a variable?

 

Here is another suggestion to solve the issue:

data work.intermediate;
    set have;
    
    length
        id 8
        part $ 400
        _start _stop _pos _len 8
    ;
    
    drop _: text;
    
    id = _n_;
    rx = prxparse('/(.*?)(\(?\[.+?\]\)?)/');
    _start = 1;
    _stop = length(text);
    
    put _stop=;
    
    call prxnext(rx, _start, _stop, text, _pos, _len);
    
    do while (_pos > 0);
        do i = 1 to 2; /* the pattern defines two capture groups */
            part = prxposn(rx, i, text);
            if not missing(part) then output;
        end;   
        call prxnext(rx, _start, _stop, text, _pos, _len);
    end;
    
    put _start= _stop= _len= _pos=;
    
    if _start <= _stop then do; /* also keeps a one-character tail such as the final "." */
        part = substr(text, _start);
        output;
    end;
    
run;


proc transpose data=work.intermediate out=work.want(drop=id _name_) prefix=var;
    by id;
    var part;
run;

