topic Re: String Parsing - how to extract words between comma and dot in SAS Programming

String Parsing - how to extract words between comma and dot

J111 — Wed, 22 Jan 2025 09:13:43 GMT

Hello,

Please find below data Have and data Want.

Data Have ;
input data $100. ;
cards ;
data/dataflow/BPS.STQR/EZR/1.0/RER_QWE_MED.D.AUD.MED, RER_QWA_MED.D.CAD.WOW.WOW.OF00, QWZ_PLO_WOW.D.POP
;
Run ;

Data Want ;
input x1 $60. ;
cards ;
RER_QWE_MED , RER_QWA_MED , QWZ_PLO_WOW
;
Run ;

We would like to extract X1 from the data according to the following rules for Type A and Type B:

Type A

1. it appears after the last back slash in the sentence

2. and also before the first dot following this word (if there is a dot)

3. and this word contains only capital letters or underscore in it (without small letters)

4. if such a word is not found write 'NOWORD'

Type B

here we have to extract more than one word and the logic is as follows:

1. it appears after the last back slash in the sentence

2. continue in the same row until we find a comma and then take the word which is between the comma and the next first dot.

recursively look for all the words between comma and next dot (as long as we are after the last backslash in the row).

the purpose is to clean relevant words from garbage, for example '.CAD.' is garbage.

for type A we have the solution

x1 = scan(scan(data,-1,"/"),1,'.') ; *after last back slash and before first period ;

if count(x1,lowcase(x1)) = 1 then x1 = 'NOWORD' ;

We need a solution for type B

best solution would handle both type A and type B in the same row.

------------------------------------------------------

BTW Next step we split the want data by delims.

Data Audit_Split ;

set Audit_short ;

delims = '+,' ;

Array s_ [40] $55 s1-s40 ;

do i = 1 to 40 ;

s_[i] = left(scan(string, i, delims)) ;

end ;

drop delims i ;

Run ;

----------------------------------------------------------------------------------------------------------

Thanks in advance.

Re: String Parsing - how to extract words between comma and dot

J111 — Wed, 22 Jan 2025 09:04:22 GMT

Hello,

Seems like this can work for TYPE B,

Data next ;

set have ;

xx = scan(data,-1,"/") ;

Array x_ [40] $55 x1-x40 ;

do i = 1 to 40 ;

x_[i] = scan(scan(xx,-i,','),1,'.') ; ;

end ;

drop i xx data ;

Run ;

Re: String Parsing - how to extract words between comma and dot

Patrick — Wed, 22 Jan 2025 23:52:37 GMT

@J111
You've got already a solution so below just for your consideration.

You wrote: "best solution would handle both type A and type B in the same row".

I made in below code the assumption that also type B should only select terms that only contain upper case letters and the underscore. If that's the case then you can "merge" your two type into a single rule.

Below should return your desired result based on my understanding of your rules.

Data Have ;
	infile datalines4 truncover;
	input source_string $600. ;
datalines4 ;
/availability/dataflow/*/*/*/*/-
data/dataflow/BPS.STQR/WAB/1.0/NER_PGTABC
/data/dataflow/BPS.STQR/EZR/1.0/RER_QWE_MED.D.AUD.MED.MED.OF00,RER_QWA_MED.D.CAD.WOW.WOW.OF00,QWZ_PLO_WOW.D.POP.WOW.MED.OF00
NOSLASH.AT.ALL
/SLASH_bUT_mIxed_Case.B.C
;;;;
Run;

data want(drop=_:);
	set have;
	/* substring after last forward slash if any, else just whole string */
	_s=scan(source_string,-1,'/');
	
	/* within _s iterate for each substring that ends with a comma or is the last substring */
	do _i=1 to countc(_s,',')+1;
		length _word $50 words $200;
		/* extract the first sub-string that ends with a full stop else the whole string */
		_word=scan(scan(_s,_i,','),1,'.');
		/* check that _word only contains upper case characters or the underscore */
		if prxmatch('/^[A-Z_]+$/',strip(_word))=1 then
			do;
				words=catx(',',words,_word);
			end;
	end;
	if missing(words) then words='NOWORD';
run;

proc print data=want;
run;