Solved: Data Parsing - How to extract specific words from a sentence by rules

J111 · Posted 01-08-2025 04:53 AM

Hello,

Please find below data Have and data Want.

We would like to extract X1 from the data according to the following rules:

1. it appears after the last slash ('/') in the sentence

2. and before the first dot following this word (if there is a dot)

3. and this word contains only capital letters or underscores (no lowercase letters)

4. and there is no garbage after this word (such as '..')

5. if no word satisfies all of these rules, then write NOWORD

We would like to extract X2 from the data according to the following rules:

1. it always appears after BPS.STQR/ or BPS.STQR,

in other words, it follows BPS.STQR and is delimited by '/' or by ','

Thanks in advance

----------------------------------------------------------------------------------------------

Data have ;
input data $60. ;
cards ;
data/dataflow/BPS.STQR/WAB/1.0/NER_PGTABC
data/dataflow/wow/BPS.STQR,WAB,1.0/NER_QZW
data/dataflow/wow/BPS.STQR,WAB,1.0/NER_QZW ..
data/dataflow/wow/BPS.STQR,WAB,1.0/NER_QZW.ABCDEFG
/availability/dataflow/*/*/*/*/-
/availability

;
Run ;

Data Want ;
input x1 $10. x2 $7. ;
cards ;
NER_PGTABC WAB
NER_QZW WAB
NER_QZW WAB
NER_QZW WAB

NOWORD

;
Run ;

PaigeMiller · Posted 01-08-2025 07:40 AM

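data want;
    set have;
    /* position of the keyword; FIND returns 0 when it is absent */
    location = find(data,'BPS.STQR');
    /* skip the 8-character keyword and its delimiter, then take the first word */
    if location>0 then x2=scan(substr(data,location+9),1);
run;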

--
Paige Miller
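
As a hedged alternative to the FIND/SCAN approach above, the X2 rule from the question (the value follows BPS.STQR and is delimited by '/' or ',') can also be written as a single PRX pattern. This is only a sketch: the data set name want_prx, the variable re, and the $60 length for x2 are assumptions made for illustration, and the sample rows are copied from the Have data in the question.

```sas
/* Sample rows copied from the Have data set in the question */
data have;
    input data $60.;
    datalines;
data/dataflow/BPS.STQR/WAB/1.0/NER_PGTABC
data/dataflow/wow/BPS.STQR,WAB,1.0/NER_QZW
/availability
;
run;

data want_prx;
    set have;
    length x2 $60;   /* assumed length; the question reads x2 as $7. */
    retain re;
    /* Compile once: keyword, one '/' or ',' delimiter, then text up to the next delimiter */
    if _n_ = 1 then re = prxparse('/BPS\.STQR[\/,]([^\/,]+)/');
    /* Capture group 1 holds the word between the delimiters; x2 stays blank without a match */
    if prxmatch(re, data) then x2 = prxposn(re, 1, data);
    drop re;
run;
```

For the two sample rows that contain BPS.STQR this should give WAB, and the row without the keyword keeps a blank x2. Naming the delimiters in the pattern ties the code to the stated rule instead of relying on SCAN's default delimiter list.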


J111 · Posted 01-08-2025 06:11 AM

Hello,

It seems I have found a solution for calculating X1 (see data step test below).

I would appreciate your help with X2.

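Data test ;
set have ;
/* text after the last '/', then the part before the first '.' */
X1 = scan(scan(data,-1,"/"),1,'.') ;
/* a value unchanged by LOWCASE has no capital letters, so it is replaced */
if count(x1,lowcase(X1)) = 1 then X1 = 'NOWORD' ;
Run ;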

quickbluefish · Posted 01-08-2025 08:04 AM

I'm curious why you find a match for X1 in the 3rd row of your input dataset - doesn't that break the rule of 'no garbage' like '..' after the word? Or is garbage OK as long as there's a space preceding it? Also, could you define garbage? Any non A-Z or underscore? It does seem like you might get a more robust solution using one of the PRX* functions.
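
To illustrate the PRX* suggestion, here is a minimal sketch for X1. It assumes that "garbage" means anything other than A-Z or underscore following the word, and that a word directly followed by a lowercase letter or digit should count as no match. The names test_prx, segment, and re are hypothetical, and the sample rows are copied from the Have data in the question.

```sas
/* Sample rows copied from the Have data set in the question */
data have;
    input data $60.;
    datalines;
data/dataflow/BPS.STQR/WAB/1.0/NER_PGTABC
data/dataflow/wow/BPS.STQR,WAB,1.0/NER_QZW ..
data/dataflow/wow/BPS.STQR,WAB,1.0/NER_QZW.ABCDEFG
/availability
;
run;

data test_prx;
    set have;
    length x1 $60 segment $60;
    retain re;
    /* Compile once: a leading run of A-Z or underscore that ends at a word boundary */
    if _n_ = 1 then re = prxparse('/^([A-Z_]+)\b/');
    /* Rule 1: keep only the text after the last slash */
    segment = scan(data, -1, '/');
    /* Rules 2-4: the capture stops before a dot, blank, or other trailing characters */
    if prxmatch(re, segment) then x1 = prxposn(re, 1, segment);
    else x1 = 'NOWORD';   /* Rule 5: nothing matched */
    drop re segment;
run;
```

With these four rows the sketch should return NER_PGTABC, NER_QZW, NER_QZW, and NOWORD. The \b boundary makes a mixed-case value such as ABCdef fail as a whole, which a check based only on LOWCASE would not catch.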

J111 · Posted 01-08-2025 08:57 AM

A little clarification

The purpose is to strip the garbage from the words,

as long as they consist of uppercase letters or underscores.

Thanks

