Desktop productivity for business analysts and programmers

Find Keywords or Phrases inside a block of unstructured text

Reply
Contributor
Posts: 65

Find Keywords or Phrases inside a block of unstructured text

[ Edited ]

Hi Forum,

 

I have a list of several hundred Keywords & short phrases 2 or 3 words together.

 

I also have a long list of unstructured text where I want to know if the Keywords & phrases appear exactly as I've got in the list.

 

If they do appear, I want to export the ID field & then export the Keywords/Phrases out delimited by a comma on the same row as the ID.

 

I want to export to a new table in the existing Project.

 

I've had a look at some functions but maybe a Data Step Loop would be the most effective method?

 

Can anyone assist with the Data Step code I would need to do this or suggest a better solution?

 

I'm using SAS EG Vers 7.13 64 bit.

 

Thanks 

Trusted Advisor
Posts: 1,459

Re: Find Keywords or Phrases inside a block of unstructured text

Look for hash iterate method within a datastep, using functions like find or index.

 

It would be more helpful if you post some test examples of your key words and of your text dataset.

Contributor
Posts: 65

Re: Find Keywords or Phrases inside a block of unstructured text

[ Edited ]

Hi Shmuel,

 

I've posted a quick Create table & Insert statement to help.

 

Note, I've added some punctuation to show that the data is a bit messy but I'm just after the words/phrases regardless of punctuation or case.

 

I want to find & extract the words/phrases in the 'TEXT TO FIND' table from the UNSTRUCTURED TEXT' Table.

 

I was thinking of a Loop that runs the first word/phrase from the 'TEXT TO FIND' table through all the records in the UNSTRUCTURED TEXT' Table extract the matches & do the same with the next word/phrase.  

 

Cheers

 

proc sql;
   create table RDW.UNSTRUCTURED_TEXT
       (Check_String char(3000));

insert into RDW.UNSTRUCTURED_TEXT
    values('The quick brown dog, jumped over-the lazy dog')
    values('The QUICK GREEN dog jumped over, the lazy dog')
    values('The quick snake slithered over  the lazy dog')
	values('The dog quickly Jumped over the lazy Fox')
	values('The Quick; brown_dog! jumped over the Lazy dog');

title 'RDW.UNSTRUCTURED_TEXT';

select *
   from RDW.UNSTRUCTURED_TEXT;

proc sql;
   create table RDW.TEXT_TO_FIND
       (Check_String char(300));

insert into RDW.TEXT_TO_FIND
    values('QUICK GREEN dog')
    values('quickly Jumped over')
    values('slithered over  the lazy dog')
	values('The Quick;')
	values('snake');

title 'RDW.TEXT_TO_FIND';

select *
   from RDW.TEXT_TO_FIND;

proc printto; run;

 

PROC Star
Posts: 1,143

Re: Find Keywords or Phrases inside a block of unstructured text

Contributor
Posts: 65

Re: Find Keywords or Phrases inside a block of unstructured text

Thanks TomKari,

 

It's good but a little beyond my data Step ken at the moment.

 

Could I get a quick example using my posted data?

 

I need to crawl before I can walk.  :  )

Super User
Posts: 11,104

Re: Find Keywords or Phrases inside a block of unstructured text

Your data sets did not have a ID variables so I don't know which one was supposed to. I put one into one of the sets, this approach would then just select the correct id. I'm not exactly sure what your desired output is. This may be a start.

 

proc sql;
   create table work.UNSTRUCTURED_TEXT
       (id int, Check_String char(3000));

   insert into work.UNSTRUCTURED_TEXT
    values(1,'The quick brown dog, jumped over-the lazy dog')
    values(2,'The QUICK GREEN dog jumped over, the lazy dog')
    values(3,'The quick snake slithered over  the lazy dog')
	 values(4,'The dog quickly Jumped over the lazy Fox')
	 values(5,'The Quick; brown_dog! jumped over the Lazy dog');
  select id,check_string 
  from work.UNSTRUCTURED_TEXT;
quit;

proc sql;
   create table work.TEXT_TO_FIND
       (Check_String char(300));

insert into work.TEXT_TO_FIND
    values('QUICK GREEN dog')
    values('quickly Jumped over')
    values('slithered over  the lazy dog')
	values('The Quick;')
	values('snake');

quit;

proc sql;
   create table want as
   select a.id,a.check_string, b.check_string as Found_text
   from work.UNSTRUCTURED_TEXT as a,work.TEXT_TO_FIND as b
   where index(a.check_string,strip(b.check_string))>0
   order by a.id;
quit;

 

I will make zero claim for effeciency other than in code maitenance or readability.

Note: Putting YOUR libraries into code can be very problematic...

Contributor
Posts: 65

Re: Find Keywords or Phrases inside a block of unstructured text

Thanks ballardw,

 

Very useful!

PROC Star
Posts: 1,143

Re: Find Keywords or Phrases inside a block of unstructured text

I think this is pretty close:

 

data PhraseList;

input SearchPhrase $50.;

cards4;

QUICK GREEN dog

quickly Jumped over

slithered over the lazy dog

The Quick;

snake

;;;;

run;

data Have;

input Comment $50.;

SeqNo = _n_;

cards4;

The quick brown dog, jumped over-the lazy dog

The QUICK GREEN dog jumped over, the lazy dog

The quick snake slithered over the lazy dog

The dog quickly Jumped over the lazy Fox

The Quick; brown_dog! jumped over the Lazy dog

;;;;

run;

data Inter1(keep=SeqNo SearchPhrase);

if _n_ = 1 then do;

if 0 then

set PhraseList;

declare hash P(dataset:'PhraseList', multidata:'y');

P.definekey('SearchPhrase');

P.definedone();

end;

declare hiter iter("P");

set Have;

rc = iter.first();

do while (rc=0);

if find(Comment, SearchPhrase, 't') then output;

rc = iter.next();

end;

run;

data Want(keep=SeqNo SearchResults);

length SearchResults $10000;

retain SearchResults;

set Inter1;

by SeqNo;

if first.SeqNo

then SearchResults = SearchPhrase;

else SearchResults = catx(",",SearchResults, SearchPhrase);

if last.SeqNo then output;

run;

proc datasets lib=work nolist;

delete PhraseList;

delete Have;

delete Inter1;

quit;

Contributor
Posts: 65

Re: Find Keywords or Phrases inside a block of unstructured text

Thanks TomKari,

 

That's great

 

Much appreciated!

Ask a Question
Discussion stats
  • 8 replies
  • 241 views
  • 0 likes
  • 4 in conversation