- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Forum,
I have a list of several hundred Keywords & short phrases 2 or 3 words together.
I also have a long list of unstructured text where I want to know if the Keywords & phrases appear exactly as I've got in the list.
If they do appear, I want to export the ID field & then export the Keywords/Phrases out delimited by a comma on the same row as the ID.
I want to export to a new table in the existing Project.
I've had a look at some functions but maybe a Data Step Loop would be the most effective method?
Can anyone assist with the Data Step code I would need to do this or suggest a better solution?
I'm using SAS EG Vers 7.13 64 bit.
Thanks
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Look for hash iterate method within a datastep, using functions like find or index.
It would be more helpful if you post some test examples of your key words and of your text dataset.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Shmuel,
I've posted a quick Create table & Insert statement to help.
Note, I've added some punctuation to show that the data is a bit messy but I'm just after the words/phrases regardless of punctuation or case.
I want to find & extract the words/phrases in the 'TEXT TO FIND' table from the UNSTRUCTURED TEXT' Table.
I was thinking of a Loop that runs the first word/phrase from the 'TEXT TO FIND' table through all the records in the UNSTRUCTURED TEXT' Table extract the matches & do the same with the next word/phrase.
Cheers
proc sql; create table RDW.UNSTRUCTURED_TEXT (Check_String char(3000)); insert into RDW.UNSTRUCTURED_TEXT values('The quick brown dog, jumped over-the lazy dog') values('The QUICK GREEN dog jumped over, the lazy dog') values('The quick snake slithered over the lazy dog') values('The dog quickly Jumped over the lazy Fox') values('The Quick; brown_dog! jumped over the Lazy dog'); title 'RDW.UNSTRUCTURED_TEXT'; select * from RDW.UNSTRUCTURED_TEXT; proc sql; create table RDW.TEXT_TO_FIND (Check_String char(300)); insert into RDW.TEXT_TO_FIND values('QUICK GREEN dog') values('quickly Jumped over') values('slithered over the lazy dog') values('The Quick;') values('snake'); title 'RDW.TEXT_TO_FIND'; select * from RDW.TEXT_TO_FIND; proc printto; run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This may give you some ideas:
http://support.sas.com/resources/papers/proceedings14/1717-2014.pdf
Tom
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks TomKari,
It's good but a little beyond my data Step ken at the moment.
Could I get a quick example using my posted data?
I need to crawl before I can walk. : )
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Your data sets did not have a ID variables so I don't know which one was supposed to. I put one into one of the sets, this approach would then just select the correct id. I'm not exactly sure what your desired output is. This may be a start.
proc sql; create table work.UNSTRUCTURED_TEXT (id int, Check_String char(3000)); insert into work.UNSTRUCTURED_TEXT values(1,'The quick brown dog, jumped over-the lazy dog') values(2,'The QUICK GREEN dog jumped over, the lazy dog') values(3,'The quick snake slithered over the lazy dog') values(4,'The dog quickly Jumped over the lazy Fox') values(5,'The Quick; brown_dog! jumped over the Lazy dog'); select id,check_string from work.UNSTRUCTURED_TEXT; quit; proc sql; create table work.TEXT_TO_FIND (Check_String char(300)); insert into work.TEXT_TO_FIND values('QUICK GREEN dog') values('quickly Jumped over') values('slithered over the lazy dog') values('The Quick;') values('snake'); quit; proc sql; create table want as select a.id,a.check_string, b.check_string as Found_text from work.UNSTRUCTURED_TEXT as a,work.TEXT_TO_FIND as b where index(a.check_string,strip(b.check_string))>0 order by a.id; quit;
I will make zero claim for effeciency other than in code maitenance or readability.
Note: Putting YOUR libraries into code can be very problematic...
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks ballardw,
Very useful!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I think this is pretty close:
data PhraseList;
input SearchPhrase $50.;
cards4;
QUICK GREEN dog
quickly Jumped over
slithered over the lazy dog
The Quick;
snake
;;;;
run;
data Have;
input Comment $50.;
SeqNo = _n_;
cards4;
The quick brown dog, jumped over-the lazy dog
The QUICK GREEN dog jumped over, the lazy dog
The quick snake slithered over the lazy dog
The dog quickly Jumped over the lazy Fox
The Quick; brown_dog! jumped over the Lazy dog
;;;;
run;
data Inter1(keep=SeqNo SearchPhrase);
if _n_ = 1 then do;
if 0 then
set PhraseList;
declare hash P(dataset:'PhraseList', multidata:'y');
P.definekey('SearchPhrase');
P.definedone();
end;
declare hiter iter("P");
set Have;
rc = iter.first();
do while (rc=0);
if find(Comment, SearchPhrase, 't') then output;
rc = iter.next();
end;
run;
data Want(keep=SeqNo SearchResults);
length SearchResults $10000;
retain SearchResults;
set Inter1;
by SeqNo;
if first.SeqNo
then SearchResults = SearchPhrase;
else SearchResults = catx(",",SearchResults, SearchPhrase);
if last.SeqNo then output;
run;
proc datasets lib=work nolist;
delete PhraseList;
delete Have;
delete Inter1;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks TomKari,
That's great
Much appreciated!