03-05-2017 10:21 PM - edited 03-05-2017 10:23 PM
I have a list of several hundred Keywords & short phrases 2 or 3 words together.
I also have a long list of unstructured text where I want to know if the Keywords & phrases appear exactly as I've got in the list.
If they do appear, I want to export the ID field & then export the Keywords/Phrases out delimited by a comma on the same row as the ID.
I want to export to a new table in the existing Project.
I've had a look at some functions but maybe a Data Step Loop would be the most effective method?
Can anyone assist with the Data Step code I would need to do this or suggest a better solution?
I'm using SAS EG Vers 7.13 64 bit.
03-05-2017 10:45 PM
Look for hash iterate method within a datastep, using functions like find or index.
It would be more helpful if you post some test examples of your key words and of your text dataset.
03-05-2017 11:33 PM - edited 03-05-2017 11:37 PM
I've posted a quick Create table & Insert statement to help.
Note, I've added some punctuation to show that the data is a bit messy but I'm just after the words/phrases regardless of punctuation or case.
I want to find & extract the words/phrases in the 'TEXT TO FIND' table from the UNSTRUCTURED TEXT' Table.
I was thinking of a Loop that runs the first word/phrase from the 'TEXT TO FIND' table through all the records in the UNSTRUCTURED TEXT' Table extract the matches & do the same with the next word/phrase.
proc sql; create table RDW.UNSTRUCTURED_TEXT (Check_String char(3000)); insert into RDW.UNSTRUCTURED_TEXT values('The quick brown dog, jumped over-the lazy dog') values('The QUICK GREEN dog jumped over, the lazy dog') values('The quick snake slithered over the lazy dog') values('The dog quickly Jumped over the lazy Fox') values('The Quick; brown_dog! jumped over the Lazy dog'); title 'RDW.UNSTRUCTURED_TEXT'; select * from RDW.UNSTRUCTURED_TEXT; proc sql; create table RDW.TEXT_TO_FIND (Check_String char(300)); insert into RDW.TEXT_TO_FIND values('QUICK GREEN dog') values('quickly Jumped over') values('slithered over the lazy dog') values('The Quick;') values('snake'); title 'RDW.TEXT_TO_FIND'; select * from RDW.TEXT_TO_FIND; proc printto; run;
03-06-2017 05:02 PM
It's good but a little beyond my data Step ken at the moment.
Could I get a quick example using my posted data?
I need to crawl before I can walk. : )
03-06-2017 07:04 PM
Your data sets did not have a ID variables so I don't know which one was supposed to. I put one into one of the sets, this approach would then just select the correct id. I'm not exactly sure what your desired output is. This may be a start.
proc sql; create table work.UNSTRUCTURED_TEXT (id int, Check_String char(3000)); insert into work.UNSTRUCTURED_TEXT values(1,'The quick brown dog, jumped over-the lazy dog') values(2,'The QUICK GREEN dog jumped over, the lazy dog') values(3,'The quick snake slithered over the lazy dog') values(4,'The dog quickly Jumped over the lazy Fox') values(5,'The Quick; brown_dog! jumped over the Lazy dog'); select id,check_string from work.UNSTRUCTURED_TEXT; quit; proc sql; create table work.TEXT_TO_FIND (Check_String char(300)); insert into work.TEXT_TO_FIND values('QUICK GREEN dog') values('quickly Jumped over') values('slithered over the lazy dog') values('The Quick;') values('snake'); quit; proc sql; create table want as select a.id,a.check_string, b.check_string as Found_text from work.UNSTRUCTURED_TEXT as a,work.TEXT_TO_FIND as b where index(a.check_string,strip(b.check_string))>0 order by a.id; quit;
I will make zero claim for effeciency other than in code maitenance or readability.
Note: Putting YOUR libraries into code can be very problematic...
03-07-2017 04:50 PM
I think this is pretty close:
input SearchPhrase $50.;
QUICK GREEN dog
quickly Jumped over
slithered over the lazy dog
input Comment $50.;
SeqNo = _n_;
The quick brown dog, jumped over-the lazy dog
The QUICK GREEN dog jumped over, the lazy dog
The quick snake slithered over the lazy dog
The dog quickly Jumped over the lazy Fox
The Quick; brown_dog! jumped over the Lazy dog
data Inter1(keep=SeqNo SearchPhrase);
if _n_ = 1 then do;
if 0 then
declare hash P(dataset:'PhraseList', multidata:'y');
declare hiter iter("P");
rc = iter.first();
do while (rc=0);
if find(Comment, SearchPhrase, 't') then output;
rc = iter.next();
data Want(keep=SeqNo SearchResults);
length SearchResults $10000;
then SearchResults = SearchPhrase;
else SearchResults = catx(",",SearchResults, SearchPhrase);
if last.SeqNo then output;
proc datasets lib=work nolist;