- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
Thank you all for contributing the community. I have learnt a lot since I joined.
Here's another issue I encountered. I have two datasets:
data HAVE; input name $; datalines; John Diana Diane Sally Doug David Dianna ; run; data KEYWORDS; input keywords $; datalines; ou vi oh ck ; run;
I want to search KEYWORDS.keywords in HAVE.name and leave the HAVE observations only containing the keywords.
AND also show the keywords used in the searching process.
Here is what I want
data want; input name $ keywords $; datalines; John oh Doug ou David vi ; run;
Here is just an example, but my actual keywords are more than 600, so I cannot put them all in the code.
Would you please share some codes and thoughts, please?
Thank you so much!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This is probably going to be slow for very large data sets ... but it works
proc sql;
create table want as select name,keywords
from have as a,keywords as b
where find(a.name,b.keywords,'t')>0;
quit;
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for the code. It works well.
What if the keywords must be an individual word, instead of a part of the word, in the NAME
For example,
data HAVE; input name $char14.;; datalines; apple LLC appleLLC amazon Inc amazonInc apple Corp appleCorp amazonCorp ; run; data KEYWORDS; input keywords $; datalines; apple amazon ; run;
I want the KEYWORDS showing up in the NAME as an individual word (with space before or after), rather than within the NAME string as part of it.
data want; input name $char11. keywords $char10.; datalines; apple LLC apple apple Corp apple amazon Inc amazon ; run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@yanshuai wrote:
Thanks for the code. It works well.
What if the keywords must be an individual word, instead of a part of the word, in the NAME
For example,
data HAVE; input name $char14.;; datalines; apple LLC appleLLC amazon Inc amazonInc apple Corp appleCorp amazonCorp ; run; data KEYWORDS; input keywords $; datalines; apple amazon ; run;I want the KEYWORDS showing up in the NAME as an individual word (with space before or after), rather than within the NAME string as part of it.
data want; input name $char11. keywords $char10.; datalines; apple LLC apple apple Corp apple amazon Inc amazon ; run;
Seems like you didn't even try the code I provided with these different keywords. Please try it and you'll know the answer.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you do not want. to have the names with keywords that are integral to the same, then those names have to filtered out. Better to do it before the Proc SQL step.
The updated code will be like this
data HAVE;
input name $char14.;
if (find(trim(name),' '))>0;
datalines;
apple LLC
appleLLC
amazon Inc
amazonInc
apple Corp
appleCorp
amazonCorp
;
run;
data KEYWORDS;
input keywords $;
datalines;
apple
amazon
;
run;
proc sql;
create table want as select name,keywords
from have as a,keywords as b
where find(a.name,b.keywords,'t')>0;
;
quit;
The output will be
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
What do you want if a name contains more than 1 keyword?
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello @yanshuai
@PaigeMiller solutions does take care of the scenario with multiple occurrences of the keywords, is simple and comprehensible.
I have added more keywords like as follows
data HAVE;
input name $;
datalines;
John
Diana
Diane
Sally
Doug
David
Dianna
;
data KEYWORDS;
input keywords $;
datalines;
ou
vi
oh
ck
id
ia
an
;/* Following code by @PaigeMiller */
proc sql;
create table want as select name,keywords
from have as a,keywords as b
where find(a.name,b.keywords,'t')>0;
quit;
The output is like this
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for the code.
What if the keywords must be an individual word, instead of a part of the word, in the NAME
For example,
data HAVE; input name $char14.;; datalines; Apple LLC AppleLLC Amazon Inc AmazonInc Apple Corp AppleCorp AmazonCorp ; run; data KEYWORDS; input keywords $; datalines; apple amazon ; run;
I want the KEYWORDS showing up in the NAME as an individual word (with space before or after), rather than within the NAME string as part of it.
data want; input name $char11. keywords $char10.; datalines; apple LLC apple apple Corp apple amazon Inc amazon ; run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you never want more than one match per name, then you can put the list of keyword values into an array of character values. Loop through the array looking for each sequential character expression in name. Stop when one is found, or you've exhausted the list.
data HAVE;
input name $;
datalines;
John
Diana
Diane
Sally
Doug
David
Dianna
run;
data KEYWORDS;
input keywords $;
datalines;
ou
vi
oh
ck
run;
proc sql noprint;
select quote(trim(keywords)) into :keywordlist separated by ',' from keywords;
quit;
%put &=keywordlist;
data want (drop=i);
set have;
array kw {&sqlobs} $2 _temporary_ (&keywordlist);
do i=1 to dim(kw) until (find(name,trim(kw{i}))); end;
if i<=&sqlobs then keyword=kw{i};
run;
The macrovar keywordlist created by proc sql in the above case is "ou","vi","oh","ck", which initializes values in the array kw. The macrovar SQLOBS is the number of entities generated by the select statement in proc sql.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks. I love this FIND function.
What if the keywords must be an individual word, instead of a part of the word, in the NAME
For example,
data HAVE;
input name $char14.;;
datalines;
Apple LLC
AppleLLC
Amazon Inc
AmazonInc
Apple Corp
AppleCorp
AmazonCorp
;
run;
data KEYWORDS;
input keywords $;
datalines;
apple
amazon
;
run;
I want the KEYWORDS showing up in the NAME as an individual word (with space before or after), rather than within the NAME string as part of it.
data want; input name $char11. keywords $char10.; datalines; apple LLC apple apple Corp apple amazon Inc amazon ; run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Below two options how this could work.
data HAVE;
input name $;
datalines;
John
Diana
Diane
Sally
Doug
David
Dianna
;
data KEYWORDS;
input keywords $;
datalines;
ou
vi
oh
ck
id
;
/* example 1: return first match */
data want_1(drop=_:);
if _n_=1 then
do;
if 0 then set keywords(keep=keywords);
dcl hash h1(dataset:'keywords');
dcl hiter hh1('h1');
h1.defineKey('keywords');
h1.defineData('keywords');
h1.defineDone();
end;
call missing(of _all_);
set have;
_rc = hh1.first();
do while (_rc = 0);
if find(name,keywords,'it') then
do;
output;
leave;
end;
_rc = hh1.next();
end;
run;
/* example 2: return all matches */
data
want_2 (keep=row_id name match_flg)
match_2 (keep=row_id keywords);
;
if _n_=1 then
do;
if 0 then set keywords(keep=keywords);
dcl hash h1(dataset:'keywords');
dcl hiter hh1('h1');
h1.defineKey('keywords');
h1.defineData('keywords');
h1.defineDone();
end;
call missing(of _all_);
row_id=_n_;
match_flg=0;
set have;
_rc = hh1.first();
do while (_rc = 0);
if find(name,keywords,'it') then
do;
output match_2;
match_flg=1;
end;
_rc = hh1.next();
end;
output want_2;
/* if match_flg=1 then output want_2;*/
run;
proc sql;
select
l.*,
r.keywords
from want_2 l left join match_2 r
on l.row_id=r.row_id
order by row_id
;
quit;
SQL report using above code:
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
To accomodate the possibility of multiple keyword matches for a single name, you have to search for every keyword for every incoming name:
data HAVE;
input name $;
datalines;
John
Diana
Diane
Sally
Doug
David
Dianna
run;
data KEYWORDS;
input keywords $;
datalines;
ou
vi
oh
ck
run;
proc sql noprint;
select quote(trim(keywords)) into :keywordlist separated by ',' from keywords;
quit;
%put &=keywordlist;
data want (drop=i);
set have;
array kw {&sqlobs} $2 _temporary_ (&keywordlist);
do i=1 to dim(kw) ;
keyword=kw{i};
if find(name,trim(keyword)) then output;
end;
run;
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------