Hi
I need a way to identify specific information.
I have a dataset:
company_name company_id brand_id brand_name word_count
The brand_name contains either the brand itself or a catchphrase that promotes the brand.
ABC Corp 1234 A888888 Vallina Soda 2
ABC Corp 1234 A666666 Diet Soda Sugar Free 4
ABC Corp 1234 A555555 Share a Soda Moment 4
ABC Corp 1234 A333333 SodaRewards 1
ABC Corp 1234 A222222 Soda 1
ABC Corp 1234 A000000 iSoda 1
ABC Corp 1234 A999999 Soda Original 2
ABC Corp 1234 B888888 Drink 1
ABC Corp 1234 A888888 Drink Multi flavored 3
I don't have a list of brands, so I need a way to identify brand keyword (e.g. Soda, Drink are the keywords)
So for 1 word brand_name, two possibilities: brand itself (Soda) or word containing the brand (SodaRewards, iSoda).
2-word brand_names, two possibilities: 2-word brand (Best Quality, Higher Standards) or phrase containing 1-word brand (Vanilla Soda, Soda Original).
3-word brand_names, three possibilities: 3-word brand (Best Fried Chicken), phrase containing 1-word brand (Drink Multi Flavored), or phrase containing 2-word brand (Best Quality Tools).
4-word brand_name, four possibilities: 4-word brand (The American's Finest Style), phrase containing 1-word brand, phrase containing 2-word brand, or phrase containing 3-word brand.
and so on....The longest string contains n words.
The search is done at the firm level.
Also, most brands are short, the longer strings are usually catchphrase promoting the brand. Catchphrase could contain the brand (obs.3) or not related to the brand. Those catchphrases not containing the brand can be treated as a separate brand.
I need a way to search for brand keywords using the algorithm, so that the output will be something like
company_name company_id brand_id brand_name keyword
ABC Corp 1234 A888888 Vallina Soda Soda
ABC Corp 1234 A666666 Diet Soda Sugar Free Soda
ABC Corp 1234 A555555 Share a Soda Moment Soda
ABC Corp 1234 A333333 SodaRewards Soda
ABC Corp 1234 A222222 Soda Soda
ABC Corp 1234 A000000 iSoda Soda
ABC Corp 1234 A999999 Soda Original Soda
ABC Corp 1234 B888888 Drink Drink
ABC Corp 1234 A888888 Drink Multi flavored Drink
Can some one guide how can i achieve this.
One way would be to add an array onto each record, namely an array that contains all combinations of your brand_name and, if you also want to be able to search regardless of order (e.g., both "vanilla soda" and "soda vanilla", permutations.
One way to do that is described in the thread:
In that thread, I posted a link to two SAS macros that can be used to get all combinations and permutations.
I, personally, would upcase all of the entries to that array.
Then you can simply do a search, upcasing the term and using the IN function.
Note: the link I provided to the combinations and permutations macros apparently is no longer on support.sas.com. However, you can still find them at: http://www.urz.uni-heidelberg.de/statistik/sas/doc/ts498-combperm.txt
Message was edited by: Arthur Tabachneck
You could create a list of keywords and use SQL with a CONTAINS condition:
data phrases;
length company_name $16 brand_id $10 brand_name $32;
input company_name & company_id brand_id $ brand_name & ;
datalines;
ABC Corp 1234 A888888 Vallina Soda
ABC Corp 1234 A666666 Diet Soda Sugar Free
ABC Corp 1234 A555555 Share a Soda Moment
ABC Corp 1234 A333333 SodaRewards
ABC Corp 1234 A222222 Soda
ABC Corp 1234 A000000 iSoda
ABC Corp 1234 A999999 Soda Original
ABC Corp 1234 B888888 Drink
ABC Corp 1234 A888888 Drink Multi flavored
;
data keywords;
input keyword :$12.;
datalines;
Soda
Drink
;
proc sql;
create table keyPhrases as
select p.*, k.keyword
from
phrases as p cross join
keywords as k
where upcase(p.brand_name) contains upcase(trim(k.keyword));
select * from keyPhrases;
quit;
PG
As pointed out, you need firstly define these keywords .
data phrases;
length company_name $16 brand_id $10 brand_name $32;
input company_name & company_id brand_id $ brand_name & ;
datalines;
ABC Corp 1234 A888888 Vallina Soda
ABC Corp 1234 A666666 Diet Soda Sugar Free
ABC Corp 1234 A555555 Share a Soda Moment
ABC Corp 1234 A333333 SodaRewards
ABC Corp 1234 A222222 Soda
ABC Corp 1234 A000000 iSoda
ABC Corp 1234 A999999 Soda Original
ABC Corp 1234 B888888 Drink
ABC Corp 1234 A888888 Drink Multi flavored
;
data want;
set phrases;
keyword=prxchange('s/.*(Soda|Drink).*/$1/io',-1,brand_name);
run;
Xia Keshan
Vaibhav,
You never did indicate whether PG's or KSharp's suggested code solved your problem. If they did, you really ought to mark the question as answered and give them points for providing correct and helpful responses.
While both PG's and KSharp's solutions are about as parsimonious as you can get, the question is whether they do what you need them to do.
The following is far from parsimonious, and may or may not do what you want. Both PG's and KSharps solutions could be expanded to do the same thing as the following code, but I've wanted to do something with the permutation's macro for quite some time.
The code, below, finds all permutations for the maximum number of strings found in the brand_name variable, and then adds all of those permutations into an array. Finally, a search macro lets one search the array to find any records that meet all of the strings identified in the macro's search parameter:
data have;
informat company_name $30.;
informat brand_id $7.;
informat brand_name $30.;
input company_name & company_id brand_id brand_name & word_count;
cards;
ABC Corp 1234 A888888 Vanilla Soda 2
ABC Corp 1234 A666666 Diet Soda Sugar Free 4
ABC Corp 1234 A555555 Share a Soda Moment 4
ABC Corp 1234 A333333 SodaRewards 1
ABC Corp 1234 A222222 Soda 1
ABC Corp 1234 A000000 iSoda 1
ABC Corp 1234 A999999 Soda Original 2
ABC Corp 1234 B888888 Drink 1
ABC Corp 1234 A888888 Drink Multi flavored 3
;
%macro permute(r) / parmbuff; /* the parmbuff option assigns */
%let i=2; /* the invocation parameter list to the */
%let things=; /* macro variable &syspbuff */
%do %while (%scan(&syspbuff,&i) ne ); /* scan the syspbuff */
%let p&i="%scan(&syspbuff,&i)"; /* to determine r */
%if &i=2 %then %let things=&&p&i; /* and count the number */
%else %let things=&things,&&p&i; /* of elements, n */
%let i=%eval(&i+1);
%end;
%let n=%eval(&i-2);
data permute;
drop i j copy;
wordcnt=&r.;
array check (*) r1-r&r; /* create a total of r */
%do m=1 %to &r; /* variables for looping */
do r&m = &things;
%end;
copy=0;
do i=2 to &r; /* look for duplicate items */
do j=1 to i-1; /* and keep the unique ones */
if check(j)=check(i) then copy+1;
end;
end;
if copy = 0 then output; /* writes to a SAS data set */
if copy = 0 then put r1-r&r; /* writes to the log */
%do m=1 %to &r;
end; /* end the r DO LOOPS */
%end;
run;
%mend permute;
proc sql noprint;
select max(word_count)
into :max_count
from have
;
quit;
data _null_;
length words $32767;
do j=1 to &max_count.;
words=catx(',',words,j);
end;
call symput('words',words);
run;
%macro getperms;
%do cntr=&max_count. %to 1 %by -1;
%let wordsplus=&cntr.,&words.;
%permute(&wordsplus.)
%if &cntr. eq &max_count. %then %do;
data need;
set permute;
run;
%end;
%else %do;
proc append base=need data=permute nowarn force;
run;
%end;
%end;
data _null_;
length excmd $1000;
set need nobs=numobs end=lastone;
by descending wordcnt;
array permpos(&max_count.) r:;
if _n_ eq 1 then do;
call execute('data want; set have;');
excmd=catt('array words(',numobs,') $255.;');
call execute(excmd);
excmd=catt('array word_in(',&max_count.,') $255.;');
call execute(excmd);
call execute("_n_=1; do while (scan(brand_name,_n_) ne '');");
call execute('word_in(_n_)=upcase(scan(brand_name,_n_));_n_+1;end;');
call execute('wordcntr=_n_-1; _n_=1;warraycnt=0;');
end;
highest=max(of permpos(*));
if first.wordcnt then do;
excmd=catx(' ','if',wordcnt,'le wordcntr then do;');
call execute(excmd);
end;
excmd=catx(' ','if',highest,'le wordcntr then do;');
call execute(excmd);
excmd=catt('warraycnt+1; words(warraycnt)=word_in(',permpos(1),');');
call execute(excmd);
do i=2 to wordcnt;
excmd=catt("words(warraycnt)=catx(' ',words(warraycnt),word_in(",permpos(i),'));');
call execute(excmd);
end;
call execute('end;');
if last.wordcnt then call execute('end;');
if lastone then call execute('run;');
run;
%mend;
%getperms
%macro search(searchstring);
data result (drop=words: word_: warraycnt wordcntr);
set want;
array strings(*) $255. words:;
do _n_=1 to dim(strings);
if upcase("&searchstring.") eq strings(_n_) then do;
output;
leave;
end;
end;
run;
%mend;
/*test cases*/
%search(soda)
%search(share soda moment)
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.