BookmarkSubscribeRSS Feed
vaibhavwadhera_gmail_com
Calcite | Level 5

Hi


I need a way to identify specific information.

I have a dataset:
company_name company_id brand_id brand_name word_count

The brand_name contains either the brand itself or a catchphrase that promotes the brand. 

ABC Corp  1234  A888888  Vallina Soda                                 2
ABC Corp  1234  A666666  Diet Soda Sugar Free                  4
ABC Corp  1234  A555555  Share a Soda Moment                 4
ABC Corp  1234  A333333  SodaRewards                              1
ABC Corp  1234  A222222  Soda                                            1
ABC Corp  1234  A000000  iSoda                                           1
ABC Corp  1234  A999999  Soda Original                               2
ABC Corp  1234  B888888  Drink                                            1
ABC Corp  1234  A888888  Drink Multi flavored                      3

I don't have a list of brands, so I need a way to identify brand keyword (e.g. Soda, Drink are the keywords)

So for 1 word brand_name, two possibilities: brand itself (Soda) or word containing the brand (SodaRewards, iSoda). 

2-word brand_names, two possibilities: 2-word brand (Best Quality, Higher Standards) or phrase containing 1-word brand (Vanilla Soda, Soda Original).

3-word brand_names, three possibilities: 3-word brand (Best Fried Chicken), phrase containing 1-word brand (Drink Multi Flavored), or phrase containing 2-word brand  (Best Quality Tools).

4-word brand_name, four possibilities: 4-word brand (The American's Finest Style), phrase containing 1-word brand, phrase containing 2-word brand, or phrase containing 3-word brand.

and so on....The longest string contains n words.

The search is done at the firm level.

Also, most brands are short, the longer strings are usually catchphrase promoting the brand.  Catchphrase could contain the brand (obs.3) or not related to the brand.  Those catchphrases not containing the brand can be treated as a separate brand.

I need a way to search for brand keywords using the algorithm, so that the output will be something like

company_name  company_id  brand_id    brand_name                       keyword

ABC Corp           1234              A888888   Vallina Soda                       Soda
ABC Corp           1234              A666666    Diet Soda Sugar Free       Soda
ABC Corp           1234              A555555    Share a Soda Moment      Soda
ABC Corp           1234              A333333    SodaRewards                   Soda
ABC Corp           1234              A222222    Soda                                 Soda
ABC Corp           1234              A000000    iSoda                                 Soda
ABC Corp           1234              A999999    Soda Original                    Soda
ABC Corp           1234              B888888    Drink                                 Drink
ABC Corp           1234              A888888    Drink Multi flavored           Drink

Can some one guide how can i achieve this.

4 REPLIES 4
art297
Opal | Level 21

One way would be to add an array onto each record, namely an array that contains all combinations of your brand_name and, if you also want to be able to search regardless of order (e.g., both "vanilla soda" and "soda vanilla", permutations.

One way to do that is described in the thread:

In that thread, I posted a link to two SAS macros that can be used to get all combinations and permutations.

I, personally, would upcase all of the entries to that array.

Then you can simply do a search, upcasing the term and using the IN function.

Note: the link I provided to the combinations and permutations macros apparently is no longer on support.sas.com. However, you can still find them at: http://www.urz.uni-heidelberg.de/statistik/sas/doc/ts498-combperm.txt

Message was edited by: Arthur Tabachneck

PGStats
Opal | Level 21

You could create a list of keywords and use SQL with a CONTAINS condition:

data phrases;

length company_name $16 brand_id $10 brand_name $32;

input company_name & company_id brand_id $ brand_name & ;

datalines;

ABC Corp           1234              A888888    Vallina Soda          

ABC Corp           1234              A666666    Diet Soda Sugar Free 

ABC Corp           1234              A555555    Share a Soda Moment  

ABC Corp           1234              A333333    SodaRewards          

ABC Corp           1234              A222222    Soda                 

ABC Corp           1234              A000000    iSoda                

ABC Corp           1234              A999999    Soda Original        

ABC Corp           1234              B888888    Drink                

ABC Corp           1234              A888888    Drink Multi flavored 

;

data keywords;

input keyword :$12.;

datalines;

Soda

Drink

;

proc sql;

create table keyPhrases as

select p.*, k.keyword

from

     phrases as p cross join

     keywords as k

where upcase(p.brand_name) contains upcase(trim(k.keyword));

select * from keyPhrases;

quit;

PG

PG
Ksharp
Super User

As pointed out, you need firstly define these keywords .

data phrases;

length company_name $16 brand_id $10 brand_name $32;

input company_name & company_id brand_id $ brand_name & ;

datalines;

ABC Corp           1234              A888888    Vallina Soda         

ABC Corp           1234              A666666    Diet Soda Sugar Free

ABC Corp           1234              A555555    Share a Soda Moment 

ABC Corp           1234              A333333    SodaRewards         

ABC Corp           1234              A222222    Soda                

ABC Corp           1234              A000000    iSoda               

ABC Corp           1234              A999999    Soda Original       

ABC Corp           1234              B888888    Drink               

ABC Corp           1234              A888888    Drink Multi flavored

;

data want;

set phrases;

keyword=prxchange('s/.*(Soda|Drink).*/$1/io',-1,brand_name);

run;

Xia Keshan

art297
Opal | Level 21

Vaibhav,

You never did indicate whether PG's or KSharp's suggested code solved your problem. If they did, you really ought to mark the question as answered and give them points for providing correct and helpful responses.

While both PG's and KSharp's solutions are about as parsimonious as you can get, the question is whether they do what you need them to do.

The following is far from parsimonious, and may or may not do what you want. Both PG's and KSharps solutions could be expanded to do the same thing as the following code, but I've wanted to do something with the permutation's macro for quite some time.

The code, below, finds all permutations for the maximum number of strings found in the brand_name variable, and then adds all of those permutations into an array. Finally, a search macro lets one search the array to find any records that meet all of the strings identified in the macro's search parameter:

data have;

  informat company_name $30.;

  informat brand_id $7.;

  informat brand_name $30.;

  input company_name & company_id brand_id brand_name & word_count;

  cards;

ABC Corp  1234 A888888  Vanilla Soda  2

ABC Corp  1234 A666666  Diet Soda Sugar Free  4

ABC Corp  1234 A555555  Share a Soda Moment  4

ABC Corp  1234 A333333  SodaRewards  1

ABC Corp  1234 A222222  Soda  1

ABC Corp  1234 A000000  iSoda  1

ABC Corp  1234 A999999  Soda Original  2

ABC Corp  1234 B888888  Drink  1

ABC Corp  1234 A888888  Drink Multi flavored  3

;

%macro permute(r) / parmbuff;      /* the parmbuff option assigns */

  %let i=2;               /* the invocation parameter list to the */

  %let things=;                       /* macro variable &syspbuff */

  %do %while (%scan(&syspbuff,&i) ne );      /* scan the syspbuff */

    %let p&i="%scan(&syspbuff,&i)";             /* to determine r */

    %if &i=2 %then %let things=&&p&i;     /* and count the number */

    %else %let things=&things,&&p&i;            /* of elements, n */

    %let i=%eval(&i+1);

  %end;

  %let n=%eval(&i-2);

  data permute;

    drop i j copy;

    wordcnt=&r.;

    array check (*) r1-r&r;          /* create a total of r */

    %do m=1 %to &r;                   /* variables  for looping */

    do r&m = &things;

    %end;

      copy=0;

      do i=2 to &r;                 /* look for duplicate items */

        do j=1 to i-1;              /* and keep the unique ones */

          if check(j)=check(i) then copy+1;

        end;

      end;

      if copy = 0 then output;        /* writes to a SAS data set */

      if copy = 0 then put r1-r&r;        /* writes to the log    */

    %do m=1 %to &r;

    end;                               /* end the r DO LOOPS */

    %end;

  run;

%mend permute;

proc sql noprint;

  select max(word_count)

    into :max_count

      from have

  ;

quit;

data _null_;

  length words $32767;

  do j=1 to &max_count.;

    words=catx(',',words,j);

  end;

  call symput('words',words);

run;

%macro getperms;

  %do cntr=&max_count. %to 1 %by -1;

    %let wordsplus=&cntr.,&words.;

    %permute(&wordsplus.)

    %if &cntr. eq &max_count. %then %do;

      data need;

        set permute;

      run;

    %end;

    %else %do;

      proc append base=need data=permute nowarn force;

      run;

    %end;

  %end;

  data _null_;

    length excmd $1000;

    set need nobs=numobs end=lastone;

    by descending wordcnt;

    array permpos(&max_count.) r:;

    if _n_ eq 1 then do;

      call execute('data want; set have;');

      excmd=catt('array words(',numobs,') $255.;');

      call execute(excmd);

      excmd=catt('array word_in(',&max_count.,') $255.;');

      call execute(excmd);

      call execute("_n_=1; do while (scan(brand_name,_n_) ne '');");

      call execute('word_in(_n_)=upcase(scan(brand_name,_n_));_n_+1;end;');

      call execute('wordcntr=_n_-1; _n_=1;warraycnt=0;');

    end;

    highest=max(of permpos(*));

    if first.wordcnt then do;

      excmd=catx(' ','if',wordcnt,'le wordcntr then do;');

      call execute(excmd);

    end;

    excmd=catx(' ','if',highest,'le wordcntr then do;');

    call execute(excmd);

    excmd=catt('warraycnt+1; words(warraycnt)=word_in(',permpos(1),');');

    call execute(excmd);

    do i=2 to wordcnt;

      excmd=catt("words(warraycnt)=catx(' ',words(warraycnt),word_in(",permpos(i),'));');

      call execute(excmd);

    end;

    call execute('end;');

    if last.wordcnt then call execute('end;');

    if lastone then call execute('run;');

  run;

%mend;

%getperms

%macro search(searchstring);

  data result (drop=words: word_: warraycnt wordcntr);

    set want;

    array strings(*) $255. words:;

    do _n_=1 to dim(strings);

      if upcase("&searchstring.") eq strings(_n_) then do;

        output;

        leave;

      end;

    end;

  run;

%mend;

/*test cases*/

%search(soda)

%search(share soda moment)

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 793 views
  • 0 likes
  • 4 in conversation