Looking for an elegant solution

Contributor
Posts: 64

Looking for an elegant solution

I think I can brute force this with a Cartesian product, but I was wondering if anyone could think of a more elegant solution.

DataA:  20 million observations, contains a string variable [sentence] which is a sentence of words

DataB:  A single variable [word] - a unique list of 2,000 words

Goal: Starting with DataA, add a column for each of the 2,000 words. The value should be the # of times that word occurred in the sentence

Thoughts? I know it'd work if I created a ridiculously large Cartesian product of every [sentence] and [word] combination, then looped through the dataset counting occurrences with PRXNEXT(), and finally transposed.

But working with such large data, I'm not sure whether I could do this differently and perhaps save some processing time. At the end of the day, though, I still have to do 20MM x 2K comparisons, so perhaps this *is* the best way.
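For reference, a hedged sketch of that brute-force plan (dataset and variable names follow the post; the words are assumed to contain no regex metacharacters, and PRXPARSE is recompiled per row, which is part of why this approach is slow):

proc sql;
   create table pairs as
   select a.sentence, b.word
   from dataa as a, datab as b;   /* the 20MM x 2K Cartesian product */
quit;

data counts;
   set pairs;
   rid = prxparse('/' || strip(word) || '/');
   start = 1;
   stop = length(sentence);
   n = 0;
   call prxnext(rid, start, stop, sentence, pos, len);
   do while (pos > 0);            /* count every match of word in sentence */
      n + 1;
      call prxnext(rid, start, stop, sentence, pos, len);
   end;
   keep sentence word n;
run;

/* ...then PROC TRANSPOSE by sentence with ID word to go wide */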




All Replies
Grand Advisor
Posts: 17,445

Re: Looking for an elegant solution

A possible one-data-step solution; parts 1 and 2 could be set up by scanning the smaller table to create macro variables. A rough sketch follows below.

  • Load the word list into a temporary array
  • Use a second array, the same size as the word list, to store the counts
  • Loop through the word list once for each row, still doing the 20M x 2K comparisons

Interested to see what others come up with :)
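A minimal sketch of that outline, assuming the thread's example names (dataa, datab, sentence, word), that the words are valid SAS variable names, and that no word exceeds 5 characters:

proc sql noprint;
   select strip(word) into :words separated by ' ' from datab;   /* count-column names */
   select count(*) into :nw from datab;
quit;

data want;
   array w[&nw] $5 _temporary_;          /* the word list */
   array c[&nw] &words;                  /* one counter column per word */
   if _n_ = 1 then do i = 1 to &nw;      /* load the word list once */
      set datab point=i;
      w[i] = word;
   end;
   set dataa;
   do i = 1 to &nw;
      c[i] = 0;
      pos = find(sentence, strip(w[i]), 't');
      do while (pos > 0);                /* count every occurrence */
         c[i] + 1;
         pos = find(sentence, strip(w[i]), 't', pos + 1);
      end;
   end;
   drop i pos word;
run;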

Grand Advisor
Posts: 17,445

Re: Looking for an elegant solution

Another possibility; a sketch follows below.

  • Split each sentence into single words, carrying a SENTENCE_ID
  • Run PROC FREQ on the data from step 1
  • Transpose and limit to the word list
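A hedged sketch of that idea, assuming blank-delimited sentences as in the examples later in this thread, and that the words are valid SAS variable names:

data long;
   length word $5;
   set dataa;
   sentence_id = _n_;
   do i = 1 to countw(sentence, ' ');
      word = scan(sentence, i, ' ');
      output;                            /* one row per word occurrence */
   end;
   keep sentence_id word;
run;

proc freq data=long noprint;
   tables sentence_id*word / out=counts (drop=percent);
run;

proc transpose data=counts out=wide (drop=_name_);
   by sentence_id;
   id word;                              /* one column per distinct word */
   var count;
run;

/* finally, keep only the columns that appear in the word list */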
Grand Advisor
Posts: 10,241

Re: Looking for an elegant solution

Is case going to be an issue? Do you want separate counts for, for example, "Bird" and "bird" if "bird" is in your list, or both, or just one? Or typos such as "biRd"?

How about plurals? Would "birds" in your sentence match "bird" in your list?

And similarly for possessives: would "bird's" match either "birds" or "bird" in your list?

Compound nouns such as Mockingbird?

Contributor
Posts: 64

Re: Looking for an elegant solution

ballardw wrote:

Is case going to be an issue? Do you want separate counts for, for example, "Bird" and "bird" if "bird" is in your list, or both, or just one? Or typos such as "biRd"?

How about plurals? Would "birds" in your sentence match "bird" in your list?

And similarly for possessives: would "bird's" match either "birds" or "bird" in your list?

Compound nouns such as Mockingbird?

Luckily, in my case this won't be an issue.

I am using [Sentence] and [Word] to make the problem more understandable. In reality, my "sentence" is actually a string of characters, like ABBEFFFAFGGHASDFLADSF, and my "words" are 3-, 4-, and 5-letter combinations. So case won't matter.

Respected Advisor
Posts: 4,609

Re: Looking for an elegant solution

Since it's substrings you are looking for, I would use a data step that would:

  1. Give each sentence a unique ID number (_n_)
  2. On _n_=1, load the words into a string array
  3. Use the FIND function repeatedly, with the start-position argument and the T modifier
  4. Output sentenceID, word, and count for each word and sentence where count > 0

Since this is a Cartesian product operation, the size of the inner loop matters most; a rough sketch follows below.
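A sketch of that plan, reusing the word-loading pattern from the earlier sketch in this thread (example names dataa, datab, sentence, word; at most 2,000 words of up to 5 characters assumed):

data long_counts;
   array w[2000] $5 _temporary_;
   if _n_ = 1 then do i = 1 to n;        /* 2. load the word list once */
      set datab point=i nobs=n;
      w[i] = word;
   end;
   set dataa;
   sentenceID = _n_;                     /* 1. unique sentence ID */
   do i = 1 to n;
      count = 0;
      pos = find(sentence, strip(w[i]), 't');    /* 3. repeated FIND */
      do while (pos > 0);
         count + 1;
         pos = find(sentence, strip(w[i]), 't', pos + 1);
      end;
      if count > 0 then do;              /* 4. output only words that occur */
         word = w[i];
         output;
      end;
   end;
   keep sentenceID word count;
run;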

PG
Solution
01-29-2015 04:41 AM
Esteemed Advisor
Posts: 7,253

Re: Looking for an elegant solution

You could also generate a simple datastep from the list of words:

data dataa;
  sentence="ABBEFFFAFGGHASDFLADSF"; output;
  sentence="DFGRATYSUIKLFLKSKJHUH"; output;
  sentence="IUUIUIWEWUYUYUYWEUEWU"; output;
run;

data datab;
  length word $5;   /* long enough for the longest word, else values truncate */
  word="ABB"; output;
  word="FAF"; output;
  word="GRAT"; output;
  word="IU"; output;
run;

/* Generate one counting block per word, then run it all as a single data step */
data _null_;
  set datab end=last;
  if _n_=1 then call execute('data want; set dataa;');
  call execute(strip(word)||'=0;
               pos=find(sentence,"'||strip(word)||'","t",1);
               do while(pos > 0);
                '||strip(word)||'=sum('||strip(word)||',1);
                pos=find(sentence,"'||strip(word)||'","t",pos+1);
               end;');
  if last then call execute('drop pos; run;');
run;

Contributor
Posts: 64

Re: Looking for an elegant solution

RW9 wrote:

data _null_;
  set datab end=last;
  if _n_=1 then call execute('data want; set dataa;');
  call execute(strip(word)||'=0;
               pos=find(sentence,"'||strip(word)||'","t",1);
               do while(pos > 0);
                '||strip(word)||'=sum('||strip(word)||',1);
                pos=find(sentence,"'||strip(word)||'","t",pos+1);
               end;');
  if last then call execute('drop pos; run;');
run;

^  This!

I haven't tried some of the other awesome suggestions yet, but I still intend to.

My first method benchmarked at 10+ hours, and this nifty code-generating data step performed the same work in 7 minutes!

Valued Guide
Posts: 3,206

Re: Looking for an elegant solution

You are posting a question without telling us the real problem you are trying to solve.

A string with possibly 2,000 words and a table with about 2,000 words.

Avoiding the Cartesian product is a good start, but it suggests a mindset still limited to the SQL world. There is more than just that.

What are you trying to achieve?

Maybe a simple example would help us understand that.

---->-- ja karman --<-----
Contributor
Posts: 64

Re: Looking for an elegant solution

Jaap Karman wrote:

You are posting a question without telling us the real problem you are trying to solve.

A string with possibly 2,000 words and a table with about 2,000 words.

Avoiding the Cartesian product is a good start, but it suggests a mindset still limited to the SQL world. There is more than just that.

What are you trying to achieve?

Maybe a simple example would help us understand that.

Yes, it's been a rough transition from SQL to SAS. My brain is always stuck in database world... thinking long instead of wide, thinking about index scans vs. table scans, etc.

In reality, the problem is just counting sequential character patterns in a string. I tried to simplify the problem to "words in sentences" because I feel that's a similar type of problem to what is handled in text mining.

I just posted my current solution, which takes 10+ hours to loop through all the combinations. The Cartesian product solution just won't scale very well to bigger data. I will try some of the interesting ideas posted here next, to see what performance gains I can achieve.

Grand Advisor
Posts: 9,596

Re: Looking for an elegant solution

Hash table. What does your data look like, and what output do you want?

Esteemed Advisor
Posts: 5,202

Re: Looking for an elegant solution

Xia, you are into hashing a lot; that's not good for you in the long run :)

Without telling you how (you already got some nice suggestions you can start with), I just wish to comment on your goal.

Wide tables give me rashes. They are usually useless.

So I would simply split your sentence into words and directly transpose them in the same step (explicit OUTPUT in the data step).

Then you can either use a hash table, as suggested above, to filter your words (preferably before the OUTPUT), or do an inner join in a succeeding step.

And then, just count in your favourite manner. A sketch follows below.
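A hedged sketch of that outline, assuming blank-delimited sentences and the thread's example names (dataa, datab, sentence, word), with the word list and the sentences' words sharing the same length:

data long;
   length word $5;
   if _n_ = 1 then do;
      declare hash h(dataset:'datab');   /* load the word list */
      h.defineKey('word');
      h.defineDone();
   end;
   set dataa;
   sentence_id = _n_;
   do i = 1 to countw(sentence, ' ');
      word = scan(sentence, i, ' ');
      if h.check() = 0 then output;      /* keep only listed words */
   end;
   keep sentence_id word;
run;

/* then count in your favourite manner, e.g. PROC FREQ on sentence_id*word */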

Data never sleeps
Grand Advisor
Posts: 9,596

Re: Looking for an elegant solution

"that's not good for you in the long run"

Reason ??

I know Hash is memory limitation method . For OP's scenario , consider running Hash three times ,each for one table . I still believe it is faster than SQL .

Best

Ksharp

Super Contributor
Posts: 251

Re: Looking for an elegant solution

A data set with 20M rows and 2,000 columns seems beyond visualizing.

Here are tiny data sets with 10 words and 5 sentences. The code below relies on the WHICHC() function to look up each word in &LIST. Hopefully a macro variable gives sufficient length to hold all 2,000 quoted words with separators. A second list, &nLIST, supplies the variable names that hold the counts of words in each sentence. The run time for the big data set is determined by the search function used.

Hoping this scales up to the demand you make.

data have;
  length word $12;
  input word;
datalines;
Atlanta
Baltimore
Boston
Hartford
NewJersey
NewLondon
NewYork
Philadelphia
Sanford
Tampa
;
run;

data have2;
  length line $80;
  input;
  line = _infile_;
datalines;
Atlanta Atlanta Sanford Sanford NewLondon Boston NewYork Hartford Baltimore
Sanford Sanford NewLondon Boston NewYork Hartford Atlanta Baltimore
Hartford Atlanta Baltimore Tampa Sanford Boston Philadelphia NewJersey
Tampa Sanford Boston Philadelphia Sanford NewLondon Boston NewYork
Sanford NewLondon Boston Tampa Sanford Boston Philadelphia
;
run;

/* &LIST: quoted, comma-separated words for WHICHC() */
proc sql noprint;
   select quote(compress(word)) into :LIST separated by ', '
   from have;
quit;
%put &LIST;

/* &nLIST: the same words unquoted, used as variable names for the counts */
proc sql noprint;
   select compress(word) into :nLIST separated by ' '
   from have;
quit;
%put &nLIST;

data want;
  length ID 8 word $12;
  array k[10] _temporary_;
  array x[10] &nLIST;
  do ID = 1 by 1 until(last);
     set have2 end = last;
     nw = countw(line, ' ');
     do i = 1 to nw;
        word = scan(line, i, ' ');
        v = whichc(word, &LIST);
        if v > 0 then k[v] + 1;   /* tally this word */
     end;
     do i = 1 to dim(k);
        if missing(k[i]) then k[i] = 0;
        x[i] = k[i];
     end;
     output;
     call missing(of k[*]);       /* reset the tallies for the next sentence */
  end;
  keep ID &nLIST;
run;

Super Contributor
Posts: 251

Re: Looking for an elegant solution

In the program I posted a while ago, the temporary array can be dispensed with.

The last part of my posting can be replaced with:

data want;
  length ID 8 word $12;
  array x[10] &nLIST;
  do ID = 1 by 1 until(last);
     set have2 end = last;
     nw = countw(line, ' ');
     do i = 1 to nw;
        word = scan(line, i, ' ');
        v = whichc(word, &LIST);
        if v > 0 then x[v] + 1;   /* tally directly into the output variable */
     end;
     do i = 1 to dim(x);
        if missing(x[i]) then x[i] = 0;
     end;
     output;
     call missing(of x[*]);       /* reset for the next sentence */
  end;
  keep ID &nLIST;
run;

