JAR
Obsidian | Level 7

Dear All,

I am a teacher, and I would like to help my students with their vocabulary in the following manner:

I have a list of words that I want them to learn (say, at “D:\SASFiles\Vocab.txt”). I have a classic book, BleakHouse.txt, in the same location. I would like to use SAS to extract sentences from the book for each word in Vocab.txt.

Here is an illustration of the final result.

alacrity

At the inn we found Mr. Boythorn on horseback, waiting with an open carriage to take us to his house, which was a few miles off. He was overjoyed to see us and dismounted with great alacrity.

approbation

In case I should be taking a liberty in putting your ladyship on your guard when there's no necessity for it, you will endeavour, I should hope, to outlive my presumption, and I shall endeavour to outlive your disapprobation.

audacious

I wouldn't be guilty of the audacious insolence of keeping a lady of the house waiting all this time for any earthly consideration. I would infinitely rather destroy myself--infinitely rather!

capricious

Too capricious and imperious in all she does to be the cause of much surprise in those about her as to anything she does, this woman, loosely muffled, goes out into the moonlight.

A complete solution may be too much of a favour to ask, but if someone can at least help me proceed with the right first steps, it will be highly appreciated:

  1. How to extract sentences from a text file (an ebook in .txt format).
  2. How to compare the two data sets: the vocab data set with the words and the data set with the sentences.

Thanks in advance,

Jijil

1 ACCEPTED SOLUTION (Tom's reply below)

7 REPLIES
Cynthia_sas
SAS Super FREQ

Hi:

  I once wrote a program to do a frequency count of the words in Moby Dick. I posted the program here:

https://communities.sas.com/message/13656#13656
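  Just to illustrate the idea (this is a minimal sketch, not the actual program at that link; the file path simply follows your D:\SASFiles location), a word-frequency count can be as simple as splitting each line into words and handing them to PROC FREQ:

data words(keep=word);
  length word $40;
  infile 'D:\SASFiles\BleakHouse.txt' truncover;
  input line $char80.;
  /* pull out each word on the line, lowercased, with punctuation treated as delimiters */
  do i = 1 by 1 until (missing(word));
    word = lowcase(scan(line, i, ' .,;:!?"()-'));
    if not missing(word) then output;
  end;
run;

proc freq data=words order=freq;
  tables word / maxlevels=50;   /* 50 most frequent words */
run;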

  My text was structured a bit differently from your document, so the program would need a bit of tweaking.

  I see a different issue, though. Finding out whether you have a word match is easy; extracting a sentence is harder, because a sentence may span multiple lines. Your word of interest could be in the middle of a sentence that spans 4 lines. That presents an issue, because it will be hard to go forward and back in the text file. But at least, if you captured the chapter number, paragraph number, and line number, you could tell approximately which chapter and paragraph someone would have to look at in the text. I'm not sure that helps you in your quest. I think you could point to an approximate place, but you would need some manual intervention to detect where the sentence actually started and stopped.

cynthia

Tom
Super User

Reading that book format into paragraphs is not hard.

data book;
  /* c = chapter counter, p = paragraph counter within the chapter,     */
  /* l = line counter within the paragraph, pline = previous line read  */
  length c p l 8 line pline $80
         chapter $30 author $50 title $50;
  retain author title chapter ' ' c -1 p 0 l 0 pline ' ';
  infile 'c:\downloads\BleakHouse.txt' truncover;
  input line $char80.;
  if line=:'***END OF THE PROJECT GUTENBERG EBOOK' then stop;
  else if line=:'Title:' and title=' ' then title = substr(line,8);
  else if line=:'Author:' and author=' ' then author = substr(line,9);
  else if line in: ('PREFACE','CHAPTER') then do; c+1; p=0; l=0; chapter=line; end;
  else if line ^= ' ' then do;
    /* a non-blank line that follows a blank line starts a new paragraph */
    if (pline=' ') then do; p+1; l=0; end;
    l+1;
    if chapter ne ' ' then output;
  end;
  pline=line;
run;

You could put the paragraphs into one long character variable.

data paragraphs;
  /* collapse the lines of each paragraph into one long character variable */
  do until(last.p);
    set book;
    by c p l;
    length paragraph $30000;
    paragraph = catx(' ',paragraph,line);
  end;
  keep c p paragraph chapter author title;
run;

JAR
Obsidian | Level 7

Dear Tom,

I need some more help. There is one other data set (work.vocab), which has only one variable, named 'word'. How do I join this one (work.paragraphs) with work.vocab?

Let me try to illustrate:

The first word in work.vocab is 'alacrity'. The first occurrence of this word in work.paragraphs is in Chapter XVIII, that is, the 1864th observation of work.paragraphs. How do I join these two datasets with this condition?

If you allow me to be greedy and ask for more: for each word, a whole paragraph would be an extravaganza; a mere sentence would be nicer. However, I shall be thankful if you could help me with paragraphs.
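(For reference, one minimal way such a join could be expressed is sketched below. It assumes Tom's work.paragraphs data set, with variables c, p, chapter, and paragraph, plus the work.vocab data set; it lists every paragraph that contains each word, ordered so that the first row per word is its earliest occurrence. The substring condition forces a Cartesian join, so it is slow on a whole novel.)

proc sql;
  create table word_paragraphs as
  select v.word, p.c, p.p, p.chapter, p.paragraph
  from vocab as v, paragraphs as p
  where find(p.paragraph, strip(v.word), 'i') > 0
  order by v.word, p.c, p.p;
quit;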

Sincerely,

Jijil

JerryLeBreton
Pyrite | Level 9

Using your PARAGRAPHS dataset, Tom, one more data step just about wraps it up.

The result adds the sentence and the vocab word to the dataset. It's a bit slow, so I was just reading in the first few vocab words. Hope it helps, JAR.

data matched_sentences(keep=c p paragraph chapter author title sentence vocab_word);
  set paragraphs;
  retain prxid i 0;
  length sentence $10000;
  if _N_=1 then do;
    /* load a temporary array with the vocab words */
    array vocab(400) $40 _temporary_;
    do i=1 to 400 until (no_more);
      infile 'c:\temp\vocab.txt' truncover firstobs=2 obs=10 end=no_more;
      input word $40.;
      vocab{i} = word;
    end;
  end;
  /* For each paragraph, loop through the vocab words looking for a match */
  do word_num = 1 to i;
    prxid = prxparse('/[^!\.\?]*?' || strip(vocab{word_num}) || '.*?[!\.\?]"*/');
    call prxsubstr(prxid, lowcase(paragraph), pos, len);
    if pos > 0 then do;
      sentence = left(substr(paragraph, pos, len));
      vocab_word = vocab{word_num};
      output;
    end;
  end;
run;

It's not perfect, as "Mr." will end the sentence, for example, but the prxparse just needs to be a bit smarter.
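For what it's worth, here is a rough sketch of one way to make the pattern smarter (not from the thread; the test paragraph and the hard-coded word "alacrity" are made up for illustration): a period only counts as a sentence boundary, on either side of the match, when it is not preceded by Mr, Mrs, or Dr.

data _null_;
  length sentence $200;
  paragraph = 'At the inn we found Mr. Boythorn on horseback. He dismounted with great alacrity. His house was a few miles off.';
  prxid = prxparse('/(?:[^.!?]|(?<=Mr)\.|(?<=Mrs)\.|(?<=Dr)\.)*?alacrity' ||
                   '.*?(?:[!?]|(?<!Mr)(?<!Mrs)(?<!Dr)\.)"*/i');
  call prxsubstr(prxid, paragraph, pos, len);
  if pos > 0 then do;
    sentence = left(substr(paragraph, pos, len));
    put sentence=;   /* He dismounted with great alacrity. */
  end;
run;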

PGStats
Opal | Level 21

I found the most delicate part to be the extraction of sentences from paragraphs. Here is my take on it:

 

filename text "&sasforum\datasets\BleakHouse.txt";
filename words "&sasforum\datasets\Vocab.txt";

data paragraphs(keep=parId par);
retain par;
length par $4000;
infile text end=flush truncover;
input;
if missing(_infile_) and not missing(par) then do;
     parId + 1;
     output;
     call missing(par);
     end;
else par = catx(" ", par, _infile_);
if flush and not missing(par) then do;
     parId + 1;
     output;
     end;
run;

data sentences(keep=parId senId sentence);
length sentence $2000;
if prx1=0 then prx1 + prxparse("/""?[[:upper:]][^.!?]+[.!?]""?/");
set paragraphs;
sBeg = 1;
do until (pos = 0);
     call prxnext(prx1, sBeg, -1, par, pos, len);
     if pos > 0 then do;
          sentence = catx(" ", sentence, substr(par, pos, len));
          if  scan(sentence, -1, " -") not in ("Mr.", "Dr.", "Mrs.") then do;
          /* At this point you could filter out very short sentences */
           senId + 1;
           output;
           call missing(sentence);
           end;
      end;
end;
if not missing(sentence) then do;
     senId + 1;
     output;
     end;
run;

data vocab;
length word $20;
infile words truncover firstobs=2;
input word;
run;

data fancyWords(keep=parId senId word wordPos);
array words{1000} $20 (" ");
retain nbWords;
if missing(nbWords) then do;
     do nbWords = 1 to dim(words) by 1 until (endVocab);
          set vocab end=endVocab;
          words{nbWords} = word;
          end;
     end;
set sentences;
lowSent = lowcase(sentence);
do i = 1 to nbWords;
     word = words{i};
     wordPos = index(lowSent, trim(word));
     if wordPos > 0 then output;
     end;
run;

proc sql;
select word, sentence
from fancyWords natural join sentences
order by word, senId;
quit;

PG
JAR
Obsidian | Level 7

Dear PG,

Your code does it all. Thank you so much. As I have already marked Tom's answer as correct, I can't change it, but I sincerely thank you too.

Sincerely,

Jijil

saibhavana
Calcite | Level 5

Dear PG,

 

I have used the above code on a single report. Is there any possible way to do the same thing on multiple reports at a time, by merging them and using a separate column called title, so that the word search can categorise which document each word came from?
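One possible approach (just a sketch adapting PG's paragraphs step, with made-up file names): list the documents in a small data set, then read them all in one DATA step with the FILEVAR= option, carrying the document name along as a title column that the later steps can keep:

data doclist;
  length title $40 fname $200;
  title = "Bleak House";   fname = "D:\SASFiles\BleakHouse.txt"; output;
  title = "Report 2";      fname = "D:\SASFiles\Report2.txt";    output;
run;

data paragraphs(keep=title parId par);
  length par $4000;
  set doclist;
  infile dummy filevar=fname end=flush truncover;
  parId = 0;                    /* restart paragraph numbering for each document */
  do until (flush);
    input;
    if missing(_infile_) and not missing(par) then do;
      parId + 1;  output;  call missing(par);
    end;
    else par = catx(" ", par, _infile_);
  end;
  if not missing(par) then do;  parId + 1;  output;  end;
run;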

 

Thank You

Bhavana

