Solved: Re: Read from a text file

JAR · Posted 08-10-2013 03:14 PM

Dear All,

I am a teacher, and I would like to help my students with their vocabulary in the following manner:

I have a list of words that I want them to learn (say, in “D:\SASFiles\Vocab.txt”). I have a classic book, BleakHouse.txt, in the same location. I would like to use SAS to extract sentences from the book for each word in Vocab.txt.

Here is an illustration of the final result.

alacrity	At the inn we found Mr. Boythorn on horseback, waiting with an open carriage to take us to his house, which was a few miles off. He was overjoyed to see us and dismounted with great alacrity.
approbation	In case I should be taking a liberty in putting your ladyship on your guard when there's no necessity for it, you will endeavour, I should hope, to outlive my presumption, and I shall endeavour to outlive your disapprobation.
audacious	I wouldn't be guilty of the audacious insolence of keeping a lady of the house waiting all this time for any earthly consideration. I would infinitely rather destroy myself--infinitely rather!
capricious	Too capricious and imperious in all she does to be the cause of much surprise in those about her as to anything she does, this woman, loosely muffled, goes out into the moonlight.

I know this is a lot to ask, but I would greatly appreciate it if someone could point me in the right direction on these two steps:

How to extract sentences from a text file (an ebook in txt format).
How to compare the two data sets: the vocab data set with the words and the data set with the sentences.

Thanks in advance,

Jijil

Tom · Posted 08-10-2013 07:24 PM

Reading that book format into paragraphs is not hard.

data book;
   length c p l 8 line pline $80
          chapter $30 author $50 title $50
   ;
   retain author title chapter ' ' c -1 p 0 l 0 pline ' ';
   infile 'c:\downloads\BleakHouse.txt' truncover ;
   input line $char80.;
   if line=:'***END OF THE PROJECT GUTENBERG EBOOK' then stop;
   else if line=:'Title:' and title=' ' then title = substr(line,8);
   else if line=:'Author:' and author=' ' then author = substr(line,9);
   else if line in: ('PREFACE','CHAPTER') then do; c+1; p=0; l=0; chapter=line; end;
   else if line ^= ' ' then do;
      if (pline=' ') then do; p+1; l=0; end;
      l+1;
      if chapter ne ' ' then output;
   end;
   pline=line;
run;

You could put the paragraphs into one long character variable.

data paragraphs;
   do until(last.p);
      set book;
      by c p l;
      length paragraph $30000;
      paragraph = catx(' ',paragraph,line);
   end;
   keep c p paragraph chapter author title ;
run;

Cynthia_sas · Posted 08-10-2013 05:09 PM

Hi:

I once wrote a program to do a frequency count of the words in Moby Dick. I posted the program here:

https://communities.sas.com/message/13656#13656

My text was structured a bit differently than your document, so the program would need tweaking a bit.

I see a different issue, though. Finding out whether you have a word match is easy. Extracting a sentence is harder, because the sentence may span multiple lines: your word of interest could be in the middle of a sentence that spans 4 lines. That presents an issue, because it is hard to move forward and back in the text file. But at least, if you captured the chapter number, paragraph number, and line number, you could tell approximately where someone had to look in the text. I'm not sure that helps you in your quest. I think you could point to an approximate place, but you would need some manual intervention to detect where the sentence actually started and stopped.

cynthia

JAR · Posted 08-11-2013 12:04 AM

Dear Tom,

I need some more help. There is one other data set (work.vocab), which has only one variable, named 'word'. How do I join work.paragraphs with work.vocab?

Let me try to illustrate:

The first word in work.vocab is Alacrity. The first occurrence of this word in work.paragraphs is in Chapter XVIII, which is the 1864th observation of work.paragraphs. How do I join these two data sets on this condition?

If you allow me to be greedy and ask for more... for each word paragraph would be an extravaganza, a mere sentence would be nicer... However, I shall be thankful if you could help me with paragraphs.

Sincerely,

Jijil

JerryLeBreton · Posted 08-11-2013 01:14 AM

Using your PARAGRAPHS dataset, Tom, one more data step just about wraps it up.

The result adds the sentence and the vocab word to the dataset. It's a bit slow, so I was just reading in the first few vocab words. Hope it helps, JAR.

data matched_sentences(keep=c p paragraph chapter author title sentence vocab_word);
   set paragraphs;
   retain prxid i 0;
   length sentence $10000 vocab_word $40;
   array vocab(400) $40 _temporary_;
   if _N_=1 then
   do;
      /* load a temporary array with the vocab words */
      do i=1 to 400 until (no_more);
         infile 'c:\temp\vocab.txt' truncover firstobs=2 obs=10 end=no_more;
         input word $40.;
         vocab{i} = word;
      end;
   end;   /* closes the _N_=1 block (missing in the original post) */
   /* For each paragraph, loop through the vocab words looking for a match */
   /* min() guards against i=401 when all 400 slots were filled */
   do word_num = 1 to min(i, dim(vocab));
      prxid = prxparse('/[^!\.\?]*?' || strip(vocab{word_num}) || '.*?[!\.\?]"*/');
      call prxsubstr(prxid,lowcase(paragraph), pos, len);
      if pos > 0 then
      do;
         sentence = left(substr(paragraph,pos,len));
         vocab_word = vocab{word_num};
         output;
      end;
      call prxfree(prxid);   /* release the pattern compiled for this word */
   end;   /* closes the word loop (missing in the original post) */
run;

It's not perfect, as "Mr." will end the sentence, for example, but it just needs a slightly smarter pattern in the prxparse.
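
A note on the speed issue mentioned above: one likely cause is that the pattern is recompiled for every word on every paragraph. A minimal sketch of compiling each pattern once and reusing it is given here. The datasets VOCAB_DEMO and PARAGRAPHS_DEMO, the array names rx and vw, and the simplified pattern are all made up for illustration; this is not the code from the post, and it still splits sentences on "Mr.".

```sas
/* Hypothetical demo data standing in for the vocab list and PARAGRAPHS */
data vocab_demo;
   length word $40;
   input word;
   datalines;
alacrity
capricious
;

data paragraphs_demo;
   length paragraph $300;
   infile datalines truncover;
   input paragraph $char300.;
   datalines;
He was overjoyed to see us. He dismounted with great alacrity.
The night was calm. She was too capricious to cause surprise!
;

data matched_demo(keep=vocab_word sentence);
   length vocab_word $40 sentence lowpar $300;
   array rx{50} _temporary_;        /* compiled pattern ids, kept across rows */
   array vw{50} $40 _temporary_;    /* the lowercased vocab words             */
   /* First iteration only: read the words and compile one pattern per word */
   if _n_ = 1 then do until (done or n = dim(rx));
      set vocab_demo end=done;
      n + 1;                        /* sum statement: n starts at 0, is retained */
      vw{n} = lowcase(word);
      rx{n} = prxparse('/[^.!?]*' || strip(vw{n}) || '[^.!?]*[.!?]/');
   end;
   set paragraphs_demo;
   lowpar = lowcase(paragraph);     /* search a lowercased copy */
   do k = 1 to n;
      call prxsubstr(rx{k}, lowpar, pos, len);
      if pos > 0 then do;
         vocab_word = vw{k};
         sentence = left(substr(paragraph, pos, len));  /* original casing */
         output;
      end;
   end;
run;
```

The same idea should carry over to the full PARAGRAPHS dataset by raising the array size and reading the real vocab file in the first-iteration block.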

PGStats · Posted 08-10-2013 11:42 PM

I found the most delicate part to be the extraction of sentences from paragraphs. Here is my take on it :

filename text "&sasforum\datasets\BleakHouse.txt";
filename words "&sasforum\datasets\Vocab.txt";

data paragraphs(keep=parId par);
retain par;
length par $4000;
infile text end=flush truncover;
input;
if missing(_infile_) and not missing(par) then do;
     parId + 1;
     output;
     call missing(par);
     end;
else par = catx(" ", par, _infile_);
if flush and not missing(par) then do;
     parId + 1;
     output;
     end;
run;

data sentences(keep=parId senId sentence);
length sentence $2000;
if prx1=0 then prx1 + prxparse("/""?[[:upper:]][^.!?]+[.!?]""?/");
set paragraphs;
sBeg = 1;
do until (pos = 0);
     call prxnext(prx1, sBeg, -1, par, pos, len);
     if pos > 0 then do;
          sentence = catx(" ", sentence, substr(par, pos, len));
          if scan(sentence, -1, " -") not in ("Mr.", "Dr.", "Mrs.") then do;
          /* At this point you could filter out very short sentences */
           senId + 1;
           output;
           call missing(sentence);
           end;
      end;
end;
if not missing(sentence) then do;
     senId + 1;
     output;
     end;
run;

data vocab;
length word $20;
infile words truncover firstobs=2;
input word;
run;

data fancyWords(keep=parId senId word wordPos);
array words{1000} $20 (" ");
retain nbWords;
if missing(nbWords) then do;
     do nbWords = 1 to dim(words) by 1 until (endVocab);
          set vocab end=endVocab;
          words{nbWords} = word;
          end;
     end;
set sentences;
lowSent = lowcase(sentence);
do i = 1 to nbWords;
     word = words{i};
     wordPos = index(lowSent, trim(word));
     if wordPos > 0 then output;
     end;
run;

proc sql;
select word, sentence
from fancyWords natural join sentences
order by word, senId;
quit;

PG
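
A note on the matching step: index() finds substrings, so a vocab word such as "approbation" also matches inside "disapprobation", which is what happened in the illustration at the top of the thread. One possible alternative, sketched here under the assumption that whole-word matches are wanted, is FINDW with the 'i' (ignore case) and 'p' (treat punctuation as delimiters) modifiers. The datasets SENTENCES_DEMO and VOCAB_DEMO are made up for the example.

```sas
/* Hypothetical demo data; the real input would be SENTENCES and VOCAB */
data sentences_demo;
   length sentence $200;
   infile datalines truncover;
   input sentence $char200.;
   datalines;
He dismounted with great alacrity.
I shall endeavour to outlive your disapprobation.
Her approbation was plain to see.
;

data vocab_demo;
   length word $20;
   input word;
   datalines;
alacrity
approbation
;

proc sql;
   /* findw() requires a delimiter on both sides of the word,          */
   /* so "approbation" is not meant to match inside "disapprobation".  */
   select v.word, s.sentence
   from vocab_demo as v, sentences_demo as s
   where findw(s.sentence, strip(v.word), ' ', 'ip') > 0
   order by v.word;
quit;
```

In the fancyWords step, the equivalent change would be to swap the index() call for a findw() call with the same arguments; whether that is preferable depends on whether derived forms of a word should count as hits.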

JAR · Posted 08-11-2013 01:32 AM

Dear PG,

Your code does it all..... Thank you so much. As I have already marked Tom's answer as correct, I can't change it. I sincerely thank you too.

Sincerely,

Jijil

saibhavana · Posted 03-09-2016 10:22 AM

Dear PG,

I have used the above code on a simple report. Is there any way to do the same thing on multiple reports at once, by merging them and using a separate column called title to record which document each matched word came from?

Thank You

Bhavana
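
One way this might be approached, assuming each report has already been read into its own paragraphs dataset, is to stack the datasets with a single SET statement and use the INDSNAME= option to record where each row came from. The dataset names REPORT_A and REPORT_B and the variable src are invented for this sketch.

```sas
/* Hypothetical demo data: one paragraphs dataset per report */
data report_a;
   length par $200;
   infile datalines truncover;
   input par $char200.;
   datalines;
The first report mentions alacrity in passing.
;

data report_b;
   length par $200;
   infile datalines truncover;
   input par $char200.;
   datalines;
The second report is rather more capricious.
;

data all_reports;
   length title $41;
   /* INDSNAME= fills src with the name of the contributing dataset. */
   /* src itself is not written to the output, so it is copied.      */
   set report_a report_b indsname=src;
   title = src;
run;
```

The title column can then be carried through the sentence and word-matching steps by adding it to their KEEP= lists, so that each matched word can be traced back to its report.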
