Dear All,
I am a teacher, and I would like to help my students with their vocabulary in the following way.
I have a list of words that I want them to learn (say, in “D:\SASFiles\Vocab.txt”). I have a classic book, BleakHouse.txt, in the same location. I would like to use SAS to extract sentences from the book for each word in Vocab.txt.
Here is an illustration of the final result.
alacrity | At the inn we found Mr. Boythorn on horseback, waiting with an open carriage to take us to his house, which was a few miles off. He was overjoyed to see us and dismounted with great alacrity. |
approbation | In case I should be taking a liberty in putting your ladyship on your guard when there's no necessity for it, you will endeavour, I should hope, to outlive my presumption, and I shall endeavour to outlive your disapprobation. |
audacious | I wouldn't be guilty of the audacious insolence of keeping a lady of the house waiting all this time for any earthly consideration. I would infinitely rather destroy myself--infinitely rather! |
capricious | Too capricious and imperious in all she does to be the cause of much surprise in those about her as to anything she does, this woman, loosely muffled, goes out into the moonlight. |
This may be too much of a favor to ask, but if someone could at least point me in the right direction, it would be highly appreciated.
Thanks in advance,
Jijil
Reading a book in that format into paragraphs is not hard.
data book;
  length c p l 8 line pline $80
         chapter $30 author $50 title $50
  ;
  /* c = chapter counter, p = paragraph within chapter, l = line within paragraph */
  retain author title chapter ' ' c -1 p 0 l 0 pline ' ';
  infile 'c:\downloads\BleakHouse.txt' truncover ;
  input line $char80.;
  if line=:'***END OF THE PROJECT GUTENBERG EBOOK' then stop;
  else if line=:'Title:' and title=' ' then title = substr(line,8);
  else if line=:'Author:' and author=' ' then author = substr(line,9);
  else if line in: ('PREFACE','CHAPTER') then do; c+1; p=0; l=0; chapter=line; end;
  else if line ^= ' ' then do;
    /* a non-blank line that follows a blank line starts a new paragraph */
    if (pline=' ') then do; p+1; l=0; end;
    l+1;
    /* skip the front matter before the first PREFACE/CHAPTER heading */
    if chapter ne ' ' then output;
  end;
  pline=line;
run;
You could put the paragraphs into one long character variable.
data paragraphs;
  do until(last.p);
    set book;
    by c p l;
    length paragraph $30000;
    paragraph = catx(' ',paragraph,line);
  end;
  keep c p paragraph chapter author title ;
run;
Hi:
I once wrote a program to do a frequency count of the words in Moby Dick. I posted the program here:
https://communities.sas.com/message/13656#13656
My text was structured a bit differently than your document, so the program would need tweaking a bit.
I see a different issue, though. Finding out whether you have a word match is easy. Extracting a sentence is harder, because the sentence may span multiple lines. Your word of interest could be in the middle of a sentence that spans 4 lines, and it is hard to move forward and back in a text file as you read it. But if you captured the chapter number, paragraph number and line number, you could at least tell someone approximately where to look in the text. I'm not sure that helps you in your quest. I think you could point to an approximate place, but you would need some manual intervention to detect where the sentence actually starts and stops.
cynthia
Dear Tom,
I need some more help. There is another data set (work.vocab), which has only one variable, named 'word'. How do I join work.paragraphs with work.vocab?
Let me try to illustrate:
The first word in work.vocab is alacrity. The first occurrence of this word in work.paragraphs is in Chapter XVIII, which is the 1864th observation of work.paragraphs. How do I join these two datasets on this condition?
If you allow me to be greedy and ask for more: a whole paragraph for each word would be extravagant, and a single sentence would be nicer. However, I shall be thankful if you could help me with paragraphs.
Sincerely,
Jijil
Using your PARAGRAPHS dataset Tom, one more data step just about wraps it up.
The result is to add the sentence and vocab word to the dataset. It's a bit slow, so I was only reading in the first few vocab words. Hope it helps JAR.
data matched_sentences(keep=c p paragraph chapter author title sentence vocab_word);
  set paragraphs;
  retain prxid i 0;
  length sentence $10000;
  if _N_=1 then
  do;
    /* load a temporary array with the vocab words */
    array vocab(400) $40 _temporary_;
    do i=1 to 400 until (no_more);
      infile 'c:\temp\vocab.txt' truncover firstobs=2 obs=10 end=no_more;
      input word $40.;
      vocab{i} = word;
    end;
  end;
  /* For each paragraph, loop through the vocab words looking for a match */
  do word_num = 1 to i;
    /* the pattern assumes the vocab words are stored in lower case */
    prxid = prxparse('/[^!\.\?]*?' || strip(vocab{word_num}) || '.*?[!\.\?]"*/');
    call prxsubstr(prxid,lowcase(paragraph), pos, len);
    if pos > 0 then
    do;
      sentence = left(substr(paragraph,pos,len));
      vocab_word = vocab{word_num};
      output;
    end;
    /* release the compiled pattern; a new one is built for every word and paragraph */
    call prxfree(prxid);
  end;
run;
It's not perfect: "Mr.", for example, will end the sentence, so the pattern in prxparse needs to be a bit smarter.
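One way the pattern could be made a bit smarter is sketched below. This is only an illustration under stated assumptions, not the code from the post above: the test paragraph is made up, the variable names (masked, word) are hypothetical, the list of titles (Mr, Mrs, Dr) is just a starting point, and '#' is an arbitrary stand-in character. The idea is to mask the period after a title in a copy of the paragraph, search the copy, and then cut the sentence out of the original text. The \b anchors are an added assumption that only whole words should match.

```sas
data _null_;
  length paragraph masked sentence $200 word $40;
  /* made-up test paragraph, for illustration only */
  paragraph = 'At the inn we found Mr. Boythorn on horseback. He dismounted with great alacrity.';
  word = 'horseback';
  /* Replace the period after a title with '#'. The replacement is one
     character long, so positions in MASKED and PARAGRAPH still line up. */
  masked = prxchange('s/\b(Mr|Mrs|Dr)\./$1#/', -1, paragraph);
  /* \b on both sides of the word avoids matches inside longer words */
  prxid = prxparse('/[^!.?]*\b' || strip(word) || '\b[^!.?]*[!.?]"?/i');
  call prxsubstr(prxid, masked, pos, len);
  /* search the masked copy, but extract from the original paragraph */
  if pos > 0 then sentence = left(substr(paragraph, pos, len));
  call prxfree(prxid);
  put sentence=;
run;
```

Because the search runs on the masked copy, the period in "Mr." no longer counts as a sentence end, while the extracted text still shows the original punctuation.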
I found the most delicate part to be the extraction of sentences from paragraphs. Here is my take on it:
filename text "&sasforum\datasets\BleakHouse.txt";
filename words "&sasforum\datasets\Vocab.txt";
data paragraphs(keep=parId par);
  retain par;
  length par $4000;
  infile text end=flush truncover;
  input;
  if missing(_infile_) and not missing(par) then do;
    parId + 1;
    output;
    call missing(par);
  end;
  else par = catx(" ", par, _infile_);
  if flush and not missing(par) then do;
    parId + 1;
    output;
  end;
run;
data sentences(keep=parId senId sentence);
  length sentence $2000;
  if prx1=0 then prx1 + prxparse("/""?[[:upper:]][^.!?]+[.!?]""?/");
  set paragraphs;
  sBeg = 1;
  do until (pos = 0);
    call prxnext(prx1, sBeg, -1, par, pos, len);
    if pos > 0 then do;
      sentence = catx(" ", sentence, substr(par, pos, len));
      if scan(sentence, -1, " -") not in ("Mr.", "Dr.", "Mrs.") then do;
        /* At this point you could filter out very short sentences */
        senId + 1;
        output;
        call missing(sentence);
      end;
    end;
  end;
  if not missing(sentence) then do;
    senId + 1;
    output;
  end;
run;
data vocab;
  length word $20;
  infile words truncover firstobs=2;
  input word;
run;
data fancyWords(keep=parId senId word wordPos);
  array words{1000} $20 (" ");
  retain nbWords;
  if missing(nbWords) then do;
    do nbWords = 1 to dim(words) by 1 until (endVocab);
      set vocab end=endVocab;
      words{nbWords} = word;
    end;
  end;
  set sentences;
  lowSent = lowcase(sentence);
  do i = 1 to nbWords;
    word = words{i};
    wordPos = index(lowSent, trim(word));
    if wordPos > 0 then output;
  end;
run;
proc sql;
  select word, sentence
  from fancyWords natural join sentences
  order by word, senId;
quit;
PG
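A side note on the matching step: index() finds substrings, so a word such as approbation is also found inside disapprobation, as in the illustration at the top of the thread. If whole-word matches are wanted, one possible variation is the FINDW function. The sketch below is hedged: the second test sentence is made up, substrPos is a hypothetical name, and the choice of blank plus punctuation as word delimiters (the ' ' argument with the 'p' modifier) is an assumption about what counts as a word boundary.

```sas
data _null_;
  length sentence $200 word $20;
  word = 'approbation';
  /* two test sentences: one contains the word only inside a longer word */
  do sentence = 'I shall endeavour to outlive your disapprobation.',
                'She nodded her approbation, and left.';
    /* substring search, as in the fancyWords step */
    substrPos = index(lowcase(sentence), trim(word));
    /* whole-word search: 'i' ignores case, 'p' adds punctuation to the delimiters */
    wordPos = findw(sentence, trim(word), ' ', 'ip');
    put substrPos= wordPos= sentence=;
  end;
run;
```

In the fancyWords step, the same call could replace the index() line. Since the 'i' modifier ignores case, the lowSent copy would then no longer be needed.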
Dear PG,
Your code does it all... Thank you so much. As I have already marked Tom's answer as correct, I can't change it. I sincerely thank you too.
Sincerely,
Jijil
Dear PG,
I have used the above code on a simple report. Is there any way to do the same thing on multiple reports at once, by merging them and adding a separate column called title, so that each word match is categorised by the document it came from?
Thank You
Bhavana