Hello,
I am a SAS beginner and am experiencing an issue that I'm not sure how to resolve. Part of the problem is my lack of technical vocabulary to even explain what I think it happening. Hopefully you can help me!
I'm trying to extract data from an internet forum, but several of my columns are mismatched to the proper user, where it seems that it's attaching the html for the previous case to the next one.
I looked at the text file you originally posted. It contains multiples of the same web page.
I think you may be looking at the wrong field to determine the end of a post. Try the following (of course you'll have to change the library, filenames and revert back to the url as I analyzed the text file you attached:
data MMMdata.uberdrivermessagepagetest; /* filename fileref url 'https://forum.mrmoneymustache.com/antimustachian-wall-of-shame-and-comedy/uber-driver/'; */ filename fileref '/folders/myfolders/xuberdrivermessagepagetest.txt'; infile fileref length=len; input record $varying10000. len; file "/folders/myfolders/uberdrivermessagepagetest.txt" mod lrecl=10000; put record; retain startreplynum endreplynum lengthreplynum replynum replynumcheck startposter endposter lengthposter poster postercheck startstache endstache lengthstache stache stachecheck startlocation endlocation lengthlocation location locationcheck startnumposts endnumposts lengthnumposts numposts numpostscheck startpostdatetime endpostdatetime lengthpostdatetime postdatetime postdatetimecheck startsignature endsignature lengthsignature signature signaturecheck startage endage lengthage age agecheck ; *avatarflag=0; if find(record,'<div class="smalltext">« <strong>Reply #') ne 0 then do; startreplynum=find(record,'<div class="smalltext">« <strong>Reply #')+45; endreplynum=find(record,' on:'); lengthreplynum=endreplynum-startreplynum; replynum=substr(record,startreplynum,lengthreplynum); replynumcheck=1; end; if find(record,'<li class="avatar">') then avatarflag = 1; if find(record,'title="View the profile of ') ne 0 then do; startposter=find(record,'">')+2; endposter=find(record,'</a>'); lengthposter=endposter-startposter; poster=substr(record,startposter,lengthposter); postercheck=1; end; if find(record,'<li class="postgroup">') ne 0 then do; startstache=find(record,'li class="postgroup">')+21; endstache=find(record,'</li>'); lengthstache=endstache-startstache; stache=substr(record,startstache,lengthstache); stachecheck=1; end; if find(record,'<li class="postcount">Location: ') ne 0 then do; startlocation=find(record,'Location: ')+10; endlocation=find(record,'</li>'); lengthlocation=endlocation-startlocation; location=substr(record,startlocation,lengthlocation); locationcheck=1; end; if find(record,'<li class="postcount">Posts: ') ne 0 then do; startnumposts=find(record,'Posts: ')+7; endnumposts=find(record,'</li>'); lengthnumposts=endnumposts-startnumposts; numposts=substr(record,startnumposts,lengthnumposts); numpostscheck=1; end; if find(record,'<div class="smalltext">« <strong>') ne 0 then do; startpostdatetime=find(record,'on:</strong> ')+13; endpostdatetime=find(record,' &#'); lengthpostdatetime=endpostdatetime-startpostdatetime; postdatetime=substr(record,startpostdatetime,lengthpostdatetime); postdatetimecheck=1; /**/ endcheck=1; output; call missing(startreplynum); call missing(endreplynum); call missing(lengthreplynum); call missing(replynum); call missing(replynumcheck); call missing(startposter); call missing(endposter); call missing(lengthposter); call missing(poster); call missing(postercheck); call missing(startstache); call missing(endstache); call missing(lengthstache); call missing(stache); call missing(stachecheck); call missing(startlocation); call missing(endlocation); call missing(lengthlocation); call missing(location); call missing(locationcheck); call missing(startnumposts); call missing(endnumposts); call missing(lengthnumposts); call missing(numposts); call missing(numpostscheck); call missing(startpostdatetime); call missing(endpostdatetime); call missing(lengthpostdatetime); call missing(postdatetime); call missing(postdatetimecheck); call missing(startsignature); call missing(endsignature); call missing(lengthsignature); call missing(signature); call missing(signaturecheck); call missing(startage); call missing(endage); call missing(lengthage); call missing(age); call missing(agecheck); call missing(endcheck); /**/ end; if find(record,'<div class="signature" ') ne 0 then do; startsignature=find(record,'<br /><br />')+12; endsignature=find(record,'</div>'); lengthsignature=endsignature-startsignature; signature=substr(record,startsignature,lengthsignature); signaturecheck=1; end; if find(record,'<li class="postcount">Age: ') ne 0 then do; startage=find(record,'Age: ')+5; endage=find(record,'</li>'); lengthage=endage-startage; age=substr(record,startage,lengthage); agecheck=1; end; *if find(record, '<div class="postarea">') then endcheck=1; run;
Art, CEO, AnalystFinder.com
You'd have to post both your code and the html file. More than likely you're simply outputting each record upon capturing the data for the next record.
Art, CEO, AnalystFinder.com
Art, thanks for your reply. I've attached the html file.
Here is the code for the first program:
data MMMdata. uberdrivermessagepagetest;
filename fileref url 'https://forum.mrmoneymustache.com/antimustachian-wall-of-shame-and-comedy/uber-driver/';
infile fileref length=len;
input record $varying10000. len;
file "c:\MMM\uberdrivermessagepagetest.txt" mod lrecl=10000;
put record;
*avatarflag=0;
if find(record,'<div class="smalltext">« <strong>Reply #') ne 0 then do;
startreplynum=find(record,'<div class="smalltext">« <strong>Reply #')+45;
endreplynum=find(record,' on:');
lengthreplynum=endreplynum-startreplynum;
replynum=substr(record,startreplynum,lengthreplynum);
replynumcheck=1;
end;
if find(record,'<li class="avatar">') then avatarflag = 1;
if find(record,'title="View the profile of ') ne 0 then do;
startposter=find(record,'">')+2;
endposter=find(record,'</a>');
lengthposter=endposter-startposter;
poster=substr(record,startposter,lengthposter);
postercheck=1;
end;
if find(record,'<li class="postgroup">') ne 0 then do;
startstache=find(record,'li class="postgroup">')+21;
endstache=find(record,'</li>');
lengthstache=endstache-startstache;
stache=substr(record,startstache,lengthstache);
stachecheck=1;
end;
if find(record,'<li class="postcount">Location: ') ne 0 then do;
startlocation=find(record,'Location: ')+10;
endlocation=find(record,'</li>');
lengthlocation=endlocation-startlocation;
location=substr(record,startlocation,lengthlocation);
locationcheck=1;
end;
if find(record,'<li class="postcount">Posts: ') ne 0 then do;
startnumposts=find(record,'Posts: ')+7;
endnumposts=find(record,'</li>');
lengthnumposts=endnumposts-startnumposts;
numposts=substr(record,startnumposts,lengthnumposts);
numpostscheck=1;
end;
if find(record,'<div class="smalltext">« <strong>') ne 0 then do;
startpostdatetime=find(record,'on:</strong> ')+13;
endpostdatetime=find(record,' &#');
lengthpostdatetime=endpostdatetime-startpostdatetime;
postdatetime=substr(record,startpostdatetime,lengthpostdatetime);
postdatetimecheck=1;
end;
if find(record,'<div class="signature" ') ne 0 then do;
startsignature=find(record,'<br /><br />')+12;
endsignature=find(record,'</div>');
lengthsignature=endsignature-startsignature;
signature=substr(record,startsignature,lengthsignature);
signaturecheck=1;
end;
if find(record,'<li class="postcount">Age: ') ne 0 then do;
startage=find(record,'Age: ')+5;
endage=find(record,'</li>');
lengthage=endage-startage;
age=substr(record,startage,lengthage);
agecheck=1;
end;
if find(record, '<div class="postarea">') then endcheck=1;
run;
And here is the code for the retain program:
data MMMdata. uberdriverclean;
set MMMdata. uberdrivermessagepagetest;
format replynumtemp best12. avatarflagtemp best12.;
retain replynumtemp avatarflagtemp postertemp stachetemp locationtemp numpoststemp postdatetimetemp signaturetemp agetemp;
final=0;
format replynumtemp best12. avatarflagtemp best12.;
retain replynumtemp avatarflagtemp postertemp stachetemp locationtemp numpoststemp postdatetimetemp signaturetemp agetemp;
final=0;
if replynum ne "" then replynumtemp=replynum;
if avatarflag ne "" then avatarflagtemp=avatarflag;
if poster ne "" then postertemp=poster;
if stache ne "" then stachetemp=stache;
if location ne "" then locationtemp=location;
if numposts ne "" then numpoststemp=numposts;
if postdatetime ne "" then postdatetimetemp=postdatetime;
if signature ne "" then signaturetemp=signature;
if age ne "" then agetemp=age;
if endcheck ne "" then do;
endchecktemp=endchek;
final=1;
end;
if final=0 then delete;
keep replynumtemp avatarflagtemp postertemp stachetemp locationtemp numpoststemp postdatetimetemp signaturetemp agetemp;
run;
Thank you!
You didn't post the full code for either of the programs.
Post the missing lines for each program. I can only presume that they're something like:
data whatever;
infile whaterver specs;
input record;
Art, CEO, AnalystFinder.com
I just edited the previous post to include the beginning of each program.
Thank you.
Are you sure you posted exactly what you ran. When I ran your first datastep I got a much smaller file and all of the variables in the SAS dataset had missing values.
Art, CEO, AnalystFinder.com
I looked at the text file you originally posted. It contains multiples of the same web page.
I think you may be looking at the wrong field to determine the end of a post. Try the following (of course you'll have to change the library, filenames and revert back to the url as I analyzed the text file you attached:
data MMMdata.uberdrivermessagepagetest; /* filename fileref url 'https://forum.mrmoneymustache.com/antimustachian-wall-of-shame-and-comedy/uber-driver/'; */ filename fileref '/folders/myfolders/xuberdrivermessagepagetest.txt'; infile fileref length=len; input record $varying10000. len; file "/folders/myfolders/uberdrivermessagepagetest.txt" mod lrecl=10000; put record; retain startreplynum endreplynum lengthreplynum replynum replynumcheck startposter endposter lengthposter poster postercheck startstache endstache lengthstache stache stachecheck startlocation endlocation lengthlocation location locationcheck startnumposts endnumposts lengthnumposts numposts numpostscheck startpostdatetime endpostdatetime lengthpostdatetime postdatetime postdatetimecheck startsignature endsignature lengthsignature signature signaturecheck startage endage lengthage age agecheck ; *avatarflag=0; if find(record,'<div class="smalltext">« <strong>Reply #') ne 0 then do; startreplynum=find(record,'<div class="smalltext">« <strong>Reply #')+45; endreplynum=find(record,' on:'); lengthreplynum=endreplynum-startreplynum; replynum=substr(record,startreplynum,lengthreplynum); replynumcheck=1; end; if find(record,'<li class="avatar">') then avatarflag = 1; if find(record,'title="View the profile of ') ne 0 then do; startposter=find(record,'">')+2; endposter=find(record,'</a>'); lengthposter=endposter-startposter; poster=substr(record,startposter,lengthposter); postercheck=1; end; if find(record,'<li class="postgroup">') ne 0 then do; startstache=find(record,'li class="postgroup">')+21; endstache=find(record,'</li>'); lengthstache=endstache-startstache; stache=substr(record,startstache,lengthstache); stachecheck=1; end; if find(record,'<li class="postcount">Location: ') ne 0 then do; startlocation=find(record,'Location: ')+10; endlocation=find(record,'</li>'); lengthlocation=endlocation-startlocation; location=substr(record,startlocation,lengthlocation); locationcheck=1; end; if find(record,'<li class="postcount">Posts: ') ne 0 then do; startnumposts=find(record,'Posts: ')+7; endnumposts=find(record,'</li>'); lengthnumposts=endnumposts-startnumposts; numposts=substr(record,startnumposts,lengthnumposts); numpostscheck=1; end; if find(record,'<div class="smalltext">« <strong>') ne 0 then do; startpostdatetime=find(record,'on:</strong> ')+13; endpostdatetime=find(record,' &#'); lengthpostdatetime=endpostdatetime-startpostdatetime; postdatetime=substr(record,startpostdatetime,lengthpostdatetime); postdatetimecheck=1; /**/ endcheck=1; output; call missing(startreplynum); call missing(endreplynum); call missing(lengthreplynum); call missing(replynum); call missing(replynumcheck); call missing(startposter); call missing(endposter); call missing(lengthposter); call missing(poster); call missing(postercheck); call missing(startstache); call missing(endstache); call missing(lengthstache); call missing(stache); call missing(stachecheck); call missing(startlocation); call missing(endlocation); call missing(lengthlocation); call missing(location); call missing(locationcheck); call missing(startnumposts); call missing(endnumposts); call missing(lengthnumposts); call missing(numposts); call missing(numpostscheck); call missing(startpostdatetime); call missing(endpostdatetime); call missing(lengthpostdatetime); call missing(postdatetime); call missing(postdatetimecheck); call missing(startsignature); call missing(endsignature); call missing(lengthsignature); call missing(signature); call missing(signaturecheck); call missing(startage); call missing(endage); call missing(lengthage); call missing(age); call missing(agecheck); call missing(endcheck); /**/ end; if find(record,'<div class="signature" ') ne 0 then do; startsignature=find(record,'<br /><br />')+12; endsignature=find(record,'</div>'); lengthsignature=endsignature-startsignature; signature=substr(record,startsignature,lengthsignature); signaturecheck=1; end; if find(record,'<li class="postcount">Age: ') ne 0 then do; startage=find(record,'Age: ')+5; endage=find(record,'</li>'); lengthage=endage-startage; age=substr(record,startage,lengthage); agecheck=1; end; *if find(record, '<div class="postarea">') then endcheck=1; run;
Art, CEO, AnalystFinder.com
That did it! Thank you so much, Art!--I really appreciate your help!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.