Hello,
I am a SAS beginner and am experiencing an issue that I'm not sure how to resolve. Part of the problem is my lack of technical vocabulary to even explain what I think it happening. Hopefully you can help me!
I'm trying to extract data from an internet forum, but several of my columns are mismatched to the proper user, where it seems that it's attaching the html for the previous case to the next one.
I looked at the text file you originally posted. It contains multiples of the same web page.
I think you may be looking at the wrong field to determine the end of a post. Try the following (of course you'll have to change the library, filenames and revert back to the url as I analyzed the text file you attached:
data MMMdata.uberdrivermessagepagetest; /* filename fileref url 'https://forum.mrmoneymustache.com/antimustachian-wall-of-shame-and-comedy/uber-driver/'; */ filename fileref '/folders/myfolders/xuberdrivermessagepagetest.txt'; infile fileref length=len; input record $varying10000. len; file "/folders/myfolders/uberdrivermessagepagetest.txt" mod lrecl=10000; put record; retain startreplynum endreplynum lengthreplynum replynum replynumcheck startposter endposter lengthposter poster postercheck startstache endstache lengthstache stache stachecheck startlocation endlocation lengthlocation location locationcheck startnumposts endnumposts lengthnumposts numposts numpostscheck startpostdatetime endpostdatetime lengthpostdatetime postdatetime postdatetimecheck startsignature endsignature lengthsignature signature signaturecheck startage endage lengthage age agecheck ; *avatarflag=0; if find(record,'<div class="smalltext">« <strong>Reply #') ne 0 then do; startreplynum=find(record,'<div class="smalltext">« <strong>Reply #')+45; endreplynum=find(record,' on:'); lengthreplynum=endreplynum-startreplynum; replynum=substr(record,startreplynum,lengthreplynum); replynumcheck=1; end; if find(record,'<li class="avatar">') then avatarflag = 1; if find(record,'title="View the profile of ') ne 0 then do; startposter=find(record,'">')+2; endposter=find(record,'</a>'); lengthposter=endposter-startposter; poster=substr(record,startposter,lengthposter); postercheck=1; end; if find(record,'<li class="postgroup">') ne 0 then do; startstache=find(record,'li class="postgroup">')+21; endstache=find(record,'</li>'); lengthstache=endstache-startstache; stache=substr(record,startstache,lengthstache); stachecheck=1; end; if find(record,'<li class="postcount">Location: ') ne 0 then do; startlocation=find(record,'Location: ')+10; endlocation=find(record,'</li>'); lengthlocation=endlocation-startlocation; location=substr(record,startlocation,lengthlocation); locationcheck=1; end; if find(record,'<li class="postcount">Posts: ') ne 0 then do; startnumposts=find(record,'Posts: ')+7; endnumposts=find(record,'</li>'); lengthnumposts=endnumposts-startnumposts; numposts=substr(record,startnumposts,lengthnumposts); numpostscheck=1; end; if find(record,'<div class="smalltext">« <strong>') ne 0 then do; startpostdatetime=find(record,'on:</strong> ')+13; endpostdatetime=find(record,' &#'); lengthpostdatetime=endpostdatetime-startpostdatetime; postdatetime=substr(record,startpostdatetime,lengthpostdatetime); postdatetimecheck=1; /**/ endcheck=1; output; call missing(startreplynum); call missing(endreplynum); call missing(lengthreplynum); call missing(replynum); call missing(replynumcheck); call missing(startposter); call missing(endposter); call missing(lengthposter); call missing(poster); call missing(postercheck); call missing(startstache); call missing(endstache); call missing(lengthstache); call missing(stache); call missing(stachecheck); call missing(startlocation); call missing(endlocation); call missing(lengthlocation); call missing(location); call missing(locationcheck); call missing(startnumposts); call missing(endnumposts); call missing(lengthnumposts); call missing(numposts); call missing(numpostscheck); call missing(startpostdatetime); call missing(endpostdatetime); call missing(lengthpostdatetime); call missing(postdatetime); call missing(postdatetimecheck); call missing(startsignature); call missing(endsignature); call missing(lengthsignature); call missing(signature); call missing(signaturecheck); call missing(startage); call missing(endage); call missing(lengthage); call missing(age); call missing(agecheck); call missing(endcheck); /**/ end; if find(record,'<div class="signature" ') ne 0 then do; startsignature=find(record,'<br /><br />')+12; endsignature=find(record,'</div>'); lengthsignature=endsignature-startsignature; signature=substr(record,startsignature,lengthsignature); signaturecheck=1; end; if find(record,'<li class="postcount">Age: ') ne 0 then do; startage=find(record,'Age: ')+5; endage=find(record,'</li>'); lengthage=endage-startage; age=substr(record,startage,lengthage); agecheck=1; end; *if find(record, '<div class="postarea">') then endcheck=1; run;
Art, CEO, AnalystFinder.com
You'd have to post both your code and the html file. More than likely you're simply outputting each record upon capturing the data for the next record.
Art, CEO, AnalystFinder.com
Art, thanks for your reply. I've attached the html file.
Here is the code for the first program:
data MMMdata. uberdrivermessagepagetest;
filename fileref url 'https://forum.mrmoneymustache.com/antimustachian-wall-of-shame-and-comedy/uber-driver/';
infile fileref length=len;
input record $varying10000. len;
file "c:\MMM\uberdrivermessagepagetest.txt" mod lrecl=10000;
put record;
*avatarflag=0;
if find(record,'<div class="smalltext">« <strong>Reply #') ne 0 then do;
startreplynum=find(record,'<div class="smalltext">« <strong>Reply #')+45;
endreplynum=find(record,' on:');
lengthreplynum=endreplynum-startreplynum;
replynum=substr(record,startreplynum,lengthreplynum);
replynumcheck=1;
end;
if find(record,'<li class="avatar">') then avatarflag = 1;
if find(record,'title="View the profile of ') ne 0 then do;
startposter=find(record,'">')+2;
endposter=find(record,'</a>');
lengthposter=endposter-startposter;
poster=substr(record,startposter,lengthposter);
postercheck=1;
end;
if find(record,'<li class="postgroup">') ne 0 then do;
startstache=find(record,'li class="postgroup">')+21;
endstache=find(record,'</li>');
lengthstache=endstache-startstache;
stache=substr(record,startstache,lengthstache);
stachecheck=1;
end;
if find(record,'<li class="postcount">Location: ') ne 0 then do;
startlocation=find(record,'Location: ')+10;
endlocation=find(record,'</li>');
lengthlocation=endlocation-startlocation;
location=substr(record,startlocation,lengthlocation);
locationcheck=1;
end;
if find(record,'<li class="postcount">Posts: ') ne 0 then do;
startnumposts=find(record,'Posts: ')+7;
endnumposts=find(record,'</li>');
lengthnumposts=endnumposts-startnumposts;
numposts=substr(record,startnumposts,lengthnumposts);
numpostscheck=1;
end;
if find(record,'<div class="smalltext">« <strong>') ne 0 then do;
startpostdatetime=find(record,'on:</strong> ')+13;
endpostdatetime=find(record,' &#');
lengthpostdatetime=endpostdatetime-startpostdatetime;
postdatetime=substr(record,startpostdatetime,lengthpostdatetime);
postdatetimecheck=1;
end;
if find(record,'<div class="signature" ') ne 0 then do;
startsignature=find(record,'<br /><br />')+12;
endsignature=find(record,'</div>');
lengthsignature=endsignature-startsignature;
signature=substr(record,startsignature,lengthsignature);
signaturecheck=1;
end;
if find(record,'<li class="postcount">Age: ') ne 0 then do;
startage=find(record,'Age: ')+5;
endage=find(record,'</li>');
lengthage=endage-startage;
age=substr(record,startage,lengthage);
agecheck=1;
end;
if find(record, '<div class="postarea">') then endcheck=1;
run;
And here is the code for the retain program:
data MMMdata. uberdriverclean;
set MMMdata. uberdrivermessagepagetest;
format replynumtemp best12. avatarflagtemp best12.;
retain replynumtemp avatarflagtemp postertemp stachetemp locationtemp numpoststemp postdatetimetemp signaturetemp agetemp;
final=0;
format replynumtemp best12. avatarflagtemp best12.;
retain replynumtemp avatarflagtemp postertemp stachetemp locationtemp numpoststemp postdatetimetemp signaturetemp agetemp;
final=0;
if replynum ne "" then replynumtemp=replynum;
if avatarflag ne "" then avatarflagtemp=avatarflag;
if poster ne "" then postertemp=poster;
if stache ne "" then stachetemp=stache;
if location ne "" then locationtemp=location;
if numposts ne "" then numpoststemp=numposts;
if postdatetime ne "" then postdatetimetemp=postdatetime;
if signature ne "" then signaturetemp=signature;
if age ne "" then agetemp=age;
if endcheck ne "" then do;
endchecktemp=endchek;
final=1;
end;
if final=0 then delete;
keep replynumtemp avatarflagtemp postertemp stachetemp locationtemp numpoststemp postdatetimetemp signaturetemp agetemp;
run;
Thank you!
You didn't post the full code for either of the programs.
Post the missing lines for each program. I can only presume that they're something like:
data whatever;
infile whaterver specs;
input record;
Art, CEO, AnalystFinder.com
I just edited the previous post to include the beginning of each program.
Thank you.
Are you sure you posted exactly what you ran. When I ran your first datastep I got a much smaller file and all of the variables in the SAS dataset had missing values.
Art, CEO, AnalystFinder.com
I looked at the text file you originally posted. It contains multiples of the same web page.
I think you may be looking at the wrong field to determine the end of a post. Try the following (of course you'll have to change the library, filenames and revert back to the url as I analyzed the text file you attached:
data MMMdata.uberdrivermessagepagetest; /* filename fileref url 'https://forum.mrmoneymustache.com/antimustachian-wall-of-shame-and-comedy/uber-driver/'; */ filename fileref '/folders/myfolders/xuberdrivermessagepagetest.txt'; infile fileref length=len; input record $varying10000. len; file "/folders/myfolders/uberdrivermessagepagetest.txt" mod lrecl=10000; put record; retain startreplynum endreplynum lengthreplynum replynum replynumcheck startposter endposter lengthposter poster postercheck startstache endstache lengthstache stache stachecheck startlocation endlocation lengthlocation location locationcheck startnumposts endnumposts lengthnumposts numposts numpostscheck startpostdatetime endpostdatetime lengthpostdatetime postdatetime postdatetimecheck startsignature endsignature lengthsignature signature signaturecheck startage endage lengthage age agecheck ; *avatarflag=0; if find(record,'<div class="smalltext">« <strong>Reply #') ne 0 then do; startreplynum=find(record,'<div class="smalltext">« <strong>Reply #')+45; endreplynum=find(record,' on:'); lengthreplynum=endreplynum-startreplynum; replynum=substr(record,startreplynum,lengthreplynum); replynumcheck=1; end; if find(record,'<li class="avatar">') then avatarflag = 1; if find(record,'title="View the profile of ') ne 0 then do; startposter=find(record,'">')+2; endposter=find(record,'</a>'); lengthposter=endposter-startposter; poster=substr(record,startposter,lengthposter); postercheck=1; end; if find(record,'<li class="postgroup">') ne 0 then do; startstache=find(record,'li class="postgroup">')+21; endstache=find(record,'</li>'); lengthstache=endstache-startstache; stache=substr(record,startstache,lengthstache); stachecheck=1; end; if find(record,'<li class="postcount">Location: ') ne 0 then do; startlocation=find(record,'Location: ')+10; endlocation=find(record,'</li>'); lengthlocation=endlocation-startlocation; location=substr(record,startlocation,lengthlocation); locationcheck=1; end; if find(record,'<li class="postcount">Posts: ') ne 0 then do; startnumposts=find(record,'Posts: ')+7; endnumposts=find(record,'</li>'); lengthnumposts=endnumposts-startnumposts; numposts=substr(record,startnumposts,lengthnumposts); numpostscheck=1; end; if find(record,'<div class="smalltext">« <strong>') ne 0 then do; startpostdatetime=find(record,'on:</strong> ')+13; endpostdatetime=find(record,' &#'); lengthpostdatetime=endpostdatetime-startpostdatetime; postdatetime=substr(record,startpostdatetime,lengthpostdatetime); postdatetimecheck=1; /**/ endcheck=1; output; call missing(startreplynum); call missing(endreplynum); call missing(lengthreplynum); call missing(replynum); call missing(replynumcheck); call missing(startposter); call missing(endposter); call missing(lengthposter); call missing(poster); call missing(postercheck); call missing(startstache); call missing(endstache); call missing(lengthstache); call missing(stache); call missing(stachecheck); call missing(startlocation); call missing(endlocation); call missing(lengthlocation); call missing(location); call missing(locationcheck); call missing(startnumposts); call missing(endnumposts); call missing(lengthnumposts); call missing(numposts); call missing(numpostscheck); call missing(startpostdatetime); call missing(endpostdatetime); call missing(lengthpostdatetime); call missing(postdatetime); call missing(postdatetimecheck); call missing(startsignature); call missing(endsignature); call missing(lengthsignature); call missing(signature); call missing(signaturecheck); call missing(startage); call missing(endage); call missing(lengthage); call missing(age); call missing(agecheck); call missing(endcheck); /**/ end; if find(record,'<div class="signature" ') ne 0 then do; startsignature=find(record,'<br /><br />')+12; endsignature=find(record,'</div>'); lengthsignature=endsignature-startsignature; signature=substr(record,startsignature,lengthsignature); signaturecheck=1; end; if find(record,'<li class="postcount">Age: ') ne 0 then do; startage=find(record,'Age: ')+5; endage=find(record,'</li>'); lengthage=endage-startage; age=substr(record,startage,lengthage); agecheck=1; end; *if find(record, '<div class="postarea">') then endcheck=1; run;
Art, CEO, AnalystFinder.com
That did it! Thank you so much, Art!--I really appreciate your help!
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.