BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
CASLU
Fluorite | Level 6

Hello, 

 

I am a SAS beginner and am experiencing an issue that I'm not sure how to resolve. Part of the problem is my lack of technical vocabulary to even explain what I think it happening. Hopefully you can help me!

 

I'm trying to extract data from an internet forum, but several of my columns are mismatched to the proper user, where it seems that it's attaching the html for the previous case to the next one.

 
For example, the html is structured like this:
  
User1name
User1Reply#
User1postdate
 
User2name
User2Reply#
User2postdate
etc...
 
But when it gathers the data into a spreadsheet it shows user2 having user1's info, and all the following ones are behind. For example:
 
User2name           User1Reply#            User1postdate
User3name           User2Reply#            User2postdate
 
This is the result of using 2 different programs; the first one is a series of find commands, while the second is a retain command to get all data into one spreadsheet; however, I'm not sure which program needs to be adjusted or how to fix it. 
 
Hopefully this makes sense. Thanks in advance for your help!
1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

I looked at the text file you originally posted. It contains multiples of the same web page.

 

I think you may be looking at the wrong field to determine the end of a post. Try the following (of course you'll have to change the library, filenames and revert back to the url as I analyzed the text file you attached:

data MMMdata.uberdrivermessagepagetest; 
/* filename fileref url 'https://forum.mrmoneymustache.com/antimustachian-wall-of-shame-and-comedy/uber-driver/'; */
filename fileref '/folders/myfolders/xuberdrivermessagepagetest.txt';
infile fileref length=len;
input record $varying10000. len; 
file "/folders/myfolders/uberdrivermessagepagetest.txt" mod lrecl=10000;
put record;

retain 
 startreplynum
 endreplynum
 lengthreplynum
 replynum
 replynumcheck

 startposter
 endposter
 lengthposter
 poster
 postercheck
 
 startstache
 endstache
 lengthstache
 stache
 stachecheck

 startlocation
 endlocation
 lengthlocation
 location
 locationcheck

 startnumposts
 endnumposts
 lengthnumposts
 numposts
 numpostscheck

 startpostdatetime
 endpostdatetime
 lengthpostdatetime
 postdatetime
 postdatetimecheck

 startsignature
 endsignature
 lengthsignature
 signature
 signaturecheck

 startage
 endage
 lengthage
 age
 agecheck
 ;
*avatarflag=0;

if find(record,'<div class="smalltext">&#171; <strong>Reply #') ne 0 then do;
startreplynum=find(record,'<div class="smalltext">&#171; <strong>Reply #')+45;
endreplynum=find(record,' on:');
lengthreplynum=endreplynum-startreplynum;
replynum=substr(record,startreplynum,lengthreplynum);
replynumcheck=1;
end;

if find(record,'<li class="avatar">') then avatarflag = 1;

if find(record,'title="View the profile of ') ne 0 then do;
startposter=find(record,'">')+2;
endposter=find(record,'</a>');
lengthposter=endposter-startposter;
poster=substr(record,startposter,lengthposter);
postercheck=1;
end;

if find(record,'<li class="postgroup">') ne 0 then do;
startstache=find(record,'li class="postgroup">')+21;
endstache=find(record,'</li>');
lengthstache=endstache-startstache;
stache=substr(record,startstache,lengthstache);
stachecheck=1;
end;

if find(record,'<li class="postcount">Location: ') ne 0 then do;
startlocation=find(record,'Location: ')+10;
endlocation=find(record,'</li>');
lengthlocation=endlocation-startlocation;
location=substr(record,startlocation,lengthlocation);
locationcheck=1;
end;

if find(record,'<li class="postcount">Posts: ') ne 0 then do;
startnumposts=find(record,'Posts: ')+7;
endnumposts=find(record,'</li>');
lengthnumposts=endnumposts-startnumposts;
numposts=substr(record,startnumposts,lengthnumposts);
numpostscheck=1;
end;

if find(record,'<div class="smalltext">&#171; <strong>') ne 0 then do;
startpostdatetime=find(record,'on:</strong> ')+13;
endpostdatetime=find(record,' &#');
lengthpostdatetime=endpostdatetime-startpostdatetime;
postdatetime=substr(record,startpostdatetime,lengthpostdatetime);
postdatetimecheck=1;
/**/
endcheck=1;
output;
call missing(startreplynum);
call missing(endreplynum);
call missing(lengthreplynum);
call missing(replynum);
call missing(replynumcheck);

call missing(startposter);
call missing(endposter);
call missing(lengthposter);
call missing(poster);
call missing(postercheck);

call missing(startstache);
call missing(endstache);
call missing(lengthstache);
call missing(stache);
call missing(stachecheck);

call missing(startlocation);
call missing(endlocation);
call missing(lengthlocation);
call missing(location);
call missing(locationcheck);

call missing(startnumposts);
call missing(endnumposts);
call missing(lengthnumposts);
call missing(numposts);
call missing(numpostscheck);

call missing(startpostdatetime);
call missing(endpostdatetime);
call missing(lengthpostdatetime);
call missing(postdatetime);
call missing(postdatetimecheck);

call missing(startsignature);
call missing(endsignature);
call missing(lengthsignature);
call missing(signature);
call missing(signaturecheck);

call missing(startage);
call missing(endage);
call missing(lengthage);
call missing(age);
call missing(agecheck);

call missing(endcheck);
/**/
end;

if find(record,'<div class="signature" ') ne 0 then do;
startsignature=find(record,'<br /><br />')+12;
endsignature=find(record,'</div>');
lengthsignature=endsignature-startsignature;
signature=substr(record,startsignature,lengthsignature);
signaturecheck=1;
end;

if find(record,'<li class="postcount">Age: ') ne 0 then do;
startage=find(record,'Age: ')+5;
endage=find(record,'</li>');
lengthage=endage-startage;
age=substr(record,startage,lengthage);
agecheck=1;
end;

*if find(record, '<div class="postarea">') then endcheck=1;

run;

Art, CEO, AnalystFinder.com

 

 

View solution in original post

7 REPLIES 7
art297
Opal | Level 21

You'd have to post both your code and the html file. More than likely you're simply outputting each record upon capturing the data for the next record.

 

Art, CEO, AnalystFinder.com

 

CASLU
Fluorite | Level 6

Art, thanks for your reply. I've attached the html file.

 

Here is the code for the first program:

 

data MMMdata. uberdrivermessagepagetest; 
filename fileref url 'https://forum.mrmoneymustache.com/antimustachian-wall-of-shame-and-comedy/uber-driver/';
infile fileref length=len;
input record $varying10000. len; 
file "c:\MMM\uberdrivermessagepagetest.txt" mod lrecl=10000;
put record;

*avatarflag=0;

if find(record,'<div class="smalltext">&#171; <strong>Reply #') ne 0 then do;
startreplynum=find(record,'<div class="smalltext">&#171; <strong>Reply #')+45;
endreplynum=find(record,' on:');
lengthreplynum=endreplynum-startreplynum;
replynum=substr(record,startreplynum,lengthreplynum);
replynumcheck=1;
end;

if find(record,'<li class="avatar">') then avatarflag = 1;

if find(record,'title="View the profile of ') ne 0 then do;
startposter=find(record,'">')+2;
endposter=find(record,'</a>');
lengthposter=endposter-startposter;
poster=substr(record,startposter,lengthposter);
postercheck=1;
end;

if find(record,'<li class="postgroup">') ne 0 then do;
startstache=find(record,'li class="postgroup">')+21;
endstache=find(record,'</li>');
lengthstache=endstache-startstache;
stache=substr(record,startstache,lengthstache);
stachecheck=1;
end;

if find(record,'<li class="postcount">Location: ') ne 0 then do;
startlocation=find(record,'Location: ')+10;
endlocation=find(record,'</li>');
lengthlocation=endlocation-startlocation;
location=substr(record,startlocation,lengthlocation);
locationcheck=1;
end;

if find(record,'<li class="postcount">Posts: ') ne 0 then do;
startnumposts=find(record,'Posts: ')+7;
endnumposts=find(record,'</li>');
lengthnumposts=endnumposts-startnumposts;
numposts=substr(record,startnumposts,lengthnumposts);
numpostscheck=1;
end;

if find(record,'<div class="smalltext">&#171; <strong>') ne 0 then do;
startpostdatetime=find(record,'on:</strong> ')+13;
endpostdatetime=find(record,' &#');
lengthpostdatetime=endpostdatetime-startpostdatetime;
postdatetime=substr(record,startpostdatetime,lengthpostdatetime);
postdatetimecheck=1;
end;

if find(record,'<div class="signature" ') ne 0 then do;
startsignature=find(record,'<br /><br />')+12;
endsignature=find(record,'</div>');
lengthsignature=endsignature-startsignature;
signature=substr(record,startsignature,lengthsignature);
signaturecheck=1;
end;

if find(record,'<li class="postcount">Age: ') ne 0 then do;
startage=find(record,'Age: ')+5;
endage=find(record,'</li>');
lengthage=endage-startage;
age=substr(record,startage,lengthage);
agecheck=1;
end;

if find(record, '<div class="postarea">') then endcheck=1;

run;

 

 

And here is the code for the retain program:

 

data MMMdata. uberdriverclean; 
set MMMdata. uberdrivermessagepagetest; 

format replynumtemp best12. avatarflagtemp best12.; 
retain replynumtemp avatarflagtemp postertemp stachetemp locationtemp numpoststemp postdatetimetemp signaturetemp agetemp; 
final=0;

format replynumtemp best12. avatarflagtemp best12.;
retain replynumtemp avatarflagtemp postertemp stachetemp locationtemp numpoststemp postdatetimetemp signaturetemp agetemp; 
final=0;
if replynum ne "" then replynumtemp=replynum; 
if avatarflag ne "" then avatarflagtemp=avatarflag;
if poster ne "" then postertemp=poster;
if stache ne "" then stachetemp=stache;
if location ne "" then locationtemp=location;
if numposts ne "" then numpoststemp=numposts;
if postdatetime ne "" then postdatetimetemp=postdatetime;
if signature ne "" then signaturetemp=signature;
if age ne "" then agetemp=age;

if endcheck ne "" then do;
endchecktemp=endchek;

final=1;
end;

if final=0 then delete;
keep replynumtemp avatarflagtemp postertemp stachetemp locationtemp numpoststemp postdatetimetemp signaturetemp agetemp;

run;

 

Thank you!

art297
Opal | Level 21

You didn't post the full code for either of the programs.

 

Post the missing lines for each program. I can only presume that they're something like:

data whatever;

  infile whaterver specs;

  input record;

 

Art, CEO, AnalystFinder.com

 

 

CASLU
Fluorite | Level 6

I just edited the previous post to include the beginning of each program. 

 

Thank you.

art297
Opal | Level 21

Are you sure you posted exactly what you ran. When I ran your first datastep I got a much smaller file and all of the variables in the SAS dataset had missing values.

 

Art, CEO, AnalystFinder.com

 

art297
Opal | Level 21

I looked at the text file you originally posted. It contains multiples of the same web page.

 

I think you may be looking at the wrong field to determine the end of a post. Try the following (of course you'll have to change the library, filenames and revert back to the url as I analyzed the text file you attached:

data MMMdata.uberdrivermessagepagetest; 
/* filename fileref url 'https://forum.mrmoneymustache.com/antimustachian-wall-of-shame-and-comedy/uber-driver/'; */
filename fileref '/folders/myfolders/xuberdrivermessagepagetest.txt';
infile fileref length=len;
input record $varying10000. len; 
file "/folders/myfolders/uberdrivermessagepagetest.txt" mod lrecl=10000;
put record;

retain 
 startreplynum
 endreplynum
 lengthreplynum
 replynum
 replynumcheck

 startposter
 endposter
 lengthposter
 poster
 postercheck
 
 startstache
 endstache
 lengthstache
 stache
 stachecheck

 startlocation
 endlocation
 lengthlocation
 location
 locationcheck

 startnumposts
 endnumposts
 lengthnumposts
 numposts
 numpostscheck

 startpostdatetime
 endpostdatetime
 lengthpostdatetime
 postdatetime
 postdatetimecheck

 startsignature
 endsignature
 lengthsignature
 signature
 signaturecheck

 startage
 endage
 lengthage
 age
 agecheck
 ;
*avatarflag=0;

if find(record,'<div class="smalltext">&#171; <strong>Reply #') ne 0 then do;
startreplynum=find(record,'<div class="smalltext">&#171; <strong>Reply #')+45;
endreplynum=find(record,' on:');
lengthreplynum=endreplynum-startreplynum;
replynum=substr(record,startreplynum,lengthreplynum);
replynumcheck=1;
end;

if find(record,'<li class="avatar">') then avatarflag = 1;

if find(record,'title="View the profile of ') ne 0 then do;
startposter=find(record,'">')+2;
endposter=find(record,'</a>');
lengthposter=endposter-startposter;
poster=substr(record,startposter,lengthposter);
postercheck=1;
end;

if find(record,'<li class="postgroup">') ne 0 then do;
startstache=find(record,'li class="postgroup">')+21;
endstache=find(record,'</li>');
lengthstache=endstache-startstache;
stache=substr(record,startstache,lengthstache);
stachecheck=1;
end;

if find(record,'<li class="postcount">Location: ') ne 0 then do;
startlocation=find(record,'Location: ')+10;
endlocation=find(record,'</li>');
lengthlocation=endlocation-startlocation;
location=substr(record,startlocation,lengthlocation);
locationcheck=1;
end;

if find(record,'<li class="postcount">Posts: ') ne 0 then do;
startnumposts=find(record,'Posts: ')+7;
endnumposts=find(record,'</li>');
lengthnumposts=endnumposts-startnumposts;
numposts=substr(record,startnumposts,lengthnumposts);
numpostscheck=1;
end;

if find(record,'<div class="smalltext">&#171; <strong>') ne 0 then do;
startpostdatetime=find(record,'on:</strong> ')+13;
endpostdatetime=find(record,' &#');
lengthpostdatetime=endpostdatetime-startpostdatetime;
postdatetime=substr(record,startpostdatetime,lengthpostdatetime);
postdatetimecheck=1;
/**/
endcheck=1;
output;
call missing(startreplynum);
call missing(endreplynum);
call missing(lengthreplynum);
call missing(replynum);
call missing(replynumcheck);

call missing(startposter);
call missing(endposter);
call missing(lengthposter);
call missing(poster);
call missing(postercheck);

call missing(startstache);
call missing(endstache);
call missing(lengthstache);
call missing(stache);
call missing(stachecheck);

call missing(startlocation);
call missing(endlocation);
call missing(lengthlocation);
call missing(location);
call missing(locationcheck);

call missing(startnumposts);
call missing(endnumposts);
call missing(lengthnumposts);
call missing(numposts);
call missing(numpostscheck);

call missing(startpostdatetime);
call missing(endpostdatetime);
call missing(lengthpostdatetime);
call missing(postdatetime);
call missing(postdatetimecheck);

call missing(startsignature);
call missing(endsignature);
call missing(lengthsignature);
call missing(signature);
call missing(signaturecheck);

call missing(startage);
call missing(endage);
call missing(lengthage);
call missing(age);
call missing(agecheck);

call missing(endcheck);
/**/
end;

if find(record,'<div class="signature" ') ne 0 then do;
startsignature=find(record,'<br /><br />')+12;
endsignature=find(record,'</div>');
lengthsignature=endsignature-startsignature;
signature=substr(record,startsignature,lengthsignature);
signaturecheck=1;
end;

if find(record,'<li class="postcount">Age: ') ne 0 then do;
startage=find(record,'Age: ')+5;
endage=find(record,'</li>');
lengthage=endage-startage;
age=substr(record,startage,lengthage);
agecheck=1;
end;

*if find(record, '<div class="postarea">') then endcheck=1;

run;

Art, CEO, AnalystFinder.com

 

 

CASLU
Fluorite | Level 6

That did it! Thank you so much, Art!--I really appreciate your help!

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1132 views
  • 0 likes
  • 2 in conversation