DATA Step, Macro, Functions and more

Mismatched data

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 6
Accepted Solution

Mismatched data

Hello, 

 

I am a SAS beginner and am experiencing an issue that I'm not sure how to resolve. Part of the problem is my lack of technical vocabulary to even explain what I think it happening. Hopefully you can help me!

 

I'm trying to extract data from an internet forum, but several of my columns are mismatched to the proper user, where it seems that it's attaching the html for the previous case to the next one.

 
For example, the html is structured like this:
  
User1name
User1Reply#
User1postdate
 
User2name
User2Reply#
User2postdate
etc...
 
But when it gathers the data into a spreadsheet it shows user2 having user1's info, and all the following ones are behind. For example:
 
User2name           User1Reply#            User1postdate
User3name           User2Reply#            User2postdate
 
This is the result of using 2 different programs; the first one is a series of find commands, while the second is a retain command to get all data into one spreadsheet; however, I'm not sure which program needs to be adjusted or how to fix it. 
 
Hopefully this makes sense. Thanks in advance for your help!

Accepted Solutions
Solution
‎04-01-2018 12:05 PM
PROC Star
Posts: 8,096

Re: Mismatched data

I looked at the text file you originally posted. It contains multiples of the same web page.

 

I think you may be looking at the wrong field to determine the end of a post. Try the following (of course you'll have to change the library, filenames and revert back to the url as I analyzed the text file you attached:

data MMMdata.uberdrivermessagepagetest; 
/* filename fileref url 'https://forum.mrmoneymustache.com/antimustachian-wall-of-shame-and-comedy/uber-driver/'; */
filename fileref '/folders/myfolders/xuberdrivermessagepagetest.txt';
infile fileref length=len;
input record $varying10000. len; 
file "/folders/myfolders/uberdrivermessagepagetest.txt" mod lrecl=10000;
put record;

retain 
 startreplynum
 endreplynum
 lengthreplynum
 replynum
 replynumcheck

 startposter
 endposter
 lengthposter
 poster
 postercheck
 
 startstache
 endstache
 lengthstache
 stache
 stachecheck

 startlocation
 endlocation
 lengthlocation
 location
 locationcheck

 startnumposts
 endnumposts
 lengthnumposts
 numposts
 numpostscheck

 startpostdatetime
 endpostdatetime
 lengthpostdatetime
 postdatetime
 postdatetimecheck

 startsignature
 endsignature
 lengthsignature
 signature
 signaturecheck

 startage
 endage
 lengthage
 age
 agecheck
 ;
*avatarflag=0;

if find(record,'<div class="smalltext">&#171; <strong>Reply #') ne 0 then do;
startreplynum=find(record,'<div class="smalltext">&#171; <strong>Reply #')+45;
endreplynum=find(record,' on:');
lengthreplynum=endreplynum-startreplynum;
replynum=substr(record,startreplynum,lengthreplynum);
replynumcheck=1;
end;

if find(record,'<li class="avatar">') then avatarflag = 1;

if find(record,'title="View the profile of ') ne 0 then do;
startposter=find(record,'">')+2;
endposter=find(record,'</a>');
lengthposter=endposter-startposter;
poster=substr(record,startposter,lengthposter);
postercheck=1;
end;

if find(record,'<li class="postgroup">') ne 0 then do;
startstache=find(record,'li class="postgroup">')+21;
endstache=find(record,'</li>');
lengthstache=endstache-startstache;
stache=substr(record,startstache,lengthstache);
stachecheck=1;
end;

if find(record,'<li class="postcount">Location: ') ne 0 then do;
startlocation=find(record,'Location: ')+10;
endlocation=find(record,'</li>');
lengthlocation=endlocation-startlocation;
location=substr(record,startlocation,lengthlocation);
locationcheck=1;
end;

if find(record,'<li class="postcount">Posts: ') ne 0 then do;
startnumposts=find(record,'Posts: ')+7;
endnumposts=find(record,'</li>');
lengthnumposts=endnumposts-startnumposts;
numposts=substr(record,startnumposts,lengthnumposts);
numpostscheck=1;
end;

if find(record,'<div class="smalltext">&#171; <strong>') ne 0 then do;
startpostdatetime=find(record,'on:</strong> ')+13;
endpostdatetime=find(record,' &#');
lengthpostdatetime=endpostdatetime-startpostdatetime;
postdatetime=substr(record,startpostdatetime,lengthpostdatetime);
postdatetimecheck=1;
/**/
endcheck=1;
output;
call missing(startreplynum);
call missing(endreplynum);
call missing(lengthreplynum);
call missing(replynum);
call missing(replynumcheck);

call missing(startposter);
call missing(endposter);
call missing(lengthposter);
call missing(poster);
call missing(postercheck);

call missing(startstache);
call missing(endstache);
call missing(lengthstache);
call missing(stache);
call missing(stachecheck);

call missing(startlocation);
call missing(endlocation);
call missing(lengthlocation);
call missing(location);
call missing(locationcheck);

call missing(startnumposts);
call missing(endnumposts);
call missing(lengthnumposts);
call missing(numposts);
call missing(numpostscheck);

call missing(startpostdatetime);
call missing(endpostdatetime);
call missing(lengthpostdatetime);
call missing(postdatetime);
call missing(postdatetimecheck);

call missing(startsignature);
call missing(endsignature);
call missing(lengthsignature);
call missing(signature);
call missing(signaturecheck);

call missing(startage);
call missing(endage);
call missing(lengthage);
call missing(age);
call missing(agecheck);

call missing(endcheck);
/**/
end;

if find(record,'<div class="signature" ') ne 0 then do;
startsignature=find(record,'<br /><br />')+12;
endsignature=find(record,'</div>');
lengthsignature=endsignature-startsignature;
signature=substr(record,startsignature,lengthsignature);
signaturecheck=1;
end;

if find(record,'<li class="postcount">Age: ') ne 0 then do;
startage=find(record,'Age: ')+5;
endage=find(record,'</li>');
lengthage=endage-startage;
age=substr(record,startage,lengthage);
agecheck=1;
end;

*if find(record, '<div class="postarea">') then endcheck=1;

run;

Art, CEO, AnalystFinder.com

 

 

View solution in original post


All Replies
PROC Star
Posts: 8,096

Re: Mismatched data

You'd have to post both your code and the html file. More than likely you're simply outputting each record upon capturing the data for the next record.

 

Art, CEO, AnalystFinder.com

 

Occasional Contributor
Posts: 6

Re: Mismatched data

[ Edited ]

Art, thanks for your reply. I've attached the html file.

 

Here is the code for the first program:

 

data MMMdata. uberdrivermessagepagetest; 
filename fileref url 'https://forum.mrmoneymustache.com/antimustachian-wall-of-shame-and-comedy/uber-driver/';
infile fileref length=len;
input record $varying10000. len; 
file "c:\MMM\uberdrivermessagepagetest.txt" mod lrecl=10000;
put record;

*avatarflag=0;

if find(record,'<div class="smalltext">&#171; <strong>Reply #') ne 0 then do;
startreplynum=find(record,'<div class="smalltext">&#171; <strong>Reply #')+45;
endreplynum=find(record,' on:');
lengthreplynum=endreplynum-startreplynum;
replynum=substr(record,startreplynum,lengthreplynum);
replynumcheck=1;
end;

if find(record,'<li class="avatar">') then avatarflag = 1;

if find(record,'title="View the profile of ') ne 0 then do;
startposter=find(record,'">')+2;
endposter=find(record,'</a>');
lengthposter=endposter-startposter;
poster=substr(record,startposter,lengthposter);
postercheck=1;
end;

if find(record,'<li class="postgroup">') ne 0 then do;
startstache=find(record,'li class="postgroup">')+21;
endstache=find(record,'</li>');
lengthstache=endstache-startstache;
stache=substr(record,startstache,lengthstache);
stachecheck=1;
end;

if find(record,'<li class="postcount">Location: ') ne 0 then do;
startlocation=find(record,'Location: ')+10;
endlocation=find(record,'</li>');
lengthlocation=endlocation-startlocation;
location=substr(record,startlocation,lengthlocation);
locationcheck=1;
end;

if find(record,'<li class="postcount">Posts: ') ne 0 then do;
startnumposts=find(record,'Posts: ')+7;
endnumposts=find(record,'</li>');
lengthnumposts=endnumposts-startnumposts;
numposts=substr(record,startnumposts,lengthnumposts);
numpostscheck=1;
end;

if find(record,'<div class="smalltext">&#171; <strong>') ne 0 then do;
startpostdatetime=find(record,'on:</strong> ')+13;
endpostdatetime=find(record,' &#');
lengthpostdatetime=endpostdatetime-startpostdatetime;
postdatetime=substr(record,startpostdatetime,lengthpostdatetime);
postdatetimecheck=1;
end;

if find(record,'<div class="signature" ') ne 0 then do;
startsignature=find(record,'<br /><br />')+12;
endsignature=find(record,'</div>');
lengthsignature=endsignature-startsignature;
signature=substr(record,startsignature,lengthsignature);
signaturecheck=1;
end;

if find(record,'<li class="postcount">Age: ') ne 0 then do;
startage=find(record,'Age: ')+5;
endage=find(record,'</li>');
lengthage=endage-startage;
age=substr(record,startage,lengthage);
agecheck=1;
end;

if find(record, '<div class="postarea">') then endcheck=1;

run;

 

 

And here is the code for the retain program:

 

data MMMdata. uberdriverclean; 
set MMMdata. uberdrivermessagepagetest; 

format replynumtemp best12. avatarflagtemp best12.; 
retain replynumtemp avatarflagtemp postertemp stachetemp locationtemp numpoststemp postdatetimetemp signaturetemp agetemp; 
final=0;

format replynumtemp best12. avatarflagtemp best12.;
retain replynumtemp avatarflagtemp postertemp stachetemp locationtemp numpoststemp postdatetimetemp signaturetemp agetemp; 
final=0;
if replynum ne "" then replynumtemp=replynum; 
if avatarflag ne "" then avatarflagtemp=avatarflag;
if poster ne "" then postertemp=poster;
if stache ne "" then stachetemp=stache;
if location ne "" then locationtemp=location;
if numposts ne "" then numpoststemp=numposts;
if postdatetime ne "" then postdatetimetemp=postdatetime;
if signature ne "" then signaturetemp=signature;
if age ne "" then agetemp=age;

if endcheck ne "" then do;
endchecktemp=endchek;

final=1;
end;

if final=0 then delete;
keep replynumtemp avatarflagtemp postertemp stachetemp locationtemp numpoststemp postdatetimetemp signaturetemp agetemp;

run;

 

Thank you!

PROC Star
Posts: 8,096

Re: Mismatched data

You didn't post the full code for either of the programs.

 

Post the missing lines for each program. I can only presume that they're something like:

data whatever;

  infile whaterver specs;

  input record;

 

Art, CEO, AnalystFinder.com

 

 

Occasional Contributor
Posts: 6

Re: Mismatched data

I just edited the previous post to include the beginning of each program. 

 

Thank you.

PROC Star
Posts: 8,096

Re: Mismatched data

Are you sure you posted exactly what you ran. When I ran your first datastep I got a much smaller file and all of the variables in the SAS dataset had missing values.

 

Art, CEO, AnalystFinder.com

 

Solution
‎04-01-2018 12:05 PM
PROC Star
Posts: 8,096

Re: Mismatched data

I looked at the text file you originally posted. It contains multiples of the same web page.

 

I think you may be looking at the wrong field to determine the end of a post. Try the following (of course you'll have to change the library, filenames and revert back to the url as I analyzed the text file you attached:

data MMMdata.uberdrivermessagepagetest; 
/* filename fileref url 'https://forum.mrmoneymustache.com/antimustachian-wall-of-shame-and-comedy/uber-driver/'; */
filename fileref '/folders/myfolders/xuberdrivermessagepagetest.txt';
infile fileref length=len;
input record $varying10000. len; 
file "/folders/myfolders/uberdrivermessagepagetest.txt" mod lrecl=10000;
put record;

retain 
 startreplynum
 endreplynum
 lengthreplynum
 replynum
 replynumcheck

 startposter
 endposter
 lengthposter
 poster
 postercheck
 
 startstache
 endstache
 lengthstache
 stache
 stachecheck

 startlocation
 endlocation
 lengthlocation
 location
 locationcheck

 startnumposts
 endnumposts
 lengthnumposts
 numposts
 numpostscheck

 startpostdatetime
 endpostdatetime
 lengthpostdatetime
 postdatetime
 postdatetimecheck

 startsignature
 endsignature
 lengthsignature
 signature
 signaturecheck

 startage
 endage
 lengthage
 age
 agecheck
 ;
*avatarflag=0;

if find(record,'<div class="smalltext">&#171; <strong>Reply #') ne 0 then do;
startreplynum=find(record,'<div class="smalltext">&#171; <strong>Reply #')+45;
endreplynum=find(record,' on:');
lengthreplynum=endreplynum-startreplynum;
replynum=substr(record,startreplynum,lengthreplynum);
replynumcheck=1;
end;

if find(record,'<li class="avatar">') then avatarflag = 1;

if find(record,'title="View the profile of ') ne 0 then do;
startposter=find(record,'">')+2;
endposter=find(record,'</a>');
lengthposter=endposter-startposter;
poster=substr(record,startposter,lengthposter);
postercheck=1;
end;

if find(record,'<li class="postgroup">') ne 0 then do;
startstache=find(record,'li class="postgroup">')+21;
endstache=find(record,'</li>');
lengthstache=endstache-startstache;
stache=substr(record,startstache,lengthstache);
stachecheck=1;
end;

if find(record,'<li class="postcount">Location: ') ne 0 then do;
startlocation=find(record,'Location: ')+10;
endlocation=find(record,'</li>');
lengthlocation=endlocation-startlocation;
location=substr(record,startlocation,lengthlocation);
locationcheck=1;
end;

if find(record,'<li class="postcount">Posts: ') ne 0 then do;
startnumposts=find(record,'Posts: ')+7;
endnumposts=find(record,'</li>');
lengthnumposts=endnumposts-startnumposts;
numposts=substr(record,startnumposts,lengthnumposts);
numpostscheck=1;
end;

if find(record,'<div class="smalltext">&#171; <strong>') ne 0 then do;
startpostdatetime=find(record,'on:</strong> ')+13;
endpostdatetime=find(record,' &#');
lengthpostdatetime=endpostdatetime-startpostdatetime;
postdatetime=substr(record,startpostdatetime,lengthpostdatetime);
postdatetimecheck=1;
/**/
endcheck=1;
output;
call missing(startreplynum);
call missing(endreplynum);
call missing(lengthreplynum);
call missing(replynum);
call missing(replynumcheck);

call missing(startposter);
call missing(endposter);
call missing(lengthposter);
call missing(poster);
call missing(postercheck);

call missing(startstache);
call missing(endstache);
call missing(lengthstache);
call missing(stache);
call missing(stachecheck);

call missing(startlocation);
call missing(endlocation);
call missing(lengthlocation);
call missing(location);
call missing(locationcheck);

call missing(startnumposts);
call missing(endnumposts);
call missing(lengthnumposts);
call missing(numposts);
call missing(numpostscheck);

call missing(startpostdatetime);
call missing(endpostdatetime);
call missing(lengthpostdatetime);
call missing(postdatetime);
call missing(postdatetimecheck);

call missing(startsignature);
call missing(endsignature);
call missing(lengthsignature);
call missing(signature);
call missing(signaturecheck);

call missing(startage);
call missing(endage);
call missing(lengthage);
call missing(age);
call missing(agecheck);

call missing(endcheck);
/**/
end;

if find(record,'<div class="signature" ') ne 0 then do;
startsignature=find(record,'<br /><br />')+12;
endsignature=find(record,'</div>');
lengthsignature=endsignature-startsignature;
signature=substr(record,startsignature,lengthsignature);
signaturecheck=1;
end;

if find(record,'<li class="postcount">Age: ') ne 0 then do;
startage=find(record,'Age: ')+5;
endage=find(record,'</li>');
lengthage=endage-startage;
age=substr(record,startage,lengthage);
agecheck=1;
end;

*if find(record, '<div class="postarea">') then endcheck=1;

run;

Art, CEO, AnalystFinder.com

 

 

Occasional Contributor
Posts: 6

Re: Mismatched data

That did it! Thank you so much, Art!--I really appreciate your help!

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 224 views
  • 0 likes
  • 2 in conversation