acemanhattan
Quartz | Level 8

Hey all! I'm pretty much a novice, so anything out of the ordinary requires me to call on the help of others, which is what I'm doing.

 

What I want to do is take a piece of HTML source code like the one below and "clean" it so that all that remains is a list of keywords/hashtags. What I have going for me is that (for all intents and purposes) you can be pretty sure that a "#" is a faithful indicator that a keyword follows. Any help will be greatly appreciated.

 

description available."}},{"node":{"comments_disabled":false,"__typename":"GraphImage","id":"1998217523958621257","edge_media_to_caption":{"edges":[{"node":{"text":"Summer is coming! \u2600\ufe0f \ud83c\udf1e\n#Freude #FreudeCrewMember #FreudeAmKFZ #VWJetta #mk2 #volkswagen #AudiS3 #Audi #VWAudi #VAG #lowered #boost #orange #red #car #tuning #summer #comingsoon #carmeet #seasoniscoming #driving #carsarelife #static @red_jetta_1.8_t"}}]},"shortcode":"Bu7GEBDg6RJ","edge_media_to_comment":{"count":0},"taken_at_timestamp":1552426113,"dimensions":{"height":1080,"width":1080},"display_url":"https://scontent-lax3-1.cdninstagram.com/vp/f5d30fea48bee4d2492d8fa8d28922f2/5D15DD26/t51.2885-15/e3... may contain: car and outdoor"}},{"node":{"comments_disabled":false,"__typename":"GraphImage","id":"1998201676453166113","edge_media_to_caption":{"edges":[{"node":{"text":"Ferrari FF #cars #carporn #carphotography #topcars #coolcars #automotivephography #automotivephographer #autogespot #amazing_cars #instacars #instacars247 #carsarelife #carphoting #supercarsdaily #supercars247 #supercarsdaily700 #carspotting #carspotter"}}]},"shortcode":"Bu7CdZ6lLwh","edge_media_to_comment":{"count":0},"taken_at_timestamp":1552424223,"dimensions":{"height":1080,"width":1080},"display_url":"https://scontent-lax3-1.cdninstagram.com/vp/bd6284b185ea7effbe3e5600d31cebca/5D0A5644/t51.2885-15/e3... may contain: car"}},{"node":{"comments_disabled":false,"__typename":"GraphVideo","id":"1998184571132976705","edge_media_to_caption":{"edges":[{"node":{"text":"Car is still his favourite word but at least he\u2019s throwing some colours in as well now \ud83d\ude02"}}]},"shortcode":"Bu6-kfWDB5B","edge_media_to_comment":{"count":4},"taken_at_timestamp":1552422264,"dimensions":{"height":750,"width":750},"display_url":"https://scontent-lax3-1.cdninstagram.com/vp/59ea755e6d8892326277c6bff273591e/5C8ADCC0/t51.2885-15/e1... 
Adobe Photoshop really worth it for pictures?\ud83e\udd14 #taillighttuesday \ud83d\udcf8 \u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\nCar Mods \ud83d\udd27 \n@corsaperformance | @jltperformance\n@diodedynamics | @pmas.maf \n@luxeautoconcepts | @lund_racing\n@fordperformance\n\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\nSponsors \u00a9\ufe0f\n@carlock.co | @lightingtrendz @ace_auto_detailing \u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\nHashTags #\n#fordperformance #performancepackage #s550 #photography #carcommunity #fordmustanggt #brembo #carsarelife #mustangfanclub #mustangclubest #diodedynamics #mustangfanclub #s197 #mustangnation #fastlife #mustangjuice #mustang_everyday #mustangsociety #mustangsofinstagram #carporn #corsaexhaust #stanggang #mustangaddicts #fordracing #mustang_lifestyle\n\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\nCredits \ud83d\udcf8 @icyy_cali (iPhone XS Max)"}}]},"shortcode":"Bu67LftlqaA","edge_media_to_comment":{"count":4},"taken_at_timestamp":1552420407,"dimensions":{"height":1350,"width":1080},"display_url":"https://scontent-lax3-1.cdninstagram.com/vp/029125e9463a896cee5c1ff2b74cdafb/5D290065/t51.2885-15/e3...

PGStats
Opal | Level 21

Here is one way of doing this. It does require reading the whole string into a character variable:

 

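/* Read each record of the text file into one character variable.
   Only the first 10,000 characters of a record are kept. */
data have;
infile "&sasforum.\datasets\obscure.txt" truncover;
input str $10000.;
run;

data want;
/* Compile the regular expression once; the sum statement retains prxId across iterations. */
if not prxId then prxId + prxParse("/#\w+/o");
set have;
length keyword $20;
start = 1;
stop = length(str);
/* Find the first match, then keep scanning until prxnext reports no further match (pos = 0). */
call prxnext(prxId, start, stop, str, pos, len);
do while (pos > 0);
    keyword = substr(str, pos, len);   /* includes the leading "#" */
    output;
    call prxnext(prxId, start, stop, str, pos, len);
    end;
keep keyword;
run;

proc print data=want; run;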


PG
acemanhattan
Quartz | Level 8

Thanks for the reply.

This looks like it is working as it should, EXCEPT it is only returning 35 results, and I know there to be about 1200.

acemanhattan
Quartz | Level 8

I looked through the HAVE file and note that there are only 35 hashtags that make it in there, even though it captures much of the string before and after the large missing chunk of string.

PGStats
Opal | Level 21

Attached is the text file that I processed... 67 #hashtags were found.

PG
acemanhattan
Quartz | Level 8

I think the problem is that my .txt file includes HTML markup that isn't present in your .txt file (my example you copied was a snippet from the middle of the source code, whereas my .txt file includes all of the code).

 

When I use your .txt file I get the same results as you, and a HAVE file with one cell. When I use my .txt file I get bad results, and a HAVE file with hundreds of observations.

 

I am trying to figure out a way to give you the code I'm working with, but it can't attach as a file ("the contents of the attachment doesn't match its file type") and it's too long to insert as code into the body of this message.

 

Edit: attaching as a word document
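
The symptoms described above (hundreds of observations in HAVE, with most hashtags missing) would be consistent with long records being cut off at the 10,000 characters read by `input str $10000.`, although the thread does not confirm the cause at this point. A minimal sketch for checking record lengths might look like the following; the path reuses the `&sasforum.` macro variable and file name from the earlier reply and should be pointed at your own file, and `reclen` and `maxlen` are illustrative variable names.

```sas
/* Hypothetical diagnostic: count the records and report the longest one. */
data _null_;
    /* LRECL is set high so that long records are not cut off while measuring. */
    infile "&sasforum.\datasets\obscure.txt" lrecl=300000 length=reclen end=last;
    retain maxlen 0;
    input;                           /* load the record without creating variables */
    maxlen = max(maxlen, reclen);    /* reclen is set once INPUT has executed */
    if last then put "Records read: " _n_ " Longest record: " maxlen;
run;
```

If the longest record turns out to be well over 10,000 characters, the text beyond that point never reaches the regular expression in the WANT step.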

PGStats
Opal | Level 21

I put your html code from the docx file back into a text form and read it using this:

 

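data have2;
/* lrecl raises the maximum record length; flowover lets the search continue on the next record;
   blanks and backslashes act as delimiters. */
infile "&sasforum.\datasets\obscure2.txt" lrecl=300000 flowover dlm=" \";
/* Jump to just after the next "#", read one delimited word, and hold the line (@@). */
input @"#" keyword :$20. @@;
/* Cut the word at the first character that is not a letter, digit, or underscore. */
pos = notname(keyWord);
if pos > 0 then keyword = substrn(keyword,1,pos-1);
if not missing(keyword);
drop pos;
run;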

I extracted 784 hashtags into dataset have2.

PG
acemanhattan
Quartz | Level 8

This worked great!


In words (without taking up too much of your time), can you tell me what you did differently to create have2 compared with have?

PGStats
Opal | Level 21

It is completely different. This new version uses the @'value' feature of the INPUT statement to seek the word following each # character, and the trailing @@ holds the line so that the search repeats. Because no record has to be stored whole in a character variable, this overcomes the length limit of character variables (32 KB, while the maximum LRECL is 1 GB). Strings that are read this way are then trimmed of any stray characters using the NOTNAME function, which finds the first character that is not alphanumeric or an underscore.

PG
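
To see the mechanics described above without needing the text file, here is a small sketch that applies the same INPUT logic to two made-up records modeled on the snippet in the question. The dataset name `demo` and the sample records are assumptions for illustration only.

```sas
/* Illustrative only: two made-up records, not the real file. */
data demo;
    infile datalines flowover dlm=' \';
    length keyword $20;
    /* @"#" moves the pointer to just after the next "#";
       the trailing @@ holds the line for the next iteration. */
    input @"#" keyword :$20. @@;
    /* NOTNAME gives the position of the first character that is
       not a letter, digit, or underscore. */
    pos = notname(keyword);
    if pos > 0 then keyword = substrn(keyword, 1, pos - 1);
    if not missing(keyword);
    drop pos;
datalines4;
{"text":"Ferrari FF #cars #carporn"}},
{"text":"Summer \u2600 #Audi\n#VAG @someone"}
;;;;
run;

proc print data=demo; run;
```

The second hashtag on the first record is read together with the trailing `"}},` because only blanks and backslashes are delimiters; the NOTNAME step is what strips that debris. When no further "#" is found on a record, FLOWOVER lets the search continue on the next one.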
acemanhattan
Quartz | Level 8

Okay, got it!


The only surprise I'm finding is that it seems to be case sensitive, so it reads #xxx as different than #XXX, which means that if I do a count on the different values that show up in have2, I'll get 2 observations with counts for #xxx and #XXX, when they're really a single observation for my purposes. Would we just read the string in using some function that applies either capitalization or lowercase to every character, and is there a benefit of choosing one over the other?

PGStats
Opal | Level 21

Just add

 

keyword = lowcase(keyword);

 

or the equivalent with upcase; either works as long as it is applied consistently. Then sort with the NODUPKEY option to eliminate duplicates.

PG
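
Put together, the suggestion above might be sketched as follows. The dataset names `have3` and `unique_keywords` are hypothetical, and the PROC FREQ step is an addition not mentioned in the reply, included only because the question asked about counts per keyword.

```sas
/* Hypothetical follow-up: have3 and unique_keywords are illustrative names. */
data have3;
    set have2;
    keyword = lowcase(keyword);   /* fold case so xxx and XXX compare equal */
run;

/* One row per distinct keyword. */
proc sort data=have3 out=unique_keywords nodupkey;
    by keyword;
run;

/* Occurrences per keyword, most frequent first. */
proc freq data=have3 order=freq;
    tables keyword / nocum nopercent;
run;
```

NODUPKEY keeps a single row per keyword, so it suits a list of distinct hashtags; for counts, the frequency table has to be built from the data before duplicates are removed, as done here.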
