Hey all! I'm pretty much a novice, so anything out of the ordinary requires me to call on the help of others, which is what I'm doing.
What I want to do is to take a piece of html source code like I have below, and "clean" it such that all that remains is a list of keywords/hashtags. What I have going for me is that (for all intents and purposes) you can be pretty sure that the "#" acts as a faithful indicator that there's a keyword to follow. Any help will be greatly appreciated.
description available."}},{"node":{"comments_disabled":false,"__typename":"GraphImage","id":"1998217523958621257","edge_media_to_caption":{"edges":[{"node":{"text":"Summer is coming! \u2600\ufe0f \ud83c\udf1e\n#Freude #FreudeCrewMember #FreudeAmKFZ #VWJetta #mk2 #volkswagen #AudiS3 #Audi #VWAudi #VAG #lowered #boost #orange #red #car #tuning #summer #comingsoon #carmeet #seasoniscoming #driving #carsarelife #static @red_jetta_1.8_t"}}]},"shortcode":"Bu7GEBDg6RJ","edge_media_to_comment":{"count":0},"taken_at_timestamp":1552426113,"dimensions":{"height":1080,"width":1080},"display_url":"https://scontent-lax3-1.cdninstagram.com/vp/f5d30fea48bee4d2492d8fa8d28922f2/5D15DD26/t51.2885-15/e3... may contain: car and outdoor"}},{"node":{"comments_disabled":false,"__typename":"GraphImage","id":"1998201676453166113","edge_media_to_caption":{"edges":[{"node":{"text":"Ferrari FF #cars #carporn #carphotography #topcars #coolcars #automotivephography #automotivephographer #autogespot #amazing_cars #instacars #instacars247 #carsarelife #carphoting #supercarsdaily #supercars247 #supercarsdaily700 #carspotting #carspotter"}}]},"shortcode":"Bu7CdZ6lLwh","edge_media_to_comment":{"count":0},"taken_at_timestamp":1552424223,"dimensions":{"height":1080,"width":1080},"display_url":"https://scontent-lax3-1.cdninstagram.com/vp/bd6284b185ea7effbe3e5600d31cebca/5D0A5644/t51.2885-15/e3... may contain: car"}},{"node":{"comments_disabled":false,"__typename":"GraphVideo","id":"1998184571132976705","edge_media_to_caption":{"edges":[{"node":{"text":"Car is still his favourite word but at least he\u2019s throwing some colours in as well now \ud83d\ude02"}}]},"shortcode":"Bu6-kfWDB5B","edge_media_to_comment":{"count":4},"taken_at_timestamp":1552422264,"dimensions":{"height":750,"width":750},"display_url":"https://scontent-lax3-1.cdninstagram.com/vp/59ea755e6d8892326277c6bff273591e/5C8ADCC0/t51.2885-15/e1... Adobe Photoshop really worth it for pictures?\ud83e\udd14 #taillighttuesday \ud83d\udcf8 \u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\nCar Mods \ud83d\udd27 \n@corsaperformance | @jltperformance\n@diodedynamics | @pmas.maf \n@luxeautoconcepts | @lund_racing\n@fordperformance\n\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\nSponsors \u00a9\ufe0f\n@carlock.co | @lightingtrendz @ace_auto_detailing \u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\u25ab\ufe0f\nHashTags #\n#fordperformance #performancepackage #s550 #photography #carcommunity #fordmustanggt #brembo #carsarelife #mustangfanclub #mustangclubest #diodedynamics #mustangfanclub #s197 #mustangnation #fastlife #mustangjuice #mustang_everyday #mustangsociety #mustangsofinstagram #carporn #corsaexhaust #stanggang #mustangaddicts #fordracing #mustang_lifestyle\n\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\u25aa\ufe0f\nCredits \ud83d\udcf8 @icyy_cali (iPhone XS Max)"}}]},"shortcode":"Bu67LftlqaA","edge_media_to_comment":{"count":4},"taken_at_timestamp":1552420407,"dimensions":{"height":1350,"width":1080},"display_url":"https://scontent-lax3-1.cdninstagram.com/vp/029125e9463a896cee5c1ff2b74cdafb/5D290065/t51.2885-15/e3...
I put your html code from the docx file back into a text form and read it using this:
data have2;
infile "&sasforum.\datasets\obscure2.txt" lrecl=300000 flowover dlm=" \";
input @"#" keyword :$20. @@;
pos = notname(keyWord);
if pos > 0 then keyword = substrn(keyword,1,pos-1);
if not missing(keyword);
drop pos;
run;
I extracted 784 hashtags into dataset have2.
Here is one way of doing this. It does require reading the whole string into a character variable:
data have;
infile "&sasforum.\datasets\obscure.txt" truncover;
input str $10000.;
run;
data want;
if not prxId then prxId + prxParse("/#\w+/o");
set have;
length keyword $20;
start = 1;
stop = length(str);
call prxnext(prxId, start, stop, str, pos, len);
do while (pos > 0);
keyword = substr(str, pos, len);
output;
call prxnext(prxId, start, stop, str, pos, len);
end;
keep keyword;
run;
proc print data=want; run;
Thanks for the reply.
This looks like it is working as it should, EXCEPT it is only returning 35 results, and I know there to be about 1200.
I looked through the HAVE file and note that there are only 35 hashtags that make it in there, even though it captures much of the string before and after the large missing chunk of string.
Attached is the text file that I processed... 67 #hashtags were found.
I think the problem is that my .txt file includes HTML markup that isn't present in your .txt file (my example you copied was a snippet from the middle of the source code, whereas my .txt file includes all of the code).
When i use your .txt file I get the same results as you, and a HAVE file with one cell. When I use my .txt file I get bad results, and a HAVE file with hundreds of observations.
I am trying to figure out a way to give you the code I'm working with, but it can't attach as a file ("the contents of the attachment doesn't match its file type") and it's too long to insert as code into the body of this message.
Edit: attaching as a word document
I put your html code from the docx file back into a text form and read it using this:
data have2;
infile "&sasforum.\datasets\obscure2.txt" lrecl=300000 flowover dlm=" \";
input @"#" keyword :$20. @@;
pos = notname(keyWord);
if pos > 0 then keyword = substrn(keyword,1,pos-1);
if not missing(keyword);
drop pos;
run;
I extracted 784 hashtags into dataset have2.
This worked great!
In words (without wasting too much of your time) can you tell me what you did differently to create have2 than you did to create have?
It is completely different. This new version uses the @'value' feature of the input statement to seek words following # characters repeatedly. This overcomes the length limit of character constants (32Kb, while max lrecl is 1Gb). Strings that are read this way are then trimmed of any stray characters using function notname which finds the first char that is not alphanum or underscore.
Okay, got it!
The only surprise I'm finding is that it seems to be case sensitive, so it reads #xxx as different than #XXX, which means that if I do a count on the different values that show up in have2, I'll get 2 observations with counts for #xxx and #XXX, when they're really a single observation for my purposes. Would we just read the string in using some function that applies either capitalization or lowercase to every character, and is there a benefit of choosing one over the other?
Just add
keyword = lowcase(keyword);
or whatever. Then sort with NODUPKEY option to eliminate duplicates.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.