All,
I have a data set with HTML embedded in strings. Is there a SAS function I can use to translate that HTML into plain text?
Data set example:
data posts;
length id $5 username $10 post $250;
infile datalines delimiter=',';
input id $ username $ post $;
datalines;
338,"jessykate",<a class=\"mention\" href=\"/users/tim\">@tim<\/a> AHA!! thank you! i will try that :D.
223,"chris",Hi All <\/p>\n\n<p>Quick update on discourse - File uploads now work - Upload away!
1017,"ralfe",Hi <\/p>\n\n<p>I quite like <a class=\"mention\" href=\"/users/ahnjune\">@ahnjune<\/a> 's suggestion. If you go to <a href=\"https://p2pu.org\">https://p2pu.org<\/a> then click on the drop-down arrow to the right of your profile
;
A concrete example: after applying some SAS magic, I would like the string in the first observation, "<a class=\"mention\" href=\"/users/tim\">@tim<\/a>", to be translated to "@tim".
I read about LIBNAME OLEDB, but I did not get very far.
Can I borrow some code to quickly fix my data set?
Thanks,
Miguel
Hi Miguel
You can use the Perl regular expression (PRX) functions to do what you want. For the case you are interested in, the PRXCHANGE function will work. See sample below. The HTML tags are replaced by "*". As I am not a regex expert, I search for a regex that does what I want and then adapt it to the appropriate SAS PRX function.
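The sample referred to in this reply is not preserved in the thread. As a stand-in, here is a minimal sketch of the approach it describes, assuming the POSTS data set from the question. The pattern and the names POSTS_CLEAN and POST_CLEAN are illustrative assumptions, not the original code.

```sas
/* Sketch: replace every HTML tag in POST with "*" using PRXCHANGE. */
/* POSTS_CLEAN and POST_CLEAN are illustrative names.               */
data posts_clean;
  set posts;
  length post_clean $250;
  /* The pattern matches "<", any run of characters other than ">", */
  /* and the closing ">". A count of -1 repeats the substitution    */
  /* until no further match is found.                               */
  post_clean = prxchange('s/<[^>]*>/*/', -1, post);
run;

proc print data=posts_clean;
  var id post_clean;
run;
```

With this pattern, the start of the first observation should come out as "*@tim* AHA!!". Using an empty replacement instead of "*" removes the tags entirely.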
Hi:
You explained what you want for the first row. Can you also explain what you would expect to get for the 2nd and 3rd rows? I do not believe you really need OLEDB. If all you are doing is extracting the string BEFORE the brackets, then you can probably use the SCAN or PRX functions.
338,"jessykate",<a class=\"mention\" href=\"/users/tim\">@tim<\/a> AHA!! thank you! i will try that :D.
223,"chris",Hi All <\/p>\n\n<p>Quick update on discourse - File uploads now work - Upload away!
1017,"ralfe",Hi <\/p>\n\n<p>I quite like <a class=\"mention\" href=\"/users/ahnjune\">@ahnjune<\/a> 's suggestion. If you go to <a href=\"https://p2pu.org\">https://p2pu.org<\/a> then click on the drop-down arrow to the right of your profile
Would you want @ahnjune for the 3rd row? And what about the 2nd row of data?
cynthia
Hi Cynthia,
Ideally I want to avoid writing a regular expression or string-handling code for each case.
I was wondering if there is some way to translate HTML text directly. Otherwise I need to come up with all these rules myself, and this data set is quite large.
Examples of the rules I would need to come up with (there are too many to even try):
String | Translates to |
---|---|
<a class=\"mention\" href=\"/users/FOO\">@FOO<\/a> | @FOO |
<\/p>\n\n<p> | " " |
<a href=\"https://FOO.ORG\">https://FOO.org<\/a> | FOO.org |
thanks,
M
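For reference, each of the three rules in the table can be written as a single PRXCHANGE call. This is only a sketch under assumptions: it uses the POSTS data set from the first post, the names POSTS_RULES and POST_TXT are illustrative, and the patterns cover just the three cases listed.

```sas
/* Sketch only: one PRXCHANGE call per rule in the table.           */
/* POSTS_RULES and POST_TXT are illustrative names.                 */
data posts_rules;
  set posts;
  length post_txt $250;
  /* Rule 1: mention link becomes @FOO. In the pattern, \\" matches */
  /* the literal backslash-quote pair stored in the data.           */
  post_txt = prxchange('s/<a class=\\"mention\\"[^>]*>(@[^<]*)<\\\/a>/$1/', -1, post);
  /* Rule 2: the escaped paragraph break becomes one blank.         */
  post_txt = prxchange('s/<\\\/p>\\n\\n<p>/ /', -1, post_txt);
  /* Rule 3: plain link becomes its link text without the scheme.   */
  post_txt = prxchange('s/<a href=[^>]*>https?:\/\/([^<]*)<\\\/a>/$1/', -1, post_txt);
run;
```

The capture group ($1) keeps the visible link text and discards the surrounding tag, so each rule generalizes over FOO without a separate pattern per user or site.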
Hi Bruno,
I could not find a proc that handles HTML gracefully, but the regular expressions were not as bad as I thought.
You get full credit for your regex! It is way better than mine, and it does 99% of the job. I still get leftover strings like \n\n, but they are easy to remove with some SAS code.
Thanks again!
Miguel
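The cleanup code itself is not shown in the thread. One possible sketch, assuming a generic tag-stripping pattern (not necessarily the regex referred to above) and the illustrative names POSTS_FINAL and POST_CLEAN:

```sas
/* Sketch: strip tags, then remove the leftover literal \n pairs.   */
/* POSTS_FINAL and POST_CLEAN are illustrative names.               */
data posts_final;
  set posts;
  length post_clean $250;
  /* Drop every tag (empty replacement).                            */
  post_clean = prxchange('s/<[^>]*>//', -1, post);
  /* SAS string literals have no escape sequences, so '\n' is the   */
  /* two characters backslash and n, exactly as stored in the data. */
  post_clean = tranwrd(post_clean, '\n', ' ');
  /* Collapse runs of blanks and remove leading blanks.             */
  post_clean = strip(compbl(post_clean));
run;
```

For the second observation this should turn "Hi All <\/p>\n\n<p>Quick update" into "Hi All Quick update".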