M_Maldonado
Barite | Level 11

All,

I have a data set with HTML embedded in its strings. Can I use some SAS function to translate that HTML into plain text?

Data set example:

data posts;
   length id $5 username $10 post $250;
   infile datalines delimiter=',';
   input id $ username $ post $;
   datalines;
338,"jessykate",<a class=\"mention\" href=\"/users/tim\">@tim<\/a> AHA!! thank you! i will try that :D.
223,"chris",Hi All <\/p>\n\n<p>Quick update on discourse - File uploads now work - Upload away!
1017,"ralfe",Hi <\/p>\n\n<p>I quite like <a class=\"mention\" href=\"/users/ahnjune\">@ahnjune<\/a> 's suggestion. If you go to <a href=\"https://p2pu.org\">https://p2pu.org<\/a>  then click on the drop-down arrow to the right of your profile
;

As a concrete example: after applying some SAS magic, I would like the string in the first observation, "<a class=\"mention\" href=\"/users/tim\">@tim<\/a>", to be translated to "@tim".

I read about the LIBNAME OLEDB engine, but I did not get very far.

Can I borrow some code to quickly fix my data set?

Thanks,

Miguel

1 ACCEPTED SOLUTION

BrunoMueller
SAS Super FREQ

Hi Miguel

You can use the Perl regular expression (PRX) functions to do what you want. For the case you are interested in, you can use the PRXCHANGE function; see the sample below. The HTML tags are replaced by "*". As I am not a regex expert, I search for what I want to do and then adapt the regex to the appropriate SAS PRX... function.

data posts;
  /* post2 is declared explicitly so the result is not cut off at a shorter default length */
  length id $5 username $10 post post2 $250;
  infile datalines delimiter=',';
  input id $ username $ post $;
  /* Replace every HTML tag (anything from "<" up to the next ">") with "*"; -1 means replace all matches */
  post2 = prxchange("s/<[^>]*>/*/", -1, post);
datalines4;
338,"jessykate",<a class=\"mention\" href=\"/users/tim\">@tim<\/a> AHA!! thank you! i will try that :D.
223,"chris",Hi All <\/p>\n\n<p>Quick update on discourse - File uploads now work - Upload away!
1017,"ralfe",Hi <\/p>\n\n<p>I quite like <a class=\"mention\" href=\"/users/ahnjune\">@ahnjune<\/a> 's suggestion. If you go to <a href=\"https://p2pu.org\">https://p2pu.org<\/a>  then click on the drop-down arrow to the right of your profile
;;;;

proc print;
run;
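
If the goal is to drop the tags entirely (so that the first observation starts with "@tim", as asked in the question) rather than mark them with "*", the same pattern can be used with an empty replacement. This is only a minimal sketch: the DATA _NULL_ step, the variable name clean, and the shortened sample string are illustrative assumptions, not part of the original answer.

```sas
data _null_;
  length post clean $250;
  /* Sample value adapted from the first observation in the question */
  post = '<a class=\"mention\" href=\"/users/tim\">@tim<\/a> AHA!! thank you!';
  /* Same pattern as in the sample, but with an empty replacement, so each tag is removed */
  clean = prxchange('s/<[^>]*>//', -1, post);
  /* Write the result to the log for inspection */
  put clean=;
run;
```

With this input, the log should show clean=@tim AHA!! thank you! because both the opening <a ...> tag and the closing <\/a> tag match the pattern.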


4 REPLIES
Cynthia_sas
SAS Super FREQ

Hi:

So you explained what you want for the first row. Can you explain what you would expect to get for the 2nd and 3rd rows, too? I do not believe you really need OLEDB. If all you are doing is extracting the string before the brackets, then you can probably use the SCAN or PRX functions.

338,"jessykate",<a class=\"mention\" href=\"/users/tim\">@tim<\/a> AHA!! thank you! i will try that :D.

223,"chris",Hi All <\/p>\n\n<p>Quick update on discourse - File uploads now work - Upload away!

1017,"ralfe",Hi <\/p>\n\n<p>I quite like <a class=\"mention\" href=\"/users/ahnjune\">@ahnjune<\/a> 's suggestion. If you go to <a href=\"https://p2pu.org\">https://p2pu.org<\/a>  then click on the drop-down arrow to the right of your profile

Would you want @ahnjune for the 3rd row? And what about the 2nd row of data?
cynthia

M_Maldonado
Barite | Level 11

Hi Cynthia,

Ideally, I want to avoid writing a regular expression or string-handling code for each case.

I was wondering if we have some way to translate html text directly. Otherwise I need to come up with all these rules myself. This data set is quite large...

Here are examples of the rules I would need to come up with; there are too many to even try:

String                                                Translates to
<a class=\"mention\" href=\"/users/FOO\">@FOO<\/a>    @FOO
<\/p>\n\n<p>                                          " "
<a href=\"https://FOO.ORG\">https://FOO.org<\/a>      FOO.org

thanks,

M

M_Maldonado
Barite | Level 11

Hi Bruno,

I could not find a PROC that handles HTML gracefully. But the regular expressions were not as bad as I thought.

You get full credit for your regex! It is way better than mine, and it does 99% of the job. I still get leftover strings like \n\n, but they are easy to remove with some SAS code.

Thanks again!

Miguel
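
For the leftover \n\n strings mentioned above, one possible cleanup is sketched here. It assumes the \n sequences are literal backslash-plus-n character pairs (as they appear in the sample data) rather than real line breaks; the variable names and the sample string are illustrative only.

```sas
data _null_;
  length post2 clean $250;
  /* Sample value: the second observation as it might look once the tags are gone */
  post2 = 'Hi All \n\nQuick update on discourse';
  /* In the regex, \\n matches a literal backslash followed by the letter n; each pair becomes a blank */
  clean = prxchange('s/\\n/ /', -1, post2);
  /* COMPBL collapses the resulting runs of blanks into single blanks */
  clean = compbl(clean);
  put clean=;
run;
```

With this input, the log should show clean=Hi All Quick update on discourse. Replacing each pair with a blank, rather than deleting it, keeps words on either side of a removed paragraph break from running together.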
