Hi everyone, I have scraped an article from a website. The website's paragraphs were written as a series of <p> tags, so the dataset I created has one paragraph per line.
I want to create a variable of approximately 5000 words, in a column called article, that combines the various paragraphs, but I am unable to figure out how to read multiple dataset lines into one single column.
The example dataset looks something like this:
1)Alphabet's Google has reached licensing deals with over 600 news outlets around the world and is seeing a "huge increase" in users requesting more content from specific publications as part of a new programme, it said on Wednesday
2)The update comes as big Internet service providers including Facebook have been locked in bitter disputes over fair compensation to publishers
Expected result
para = Alphabet's Google has reached licensing deals with over 600 news outlets around the world and is seeing a "huge increase" in users requesting more content from specific publications as part of a new programme, it said on Wednesday. The update comes as big Internet service providers including Facebook have been locked in bitter disputes over fair compensation to publishers.
data paragraphs;
input story & $800.;
datalines;
Alphabet's Google has reached licensing deals with over 600 news outlets around the world and is seeing a "huge increase" in users requesting more content from specific publications as part of a new programme, it said on Wednesday
The update comes as big Internet service providers including Facebook have been locked in bitter disputes over fair compensation to publishers.
;
Here's one way. There are certainly other solutions that are crafty and innovative, but I think this one is reasonably intuitive: transpose the paragraphs into a single row (COL1, COL2, ...), then join the columns with CATX. If you're dealing with a lot of <p> tags, I'd imagine you're going to need to give the paragraphs some grouping ID or sequence ID.
data paragraphs;
input story & $800.;
datalines;
Alphabet's Google has reached licensing deals with over 600 news outlets around the world and is seeing a "huge increase" in users requesting more content from specific publications as part of a new programme, it said on Wednesday
The update comes as big Internet service providers including Facebook have been locked in bitter disputes over fair compensation to publishers.
;
proc transpose data = paragraphs out = paragraphs_t;
var story;
run;
data paragraphs_want;
length want_catx $800; /* Long enough for this example. Without a LENGTH statement, a new variable assigned from CATX defaults to $200, which won't fit your needs. */
set paragraphs_t;
want_catx = catx(". ", col1, col2);
run;
Obs    want_catx
1      Alphabet's Google has reached licensing deals with over 600 news outlets around the world and is seeing a "huge increase" in users requesting more content from specific publications as part of a new programme, it said on Wednesday. The update comes as big Internet service providers including Facebook have been locked in bitter disputes over fair compensation to publishers.
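With more than a couple of paragraphs, naming col1, col2, and so on by hand gets unwieldy. Below is a minimal sketch of one alternative, not the only way to do it: keep a running string with RETAIN and append each paragraph as it is read. The article_id variable, the dataset names, and the shortened sample text are all assumptions made up for illustration; they stand in for the grouping ID mentioned above. The sketch joins with a single space and assumes each paragraph already ends with its own punctuation.

```sas
/* Hypothetical input: one paragraph per row, tagged with an article ID. */
data paragraphs_by_article;
    infile datalines dlm='|' truncover;
    input article_id story :$800.;
    datalines;
1|Alphabet's Google has reached licensing deals with over 600 news outlets.
1|The update comes as big Internet service providers have been locked in disputes.
2|A second, unrelated article starts here.
;

/* BY-group processing below needs the rows ordered by article_id. */
proc sort data=paragraphs_by_article;
    by article_id;
run;

data articles (keep=article_id article);
    length article $32767;  /* 32767 is the maximum length of a SAS character variable */
    retain article;         /* carry the running text from row to row */
    set paragraphs_by_article;
    by article_id;
    if first.article_id then article = '';  /* start a fresh article */
    article = catx(' ', article, story);    /* CATX ignores blank arguments, so there is no leading space */
    if last.article_id then output;         /* write one row per article */
run;
```

This produces one row per article_id with all of its paragraphs in article, in their original order. About 5000 words may come close to the 32767-character limit, so it is worth checking the log for CATX warnings about the result not fitting.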