kaziumair
Quartz | Level 8

Hi everyone, I have scraped an article from a website. The website had its paragraphs written in a series of <p> tags, so the dataset I created has one paragraph per line.

I want to create a single variable, in a column called article, that holds the whole article (approximately 5000 words, made up of several paragraphs), but I can't figure out how to combine multiple dataset lines into one value in a single column.

 

The example dataset looks something like this:
1)Alphabet's Google has reached licensing deals with over 600 news outlets around the world and is seeing a "huge increase" in users requesting more content from specific publications as part of a new programme, it said on Wednesday

2)The update comes as big Internet service providers including Facebook have been locked in bitter disputes over fair compensation to publishers

 

Expected result

para =  Alphabet's Google has reached licensing deals with over 600 news outlets around the world and is seeing a "huge increase" in users requesting more content from specific publications as part of a new programme, it said on Wednesday. The update comes as big Internet service providers including Facebook have been locked in bitter disputes over fair compensation to publishers.

data paragraphs;
input story & $800.;
datalines;
Alphabet's Google has reached licensing deals with over 600 news outlets around the world and is seeing a "huge increase" in users requesting more content from specific publications as part of a new programme, it said on Wednesday
The update comes as big Internet service providers including Facebook have been locked in bitter disputes over fair compensation to publishers.
;

 

1 ACCEPTED SOLUTION

maguiremq
SAS Super FREQ

Here's one way to do it. There are certainly other, craftier solutions, but I think this one is reasonably intuitive. If you're dealing with a lot of <p> tags, I'd imagine you'll need to give the paragraphs a grouping ID or sequence ID.

 

data paragraphs;
input story & $800.;
datalines;
Alphabet's Google has reached licensing deals with over 600 news outlets around the world and is seeing a "huge increase" in users requesting more content from specific publications as part of a new programme, it said on Wednesday
The update comes as big Internet service providers including Facebook have been locked in bitter disputes over fair compensation to publishers.
;

proc transpose data = paragraphs out = paragraphs_t;
	var story;
run;

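data paragraphs_want;
	/* Set the length explicitly: CATX defaults to 200 when the target has no length yet. */
	/* $32767 is the maximum for a character variable; $800 could truncate two $800 paragraphs. */
	length want_catx $32767;
	set paragraphs_t;
	/* "of col:" joins every COLn variable that PROC TRANSPOSE created, not just col1 and col2. */
	want_catx = catx(". ", of col:);
run;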
Obs want_catx 
1 Alphabet's Google has reached licensing deals with over 600 news outlets around the world and is seeing a "huge increase" in users requesting more content from specific publications as part of a new programme, it said on Wednesday. The update comes as big Internet service providers including Facebook have been locked in bitter disputes over fair compensation to publishers. 
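
If you do end up with a grouping ID, a single DATA step with RETAIN and BY-group processing is another option that skips the transpose. This is only a sketch under some assumptions: article_id is a hypothetical variable that is not in the original post, the input is already sorted by it, the first two paragraphs are shortened versions of the sample text, and the second article is made up purely for illustration.

```sas
/* Hypothetical input: one paragraph per observation plus an assumed article_id. */
data paragraphs;
	infile datalines dlm='|' truncover;
	input article_id story :$800.;
	datalines;
1|Alphabet's Google has reached licensing deals with over 600 news outlets
1|The update comes as big Internet service providers have been locked in disputes
2|A made-up second article starts here
2|And continues in its next paragraph
;

data article_want;
	/* $32767 is the maximum length of a SAS character variable. */
	length article $32767;
	/* Keep the running text across observations instead of resetting it. */
	retain article;
	set paragraphs;
	by article_id;              /* assumes the data is sorted by article_id */
	/* Start with an empty string at the first paragraph of each article. */
	if first.article_id then article = ' ';
	/* CATX skips blank arguments, so no separator is added before the first paragraph. */
	article = catx('. ', article, story);
	/* Write one observation per article, after its last paragraph. */
	if last.article_id then output;
	keep article_id article;
run;

proc print data = article_want;
run;
```

Two caveats: a paragraph that already ends in a period will get a doubled period from the ". " separator unless you strip it first, and an article of roughly 5000 words can come close to the 32767-character limit, so check for truncation.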

 


2 REPLIES

kaziumair
Quartz | Level 8
Thank you for your help
