BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
PharmlyDoc
Quartz | Level 8

Attached I have a dataset of orders with columns of free text with some containing multiple lines. 

I would like to concatenate the rows, grouping by the ORDER_ID. I looked at this post, https://communities.sas.com/t5/SAS-Programming/Concatenate-multiple-rows-into-a-single-value/td-p/13...  but it's not concatenating how I want. The records are already in order by LINE where applicable.

 

data orders;
infile datalines delimiter=',';
input ORDER_ID NOTE_CSN_ID TEXT $ LINE;
datalines;
1011533,44561,apple,1
1011536,44562,apple,1
1011536,44562,orange,2
1058001,44563,apple,1
1058001,44563,banana,2
1011569,44564,apple,1
1011569,44564,peanut,2
1011569,44564,pear,3
;
run;

 

Have:

ORDER_ID NOTE_CSN_ID TEXT LINE
1011533 44561 apple 1
1011536 44562 apple 1
1011536 44562 orange 2
1058001 44563 apple 1
1058001 44563 banana 2
1011569 44564 apple 1
1011569 44564 peanut 2
1011569 44564 pear 3

 

Want:

ORDER_ID NOTE_CSN_ID TEXT
1011533 44561 apple
1011536 44562 apple, orange
1058001 44563 apple, banana
1011569 44564 apple, peanut, pear

 

Thanks.

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

@PharmlyDoc wrote:

I set the length to 3,500 just in case. 😀

length cat $3500.;

 

I kept getting an error when I didn't specify notsorted. 

Is there a difference between 

by ORDER_ID notsorted NOTE_CSN_ID notsorted LINE ;

versus

by ORDER_ID NOTE_CSN_ID notsorted LINE ;

versus

by ORDER_ID NOTE_CSN_ID LINE notsorted; ?


Of course not. If you didn't sort your example data these are not in sorted order for Order_id:

1058001,44563,banana,2
1011569,44564,apple,1

NOTSORTED always applies to ALL variables. You only need it once and position does not matter.

From the documentation on the BY statement

statements.
Tips The NOTSORTED option can appear anywhere in the BY statement.

And my "solution"

data want;
   set orders;
   by order_id NOTE_CSN_ID  notsorted ;
   length cat $ 2000;
   retain cat;
   if first.NOTE_CSN_ID then cat=text;
   else cat=catx(',',cat,TEXT);  
   if last.NOTE_CSN_ID then output;
run;

If you have a large data set you might try comparing different approaches. I've always felt that the executable SET called multiple times for a single data set for this sort of problem runs slower but isn't obvious with small sets.

 

View solution in original post

9 REPLIES 9
Reeza
Super User

There are multiple different solutions in that linked thread, any of which should work for you. 


Can you please post the code attempted that does not work and explain how it did not work? If those solutions do not work, it's possible there's something in the data that necessitates a different type of solution.

 

EDIT: here's my referenced example on this type of problem https://gist.github.com/statgeek/d583cfa992bf56da51d435165b07e96a

PharmlyDoc
Quartz | Level 8

Perhaps there is something wrong with my data. The dataset that I am applying this approach has "TEXT" column cells containing as many as 2000 characters. 

 

data orders_grouped;
retain ORDER_ID NOTE_CSN_ID TEXT;
length cat $30.;
do until (last.ORDER_ID);
set orders;
by ORDER_ID NOTE_CSN_ID LINE notsorted;
cat=catx(',',cat,TEXT);
end;
drop TEXT LINE;
rename cat=TEXT;
run;

 

Reeza
Super User
length cat $30.;

 

The length of the final variable to is set to be 30 but each row can have as many as 2000 characters? The length is too short and the variable is being truncated to the first 30 characters only. 

 

Do you need LINE unsorted in your BY statement? I doubt it will change things but just something I noticed.

PharmlyDoc
Quartz | Level 8

I set the length to 3,500 just in case. 😀

length cat $3500.;

 

I kept getting an error when I didn't specify notsorted. 

Is there a difference between 

by ORDER_ID notsorted NOTE_CSN_ID notsorted LINE ;

versus

by ORDER_ID NOTE_CSN_ID notsorted LINE ;

versus

by ORDER_ID NOTE_CSN_ID LINE notsorted; ?

ballardw
Super User

@PharmlyDoc wrote:

I set the length to 3,500 just in case. 😀

length cat $3500.;

 

I kept getting an error when I didn't specify notsorted. 

Is there a difference between 

by ORDER_ID notsorted NOTE_CSN_ID notsorted LINE ;

versus

by ORDER_ID NOTE_CSN_ID notsorted LINE ;

versus

by ORDER_ID NOTE_CSN_ID LINE notsorted; ?


Of course not. If you didn't sort your example data these are not in sorted order for Order_id:

1058001,44563,banana,2
1011569,44564,apple,1

NOTSORTED always applies to ALL variables. You only need it once and position does not matter.

From the documentation on the BY statement

statements.
Tips The NOTSORTED option can appear anywhere in the BY statement.

And my "solution"

data want;
   set orders;
   by order_id NOTE_CSN_ID  notsorted ;
   length cat $ 2000;
   retain cat;
   if first.NOTE_CSN_ID then cat=text;
   else cat=catx(',',cat,TEXT);  
   if last.NOTE_CSN_ID then output;
run;

If you have a large data set you might try comparing different approaches. I've always felt that the executable SET called multiple times for a single data set for this sort of problem runs slower but isn't obvious with small sets.

 

PharmlyDoc
Quartz | Level 8

I tried your solution, but just like with my solution, for some reason SAS is combining text from different ORDER_IDs in my real dataset. The TEXT column for the real data contains doctor's notes (free text) for procedure orders, which is extracted from the electronic health record, and is broken into lines of 2000 characters or less.  So a note for a procedure order could consist of multiple lines of 2000 characters each. 

 

Each ORDER_ID has a unique NOTE_CSN_ID. 

In the real data, my first 3 rows, which correspond to a single ORDER_ID and NOTE_CSN_ID, but when I attempt to concatenate the 3 rows, the result pulls in free text from a different ORDER_ID/NOTE_CSN_ID. 

ballardw
Super User

@PharmlyDoc wrote:

I tried your solution, but just like with my solution, for some reason SAS is combining text from different ORDER_IDs in my real dataset. The TEXT column for the real data contains doctor's notes (free text) for procedure orders, which is extracted from the electronic health record, and is broken into lines of 2000 characters or less.  So a note for a procedure order could consist of multiple lines of 2000 characters each. 

 


Which may be very problematic to begin with. SAS variables are limited to 32K characters. So if you have more than 16 'lines' you are going to exceed the variable length.

 

Provide an example of the exact data with the problem you describe, at least the Order_id and Note_csn_id fields along with some generic text. And show the actual code that you ran. Amazingly enough, sometimes when people try to implement code on real data sets with different variables they sometimes miss one of the details that is needed to implement.

PharmlyDoc
Quartz | Level 8

So it looks like your solution may be working after all. 

I am using SAS Enterprise Guide and proc sql to query an Oracle server to produce the orders dataset. However, when I view the results in SASEG (after the query completes) row 2 (LINE 2) of the first ORDER_ID is missing a huge string of text, and is even missing when I export the orders dataset to a csv file and view it in Excel. But after concatenating and producing the "want" dataset, the missing string reappears where it's supposed to be when viewing it in Excel. 

 

If I proc print data=orders (obs=3) ; run; --the string in line 2 appears. 

So now I'm trying to figure out why that string of text is missing when I export the orders dataset to a csv file. 

Patrick
Opal | Level 21

@PharmlyDoc 

Make sure that your libname statement or connect statement sets option dbmaxtext=32767 
https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/acreldb/n1aqglg4ftdj04n1eyvh2l3367ql.htm 

 

Patrick_0-1665276471797.png

Excel does "stuff" to data. If you want to see what's really "in it" use a text editor like Notepad++

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1320 views
  • 4 likes
  • 4 in conversation