- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Attached I have a dataset of orders with columns of free text with some containing multiple lines.
I would like to concatenate the rows, grouping by the ORDER_ID. I looked at this post, https://communities.sas.com/t5/SAS-Programming/Concatenate-multiple-rows-into-a-single-value/td-p/13... but it's not concatenating how I want. The records are already in order by LINE where applicable.
data orders;
infile datalines delimiter=',';
input ORDER_ID NOTE_CSN_ID TEXT $ LINE;
datalines;
1011533,44561,apple,1
1011536,44562,apple,1
1011536,44562,orange,2
1058001,44563,apple,1
1058001,44563,banana,2
1011569,44564,apple,1
1011569,44564,peanut,2
1011569,44564,pear,3
;
run;
Have:
ORDER_ID | NOTE_CSN_ID | TEXT | LINE |
---|---|---|---|
1011533 | 44561 | apple | 1 |
1011536 | 44562 | apple | 1 |
1011536 | 44562 | orange | 2 |
1058001 | 44563 | apple | 1 |
1058001 | 44563 | banana | 2 |
1011569 | 44564 | apple | 1 |
1011569 | 44564 | peanut | 2 |
1011569 | 44564 | pear | 3 |
Want:
ORDER_ID | NOTE_CSN_ID | TEXT |
1011533 | 44561 | apple |
1011536 | 44562 | apple, orange |
1058001 | 44563 | apple, banana |
1011569 | 44564 | apple, peanut, pear |
Thanks.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@PharmlyDoc wrote:
I set the length to 3,500 just in case. 😀
length cat $3500.;
I kept getting an error when I didn't specify notsorted.
Is there a difference between
by ORDER_ID notsorted NOTE_CSN_ID notsorted LINE ;
versus
by ORDER_ID NOTE_CSN_ID notsorted LINE ;
versus
by ORDER_ID NOTE_CSN_ID LINE notsorted; ?
Of course not. If you didn't sort your example data these are not in sorted order for Order_id:
1058001,44563,banana,2 1011569,44564,apple,1
NOTSORTED always applies to ALL variables. You only need it once and position does not matter.
From the documentation on the BY statement
statements. Tips The NOTSORTED option can appear anywhere in the BY statement.
And my "solution"
data want; set orders; by order_id NOTE_CSN_ID notsorted ; length cat $ 2000; retain cat; if first.NOTE_CSN_ID then cat=text; else cat=catx(',',cat,TEXT); if last.NOTE_CSN_ID then output; run;
If you have a large data set you might try comparing different approaches. I've always felt that the executable SET called multiple times for a single data set for this sort of problem runs slower but isn't obvious with small sets.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
There are multiple different solutions in that linked thread, any of which should work for you.
Can you please post the code attempted that does not work and explain how it did not work? If those solutions do not work, it's possible there's something in the data that necessitates a different type of solution.
EDIT: here's my referenced example on this type of problem https://gist.github.com/statgeek/d583cfa992bf56da51d435165b07e96a
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Perhaps there is something wrong with my data. The dataset that I am applying this approach has "TEXT" column cells containing as many as 2000 characters.
data orders_grouped;
retain ORDER_ID NOTE_CSN_ID TEXT;
length cat $30.;
do until (last.ORDER_ID);
set orders;
by ORDER_ID NOTE_CSN_ID LINE notsorted;
cat=catx(',',cat,TEXT);
end;
drop TEXT LINE;
rename cat=TEXT;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
length cat $30.;
The length of the final variable to is set to be 30 but each row can have as many as 2000 characters? The length is too short and the variable is being truncated to the first 30 characters only.
Do you need LINE unsorted in your BY statement? I doubt it will change things but just something I noticed.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I set the length to 3,500 just in case. 😀
length cat $3500.;
I kept getting an error when I didn't specify notsorted.
Is there a difference between
by ORDER_ID notsorted NOTE_CSN_ID notsorted LINE ;
versus
by ORDER_ID NOTE_CSN_ID notsorted LINE ;
versus
by ORDER_ID NOTE_CSN_ID LINE notsorted; ?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@PharmlyDoc wrote:
I set the length to 3,500 just in case. 😀
length cat $3500.;
I kept getting an error when I didn't specify notsorted.
Is there a difference between
by ORDER_ID notsorted NOTE_CSN_ID notsorted LINE ;
versus
by ORDER_ID NOTE_CSN_ID notsorted LINE ;
versus
by ORDER_ID NOTE_CSN_ID LINE notsorted; ?
Of course not. If you didn't sort your example data these are not in sorted order for Order_id:
1058001,44563,banana,2 1011569,44564,apple,1
NOTSORTED always applies to ALL variables. You only need it once and position does not matter.
From the documentation on the BY statement
statements. Tips The NOTSORTED option can appear anywhere in the BY statement.
And my "solution"
data want; set orders; by order_id NOTE_CSN_ID notsorted ; length cat $ 2000; retain cat; if first.NOTE_CSN_ID then cat=text; else cat=catx(',',cat,TEXT); if last.NOTE_CSN_ID then output; run;
If you have a large data set you might try comparing different approaches. I've always felt that the executable SET called multiple times for a single data set for this sort of problem runs slower but isn't obvious with small sets.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I tried your solution, but just like with my solution, for some reason SAS is combining text from different ORDER_IDs in my real dataset. The TEXT column for the real data contains doctor's notes (free text) for procedure orders, which is extracted from the electronic health record, and is broken into lines of 2000 characters or less. So a note for a procedure order could consist of multiple lines of 2000 characters each.
Each ORDER_ID has a unique NOTE_CSN_ID.
In the real data, my first 3 rows, which correspond to a single ORDER_ID and NOTE_CSN_ID, but when I attempt to concatenate the 3 rows, the result pulls in free text from a different ORDER_ID/NOTE_CSN_ID.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@PharmlyDoc wrote:
I tried your solution, but just like with my solution, for some reason SAS is combining text from different ORDER_IDs in my real dataset. The TEXT column for the real data contains doctor's notes (free text) for procedure orders, which is extracted from the electronic health record, and is broken into lines of 2000 characters or less. So a note for a procedure order could consist of multiple lines of 2000 characters each.
Which may be very problematic to begin with. SAS variables are limited to 32K characters. So if you have more than 16 'lines' you are going to exceed the variable length.
Provide an example of the exact data with the problem you describe, at least the Order_id and Note_csn_id fields along with some generic text. And show the actual code that you ran. Amazingly enough, sometimes when people try to implement code on real data sets with different variables they sometimes miss one of the details that is needed to implement.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
So it looks like your solution may be working after all.
I am using SAS Enterprise Guide and proc sql to query an Oracle server to produce the orders dataset. However, when I view the results in SASEG (after the query completes) row 2 (LINE 2) of the first ORDER_ID is missing a huge string of text, and is even missing when I export the orders dataset to a csv file and view it in Excel. But after concatenating and producing the "want" dataset, the missing string reappears where it's supposed to be when viewing it in Excel.
If I proc print data=orders (obs=3) ; run; --the string in line 2 appears.
So now I'm trying to figure out why that string of text is missing when I export the orders dataset to a csv file.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Make sure that your libname statement or connect statement sets option dbmaxtext=32767
https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/acreldb/n1aqglg4ftdj04n1eyvh2l3367ql.htm
Excel does "stuff" to data. If you want to see what's really "in it" use a text editor like Notepad++