11-10-2014 04:54 PM
I have a CSV file that requires 2 rounds of parsing, and I cannot figure out how to do it:
1) parse out rows that are all in one cell, currently separated by an Alt-Enter, into their own rows (and cells);
2) parse out comma-separated-values in these new rows into their own columns.
Please refer to the attached JPGs of file as a guide to illustrate what I am trying to accomplish. There are 2 worksheets: “Incoming data” shows what I receive (in CSV format), and “Goal” shows where I need to end up (preferably in Excel format):
1. In “Incoming data” worksheet, column E (“Comments”) contains data that is on multiple lines separated by Alt-Enter, but within the same cell. Data in each row within the cell is separated by commas, always in the following order:
2. Column E needs to be parsed out so that each row within cell E2 is in its own cell (and therefore row), and the data in columns A-D and F-I are duplicated to the new rows.
3. Column E then needs to be further parsed out so that each comma separated value is in its own cell, under its new column heading. See worksheet “Goal” in attached file to illustrate my final desired goal.
Whether step 2 comes before 3 or vice versa, I don’t care. As long as the data ends up as per “Goal”, I am happy.
Additional comments on the CSV data file:
1. The CSV file will be ultimately be used in MS Excel 2010 for further processing.
2. The CSV file will be up to 5000 rows long, but the number of columns will stay the same.
3. In the ‘Incoming Data’ Comments column, there could be up to 15 separate rows within one cell, each row separated by an Alt-Enter. This example only shows three.
4. The “Disputed Amount” will undoubtedly be sometimes entered with a $ sign, and sometimes without. I need the macro to work either way.
5. The parsed data as per the “Goal” worksheet should be in Excel format, if possible.
Many thanks to any and all who choose to assist!
11-10-2014 05:19 PM
Oh... I assumed when I entered that info into my SAS Profile, it would be automatically visible to others.
OS: Windows 7 Enterprise, 32 bit
SAS: EG 6.1 M1
The system didn't let me upload CSV or Excel, so I posted some screenshots of my data in my original post.
Also, I no longer see an option to attach files at this point.
11-10-2014 06:39 PM
Use the advanced editor (upper right hand light text of Reply) to attach sample data.
CSV doesn't have "cells", so a sample is required to fully understand what you're talking about.
11-11-2014 07:26 AM
That is because your data contains some line carriage character( i.e. '0D0A'x ).
And better use HEX. check what exactly these special character are .
data have; id=1; comment=catx('0D0A'x,'saas','sdfsd','xcxd','vdfew');output; id=2; comment=catx('0D0A'x,'wwe','ghf','bvd','azx');output; run; data want; set have; do i=1 to countw(comment,'0D0A'x); com=scan(comment,i,'0D0A'x);output; end; drop i comment; run;
Message was edited by: xia keshan
11-11-2014 08:31 AM
Just made a quick check (Win7, Excel 2010).
A spreadsheet with Alt-Enter linebreaks in the cells will, when saved as .csv, have hex 0A as linebreaks in the cells and hex 0D0A as record separator.
Unfortunately, both (0A and 0D0A) are interpreted by SAS to be record separators, so you may need to do some cleaning before reading into SAS, or try to read the file raw:
read into a buffer variable until you detect a 0D0A sequence
split into fields using the delimiter of the csv
then split the multiple-line cell using 0A as delimiter
output for each item in the multiple-line cell
delete everything from the buffer up to and including the 0D0A
11-11-2014 08:56 AM
Not sure if it helps, but I had some code from Support for dealing with 0D0A characters: https://communities.sas.com/thread/60374
11-11-2014 09:03 AM
Forget my previous posting.
infile "....." termstr=CRLF dlm=';' dsd;
and read the fields. Then split the multiple line cell along the 0A's with the scan function.
Caution: Will not work if file is transferred to say, UNIX, as ASCII and the CRLF's are replaced with LF's.
11-20-2014 11:05 AM
I appreciate everyone's input, but it is clear to me that I need more help here.
I don't even know HOW to implement your suggestions. I was expecting answers that sounded something like "Open the Query Builder and choose such and such an option..."
When I read:
infile "....." termstr=CRLF dlm=';' dsd;
...it tells me that I am in over my head. How would I even start "using" this type of code?
I use SAS for its easy visual method of building queries. I don't code anything. I may just try to figure out my problem directly in Excel.
11-20-2014 11:52 AM
If you post sample data someone can help with code. A picture of the data isn't helpful, to provide code someone would have to generate fake data, which in this case is a bit of work, so instead you get ideas of the solution.
My solution would be a standard import and then parse it out with the scan function because it looks like your data follows a specific structure and you know you have a max amount (15) so it would relatively straight forward.
If you choose to do it in Excel/manually look at the Data>Text to Columns feature to separate out the fields.
Good Luck either way
11-20-2014 12:04 PM
Ah, sorry. Just assumed you were a SAS programmer, didn't read the Data Management discussion group. If that is the case, I would recommend passing it to your programmers to sort out. Whilst you could do things in Excel, it definately wouldn't be my recommendation. For instance, how would you validate it? not sure of your position, but if that data is used in any kind of important step and there is problems... Your programmers (I would hope) have processes for this.
11-20-2014 01:55 PM
From your comments, I take it that you have both the Excel xlsx file and a copy saved as a CSV file.
If your site licenses the product that allows you to import Excel files you could just open a program window, paste the code below, and run it. The only two things you would have to change is where I used datafile="c:\art\test1.xlsx" to be the location and name of your excel workbook and, where I indicated sheet="Sheet1"; to indicate the actual name of the worksheet you have:
proc import datafile="c:\art\test1.xlsx" out=thedata replace dbms=excel;
data want (drop=comments i);
do while (scan(comments,i,'0A'x) ne "");