Re: Thousands of multiple columns into one single column

yeaforme · Posted 03-18-2013 08:23 PM

Hello Everyone,

Having a few issues with moving data I collected in Excel into SAS.

The first problem is that the data in Excel is on a massive scale (about 66,000 observations across 10 variables where each variable captures a non-chronologically sorted time series. The way I collected the data, I set up each variable into a folder of spreadsheets [so 10 folders] where each folder contains 66 workbooks [1,000 observations per each workbook]). This isn't too big of a problem because once I find a good way to get a single workbook into SAS I can easily just create a for loop and automate the rest.

The second (and biggest) problem is the way the data is structured. As a reference, the data looks like the following:

ID	Amount
140011	451
140012	167	153	34	N/A	43	193
14002C	N/A
140031	142	45

And the format I need it to be in is the following:

ID	Amount
140011	451
140012	167
140012	153
140012	34
140012	N/A
140012	43
140012	193
14002C	N/A
140031	142
140031	45

A few things to note:

1 - In the first table, a blank means the time series ends.

2 - In the first table, N/A means that the data point for that specific point in time is missing, however, I need to keep it intact so that when I combine it with the other 9 variables I can get the time series' lined up correctly (and the variable may not be missing for the other 9 variables). There is an exception to this though. If the first column is N/A (for example, 14002C) then it might mean that there is no information at all (across any of the variables), or it could also mean that only the information for this specific variable is missing. In other words, if an ID just has no data in the database then the Amount column will show N/A in only that column. However, if an ID has only one piece of data but that data is missing for this variable, then the Amount column will also show N/A in only that column.

3 - The scale is massive - some of the time series have upwards of 12,000 data points (i.e., data in 12,000 Excel columns), while others have none. I've tried importing directly from Excel but had little luck as SAS seems to limit Excel imports to the first 255 columns, which is an obvious issue.

4 - There are only two column headings in the original data - one for the ID and another for the first column of the non-sorted time series data. Any more data in the time series shows up in un-labeled, subsequent columns.

5 - Some of the 10 variables are numeric (amounts), some are characters (names), and same are dates - so any code needs to be easily changeable or robust enough to accommodate different variable types.

6 - The relative order of the data needs to be retained when it moves from one row to one column as it needs to line up correctly with 9 other variables (only one of which is the date)

Any code-heads with thoughts or solutions? I've attached an example of the data (it is in .csv format, though, not the native .xlsx as the site wouldn't let me upload a .xlsx file) in the event it might help.

PGStats · Posted 03-18-2013 10:35 PM

IMHO your best bet is to transform your xlsx files into csv and to read them with something like :

data test(keep=ds_code pos amount);

infile "&sasforum.\datasets\orig58.csv" dsd missover firstobs=2 lrecl=156000;

length amountc ds_code $12;

input ds_code $ @;

do pos = 1 to 12000 until(missing(amountc));

input amountc $ @;

if not missing(amountc) then do;

if anyalpha(amountc) then amount = .;

else amount = input(amountc, 12.);

output;

end;

run;

PG

yeaforme · Posted 03-19-2013 04:17 PM

Correct me if I am wrong, but this code only moves the data into SAS, it does not reformat it, right?

PGStats · Posted 03-19-2013 05:15 PM

It does reformat the data into three columns: ds_code, pos (optional), and amount. - PG

PG

yeaforme · Posted 03-20-2013 01:10 AM

Thanks PG - the code works like a charm!

What's the best way to augment it to read in 66 csv files (named orig1 through orig 66) for each of the 10 variables?

yeaforme · Posted 03-20-2013 06:10 PM

So I came up with the following based upon PG's code. I'm not nearly the code-wizard as many are here, so if someone could look it over and tell me if there are issues, I'd appreciate it.

%Let varName = Prc;

%Let csvPath = C:\Data\Prc;

filename myCSV "&csvPath\*.csv";

data &varName(keep=DS_Code Pos &varName);

infile myCSV dsd missover firstobs=2 lrecl=156000;

length amountc DS_Code $12;

input DS_Code $ @;

do Pos = 1 to 20000 until(missing(amountc));

input amountc $ @;

if not missing(amountc) then do;

if anyalpha(amountc) then &varName = .;

else &varName = input(amountc, 12.);

output;

end;

run;

PGStats · Posted 03-20-2013 08:04 PM

It should work, except for the firstobs option which would only skip the first line from the first file. Use the eov option instead, like this:

%Let varName = Prc;

%Let csvPath = C:\Data\Prc;

filename myCSV "&csvPath\*.csv";

data &varName(keep=DS_Code Pos &varName);

infile myCSV dsd missover eov=skip lrecl=156000;

length amountc DS_Code $12;

input DS_Code $ @;

if skip then skip=0;

else do;

do Pos = 1 to 20000 until(missing(amountc));

input amountc $ @;

if not missing(amountc) then do;

if anyalpha(amountc) then &varName = .;

else &varName = input(amountc, 12.);

output;

end;

run;

Good luck!

PG

yeaforme · Posted 03-20-2013 08:31 PM

Should I include the firstobs for that first file? If I eliminate it altogether I get one header read in once...

PGStats · Posted 03-20-2013 09:26 PM

Sorry about that. Put it back then. - PG

PG

yeaforme · Posted 03-20-2013 11:32 PM

Trying to change it to handle dates. This is the best I could come up with, but the format isn't working... Any thoughts?

data &varName(keep=DS_Code Pos &varName);

infile myCSV dsd missover firstobs=2 eov=skip lrecl=156000;

length amountc DS_Code $12;

input DS_Code $ @;

if skip then skip=0;

else do;

do Pos = 1 to 20000 until(missing(amountc));

input amountc $ @;

if not missing(amountc) then do;

&varName = input(amountc, mmddyy10.);

output;

end;

run;

yeaforme · Posted 03-23-2013 12:01 AM

Hmmm...hit another snag. Sometimes the data series has blanks in it prior to it terminating - making the do pos = 1 to 12000 until(missing(amountc)); line terminate prematurely. Anyway around this?

PGStats · Posted 03-23-2013 08:45 AM

Remove the UNTIL() clause. It will take longer to execute but will scan the entire line for data. - PG

PG

Peter_C · Posted 03-23-2013 01:51 PM

yeaforme

if the run-time is important to you alternative ways exist to recognise the "end-of-data" on a line.

The INFILE statement has options LENGTH= and COLUMN= which will give you access to the values SAS maintains for current line length and the pointer for position on the current line where the next INPUT will read data ( you can redirect it with INPUT @nnn, but not in your situation here).

The options LENGTH= and COLUMN= assign variable names to these data values.

Adapting the prototype

infile myCSV dsd missover firstobs=2 eov=skip lrecl=156000 length = _len column = _col ;

and the DO statement would become something like

do until( _col=1 or _col >= _len );

I'm never quite sure what position will be set for _COL when you have read the final column. So I suggest testing that the next position has been reset to the beginning of the next line as well as beyond the end of the current line.

If you want to know from which physical position on the CSV record the data has been read, you could use _COL, or its value before the INPUT is executed, in the loop like

do until( _col=1 or _col >= _len );

pos = _col ;

input amountc $ @;

if not missing(amountc) then do;

&varName = input(amountc, mmddyy12. );

output;

end;

If a relative column number (like excel) is more useful than the CSV file text position, assign the POS in the DO statement instead of its own statement, like

do pos=1 to _len until( _col=1 or _col >= _len );

hope you'll give it a go.

peterC

SAS Innovate 2025: Call for Content

Classroom Training Available!