BookmarkSubscribeRSS Feed
yeaforme
Calcite | Level 5

Hello Everyone,

Having a few issues with moving data I collected in Excel into SAS.

The first problem is that the data in Excel is on a massive scale (about 66,000 observations across 10 variables where each variable captures a non-chronologically sorted time series.  The way I collected the data, I set up each variable into a folder of spreadsheets [so 10 folders] where each folder contains 66 workbooks [1,000 observations per each workbook]).  This isn't too big of a problem because once I find a good way to get a single workbook into SAS I can easily just create a for loop and automate the rest.

The second (and biggest) problem is the way the data is structured.  As a reference, the data looks like the following:

IDAmount
140011451
14001216715334N/A43193
14002CN/A
14003114245

And the format I need it to be in is the following:

IDAmount
140011451
140012167
140012153
14001234
140012N/A
14001243
140012193
14002CN/A
140031142
14003145

A few things to note:

1 - In the first table, a blank means the time series ends.

2 - In the first table, N/A means that the data point for that specific point in time is missing, however, I need to keep it intact so that when I combine it with the other 9 variables I can get the time series' lined up correctly (and the variable may not be missing for the other 9 variables).  There is an exception to this though.  If the first column is N/A (for example, 14002C) then it might mean that there is no information at all (across any of the variables), or it could also mean that only the information for this specific variable is missing.  In other words, if an ID just has no data in the database then the Amount column will show N/A in only that column.  However, if an ID has only one piece of data but that data is missing for this variable, then the Amount column will also show N/A in only that column.

3 - The scale is massive - some of the time series have upwards of 12,000 data points (i.e., data in 12,000 Excel columns), while others have none.  I've tried importing directly from Excel but had little luck as SAS seems to limit Excel imports to the first 255 columns, which is an obvious issue.

4 - There are only two column headings in the original data - one for the ID and another for the first column of the non-sorted time series data.  Any more data in the time series shows up in un-labeled, subsequent columns.

5 - Some of the 10 variables are numeric (amounts), some are characters (names), and same are dates - so any code needs to be easily changeable or robust enough to accommodate different variable types.

6 - The relative order of the data needs to be retained when it moves from one row to one column as it needs to line up correctly with 9 other variables (only one of which is the date)

Any code-heads with thoughts or solutions?  I've attached an example of the data (it is in .csv format, though, not the native .xlsx as the site wouldn't let me upload a .xlsx file) in the event it might help.

12 REPLIES 12
PGStats
Opal | Level 21

IMHO your best bet is to transform your xlsx files into csv and to read them with something like :

data test(keep=ds_code pos amount);

infile "&sasforum.\datasets\orig58.csv" dsd missover firstobs=2 lrecl=156000;

length amountc ds_code $12;

input ds_code $ @;

do pos = 1 to 12000 until(missing(amountc));

     input amountc $ @;

     if not missing(amountc) then do;

          if anyalpha(amountc) then amount = .;

               else amount = input(amountc, 12.);

          output;

          end;

     end;

run;

PG

PG
yeaforme
Calcite | Level 5

Correct me if I am wrong, but this code only moves the data into SAS, it does not reformat it, right?

PGStats
Opal | Level 21

It does reformat the data into three columns: ds_code, pos (optional), and amount. - PG

PG
yeaforme
Calcite | Level 5

Thanks PG - the code works like a charm!

What's the best way to augment it to read in 66 csv files (named orig1 through orig 66) for each of the 10 variables?

yeaforme
Calcite | Level 5

So I came up with the following based upon PG's code.  I'm not nearly the code-wizard as many are here, so if someone could look it over and tell me if there are issues, I'd appreciate it.

%Let varName = Prc;

%Let csvPath = C:\Data\Prc;

filename myCSV "&csvPath\*.csv";

data &varName(keep=DS_Code Pos &varName);

    infile myCSV dsd missover firstobs=2 lrecl=156000;

    length amountc DS_Code $12;

    input DS_Code $ @;

    do Pos = 1 to 20000 until(missing(amountc));

        input amountc $ @;

        if not missing(amountc) then do;

            if anyalpha(amountc) then &varName = .;

                 else &varName = input(amountc, 12.);

                  output;

             end;

         end;

run;

PGStats
Opal | Level 21

It should work, except for the firstobs option which would only skip the first line from the first file. Use the eov option instead, like this:

%Let varName = Prc;

%Let csvPath = C:\Data\Prc;

filename myCSV "&csvPath\*.csv";

data &varName(keep=DS_Code Pos &varName);

    infile myCSV dsd missover eov=skip lrecl=156000;

    length amountc DS_Code $12;

    input DS_Code $ @;

if skip then skip=0;

else do;

  do Pos = 1 to 20000 until(missing(amountc));

   input amountc $ @;

   if not missing(amountc) then do;

    if anyalpha(amountc) then &varName = .;

      else &varName = input(amountc, 12.);

       output;

     end;

    end;

  end;

run;

Good luck!

PG

PG
yeaforme
Calcite | Level 5

Should I include the firstobs for that first file?  If I eliminate it altogether I get one header read in once...

PGStats
Opal | Level 21

Sorry about that. Put it back then. - PG

PG
yeaforme
Calcite | Level 5

Trying to change it to handle dates.  This is the best I could come up with, but the format isn't working...  Any thoughts?

data &varName(keep=DS_Code Pos &varName);

    infile myCSV dsd missover firstobs=2 eov=skip lrecl=156000;

    length amountc DS_Code $12;

    input DS_Code $ @;

if skip then skip=0;

else do;

  do Pos = 1 to 20000 until(missing(amountc));

   input amountc $ @;

   if not missing(amountc) then do;

    &varName = input(amountc, mmddyy10.);

       output;

     end;

    end;

  end;

run;

yeaforme
Calcite | Level 5

Hmmm...hit another snag.  Sometimes the data series has blanks in it prior to it terminating - making the do pos = 1 to 12000 until(missing(amountc)); line terminate prematurely.  Anyway around this?

PGStats
Opal | Level 21

Remove the UNTIL() clause. It will take longer to execute but will scan the entire line for data. - PG

PG
Peter_C
Rhodochrosite | Level 12

yeaforme

if the run-time is important to you alternative ways exist to recognise the "end-of-data" on a line.

The INFILE statement has options LENGTH= and COLUMN= which will give you access to the values SAS maintains for current line length and the pointer for position on the current line where the next INPUT will read data ( you can redirect it with INPUT @nnn,  but not in your situation here).

The options LENGTH= and COLUMN= assign variable names to these data values.

Adapting the prototype

    infile myCSV dsd missover firstobs=2 eov=skip lrecl=156000 length = _len column = _col ;

and the DO statement would become something like

    do  until( _col=1 or _col >= _len );

I'm never quite sure what position will be set for _COL when you have read  the final column. So I suggest testing that the next position has been reset to the beginning of the next line as well as beyond the end of the current line.

If you want to know from which physical position on the CSV record the data has been read, you could use _COL, or its value before the INPUT is executed, in the loop like


   do  until( _col=1 or _col >= _len );

     pos = _col ;

      input amountc $ @;

      if not missing(amountc) then do;

         &varName = input(amountc, mmddyy12. );

         output;

      end;

   end;

If a relative column number (like excel) is more useful than the CSV file text position, assign the POS in the DO statement instead of its own statement, like

   do pos=1 to _len  until( _col=1 or _col >= _len );

hope you'll give it a go.

peterC

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 12 replies
  • 3545 views
  • 1 like
  • 3 in conversation