BookmarkSubscribeRSS Feed
LBLong21
Calcite | Level 5

Hi, I am trying to import a large xlsx file (more than 7000 columns) into SAS by using two methods: PROC IMPORT AND LIBNAME. And all two of them are fail. 

For the first method: PROC IMPORT, I have an error that most of the columns name changed into VARxxxx and also have missing data. (By the way my columns name is in DATE type in Excel). Here is my code: 

Options validvarname = any;
PROC IMPORT Datafile = " C:\Users\hp\Desktop\data_chung\Book1.xlsx"
DBMS = xlsx out = Test replace; Getnames = YES; RUN;

For the second method: LIBNAME, I have found this method for a several hours ago and still not really know how to use it in correctly. So the error is I can not open the file I have imported even the log show it was  successfully assigned. SAS shown me an message : " Multiple SAS files with this name exist. The table named X.RI will be opened" Then I clicked OK and then the error occured: " Access of sequential tables is not allowed in this viewer". Here is my code 

Libname x xlsx 'C:\Users\hp\Desktop\data_chung\Book1.csv' ;

How can I fix these problems? Are there any solution for me to import a large file like this ? Please help me ! 

11 REPLIES 11
Kurt_Bremser
Super User

A CSV file can't be read with the XLSX engine, CSV files are read with a data step.

Dates are not suitable for variable names, unless you work with

options validvarname=any;

Either way, I suspect that so many columns might overwhelm PROC IMPORT.

Since you seem to have a CSV file, you could create the data step dynamically if there is some logic for the variable names.

After you manage to read the data into SAS, you immediately transpose to a long dataset layout. Dates are data and must not be stored as structure. See Maxim 19.

LBLong21
Calcite | Level 5
Dear Mr. KurtBremser,
Thank you for your advise, I have a mistake with the LIBNAME code. I have fixed and run again with an option but still nothing happened.
However, I have read Maxims of Maximally Efficient SAS Programmers, Maxim 19 and I think it very useful for me as a beginner. I want to know more detail about your solution, I have understood your idea but the data that I have imported not only have problem with the variable name but also the missing data. Hope you show me more.
Kurt_Bremser
Super User

The first thing we need to know: is there any system to those 7K variable names?

If no, a quick example of the header line of your csv is needed, so we can show you how to read that and dynamically create a data step that reads the data itself.

Once that is done, we will also show you how to do the transpose.

SuCheeTay
SAS Employee
CSV files may be saved as an XLSX file and then you can use the LIBNAME XLSX engine. Alternatively, you can use INFILE and INPUT statements in a DATA step to read the CSV file and provide appropriate column names on the INPUT statement.
LBLong21
Calcite | Level 5
Dear Ms.SuCheeTay,
Thank you for your advise. I don't know how to provide appropriate column names on the INPUT statement that you have mention when I use the INFILE statement, can you show me more about it ? Cause the file is more than 7000 columns, I think it impossible for me to provide the variable name manually. Are there any MACRO ways ?
Tom
Super User Tom
Super User

Do you have an XLSX file or a CSV file? Those are totally different file formats and the right code to use depends on which type you have.

LBLong21
Calcite | Level 5
Hi Tom, I have all of them. But the original is XLSX file.
ballardw
Super User

Are ALL of the columns under date headings?

If so, are they at a regular interval, such as one column for each day, or the first day of a week or such?

 

If the columns are regular enough it may be possible to read a csv into a better structured data set. Maybe.

 

LBLong21
Calcite | Level 5

Hi ballardw, Yes all of the columns are under date headings. I am not sure about the regular interval you are mention cause I have not check it yet. But for some first columns, my data is like this:

LBLong21_0-1615731824999.png

 

Tom
Super User Tom
Super User

So you are reading some type of report?  What is the first column that didn't make it into your photograph?

What type of values to the cells that don't have "NA" in them contain?

 

You will have more control over reading a CSV file since it is just text.

Essentially you should read the dates from the first row and remember them.  You can then match them up with the values from the other rows.  Let's assume that you have two id columns (A and B in the spreadsheet) and an unknown number of columns of the actual data.  

So let's call the first two variables COLUMN1 and COLUMN2.  Then the code below will read the first row and collect the date values into a temporary array.  It will also count how many dates it found and use that to know how many values to read from each row.  Then for each row it will read the first two columns and then read each date headed column and convert the text in that column into a number in the variable VALUE.    To save some space I had it ignore the NA values.  So essentially it is creating a sparse matrix.

row col date value column1 column2 
1 1 2021-01-01 100 AA  BB
1 2 2021-01-02  105 AA BB
1 4 2021-01-04  106 AA BB
...
data tall ;
  infile 'myfile.csv' dsd truncover lrecl=1000000 ; 
  array dates [10000] _temporary_ ; 
  length row col date value 8; 
  length dummy $32 column1 column2 $200 ; 
  retain ncol 0; 
  if _n_=1 then do ; 
    input 2*dummy @; 
    do ncol=1 by 1 until(date=.); 
      input date :ddmmyy. @; 
      dates[ncol]=date; 
    end; 
    input; 
  end; 
  row+1; 
  input column1 column2 @ ; 
  do col=1 to ncol; 
    input dummy @; 
    if dummy ne 'NA' then do; 
      date=dates[col]; 
      value = input(dummy,32.); 
      output; 
    end; 
  end; 
  drop dummy; 
  format date yymmdd10.; 
run;

If you wanted to create a report that looked like your original spreadsheet you could use PROC REPORT using DATE as an across variable.

proc report data=tall;
  column row column1 column2  value,date ;
  define row / group noprint;
  define column1 / group;
  define column2 / group;
  define date / across ' ' ;
  define value / ' ';
run;

If you must read the XLSX file then read it without names. Then the values in the first row can be your column identifiers.  The presence of the NA values is probably going to mess up the conversion of the date values however.  SAS will be forced to make the column as character and the dates (if they really are dates in Excel) will be represented as strings of digits that represent the number Excel uses to store that date. So numbers in the 40K range. To convert them to dates first convert them to numbers and then adjust by the difference in how Excel and SAS start counting days.

date = input(string,32.) + '30DEC1899'd ;
format date yymmdd10.;
ballardw
Super User

@LBLong21 wrote:

Hi ballardw, Yes all of the columns are under date headings. I am not sure about the regular interval you are mention cause I have not check it yet. But for some first columns, my data is like this:

LBLong21_0-1615731824999.png

 


29/12/1989 is a Friday, 01/01/1990 is a Monday. So it appears that you might have work days, Monday through Friday. The question becomes, do you have holidays?

And are the columns supposed to hold a numeric value and NA is just a "missing" result, or is there actual meaningful text? And if text, how long are the allowable values.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Autotuning Deep Learning Models Using SAS

Follow along as SAS’ Robert Blanchard explains three aspects of autotuning in a deep learning context: globalized search, localized search and an in parallel method using SAS.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 11 replies
  • 3244 views
  • 4 likes
  • 5 in conversation