BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
podarum
Quartz | Level 8

I'm trying to import a huge (156,052 columns) dataset in csv format.  The rows are small (maybe 100 rows).  But the isses here is that once I do my proc import the data splits the column variables into Text and Numerc, when they should all be numeric.  The first 2978 columns keeps the right column names but turns it text, the rest of the columns get named as VAR's but stay numeric...  I checked the initial source in notepad and partially in excel and it seems all fine there.  I don't know what the problem could be.  Thanks.

Here is my code (but I tried many different options too):

PROC IMPORT Datafile = 'C:\ABC\Unit_Monthly.csv'

OUT=DATA.CREA_M DBMS = dlm REPLACE; GuessingROWS=2;

Delimiter = ',';

getnames=yes;

run;

1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

The following code doesn't do the date correction that Tom suggested, but does attempt to correct invalid variable names and assign the actual column headers as variable labels.  Of course, it could be easily modified to deal with dates and character variables.  However, I was more interested in simply expanding the basic method:

%let maxvar=158000 ;

%let fname='C:\art\Unit_Monthly.csv' ;

filename code2run temp;

data tall(keep=row name value) ;

  array _name_ (&maxvar) $32 _temporary_ ;

  array _rawnames_ (&maxvar) $255 _temporary_ ;

  retain row 0 nvar 0 ;

  file code2run;

  infile &fname dsd truncover col=cc length=len lrecl=2000000 end=eof;

  if _n_=1 then do col=1 by 1 until (cc > len);

     length name label $255 ;

     length outlabel $300;

     input label @;

     name=compress(translate(label,'_',' '),,'kn');

     if anydigit(name) eq 1 then name=catt("_",substr(name,1));

     if length(name) gt 32 then name=substr(name,1,32);

     _rawnames_(col)=name;

     _j_=0;

     if col gt 1 then do;

       do _i_=1 to col-1;

         if name eq _rawnames_(_i_) then _j_+1;

       end;

     end;

     if _j_ gt 0 then do;

       if length(name) gt 30 then name=substr(name,1,30);

       name=catt(name,put(_j_,z2.));

     end;

     _name_(col) = name ;

     outlabel=catx(' ','label',name,'=', label,';');

     put outlabel;

  end;

  else do col=1 by 1 until (cc > len);

    input value @;

    name = _name_(col);

    output;

  end;

  nvar = max(nvar,col);

  row+1;

  if eof then putlog 'NOTE: ' row 'observations with ' nvar 'variables.' ;

run;

proc transpose data=tall out=wide (drop=_name_) ;

  by row ;

  id name ;

  var value ;

run;

data want;

  set wide;

  %include code2run;

run;

View solution in original post

23 REPLIES 23
snoopy369
Barite | Level 11

Why do you have guessingrows=2 there?  That's probably the first reason it's not working. 

I'm also unsure how PROC IMPORT will process a 156k column dataset.  I think it defaults to LRECL=32767, which is obviously not long enough for your data.  (Also, 156k columns is a really poor structure for your data, but that's neither here nor there.)  Look at the code it creates.  You can have LRECL higher than 32767, but I don't know that PROC IMPORT Is going to properly read this in. 

Are your variable names in a consistent pattern, or is it all over the place?

podarum
Quartz | Level 8

Constant pattern.. and I have tried it without guessingrows as well.. same problem

snoopy369
Barite | Level 11

You need to post more details about your data, then.  PROC IMPORT likely won't help, so you'll need to construct a specialized import - either what Tom posted, or perhaps something simpler.

For example.

data want;

infile "blah.csv" dlm=',' lrecl=1e8 firstobs=2;

input

a1-a150000

;

run;

might work if you don't mind your variable names being a1,a2,...a150000.  You probably don't want that, but if your pattern is really easy to work out it might be easier to import it this way and then set up some renames.

Tom
Super User Tom
Super User

If you know they are all numeric then why do you even need to use PROC IMPORT?

podarum
Quartz | Level 8

What do you suggest instead ?

Tom
Super User Tom
Super User

Replace FNAME with the fileref or quoted filename of your input file.

%let maxvar=158000 ;

%let fname='C:\ABC\Unit_Monthly.csv' ;

data tall(keep=row name value) ;

  array _name_ (&maxvar) $32 _temporary_ ;

  retain row 0 nvar 0 ;

  infile &fname dsd truncover col=cc length=len lrecl=2000000 end=eof;

  if _n_=1 then do col=1 by 1 until (cc > len);

     length name $32 ;

     input name @;

     _name_(col) = name ;

  end;

  else do col=1 by 1 until (cc > len);

    input value @;

    name = _name_(col);

    output;

  end;

  nvar = max(nvar,col);

  row+1;

  if eof then put 'NOTE: ' row 'observations with ' nvar 'variables.' ;

run;

proc transpose data=tall out=wide (drop=_name_) ;

  by row ;

  id name ;

  var value ;

run;

Tom
Super User Tom
Super User

Simple method is skip the header line using FIRSTOBS=2 and just read the data, creating your own names.

You could copy the names into your program using a text editor.

Or below is program to read the names from the first line and use them to write the INPUT statement to read the file.

filename code temp;

data _null_;

  infile 'myfile.csv' obs=1 dsd truncover lrecl=2000000 ;

  length name $32 ;

  file code lrecl=80 ;

  if _n_=1 then put 'input ' @ ;

  input name @;

  if name = ' ' then do; put / ';' ; stop; end;

  else put name @ ;

run;

data want ;

  infile 'myfile.csv' firstobs=2 dsd truncover lrecl=2000000 ;

%inc code / source2 ;

run;

ballardw
Super User

podarum wrote:

The first 2978 columns keeps the right column names but turns it text, the rest of the columns get named as VAR's

This is most likely occuring because you have column headers that are 1) either exact duplicates or 2) duplicates up to the number of characters that SAS uses for variable names.

Tom's recommendation may generate more specific warnings and notes about problematic variable names. Tom's code may need an addition to replace spaces in the headers with underscores.

Tom
Super User Tom
Super User

Another possibility for that type of symptom might be that the header row has been split into two.  So there would be too few names for the number of populated columns and the row of data would include names (text strings) for the first few columns because it is reading the column names as data.

Either there is an embedded return character in one of the names, or the program that generated the headers wrapped the line when they got too long.  If the former then you might be able to get one of my sample programs to work by adding the INFILE option TERMSTR=CRLF, assuming the file was made on a Windows style machine.

snoopy369
Barite | Level 11

I'd put money on it being that the header is split because of LRECL.

podarum
Quartz | Level 8

What do you mean ?  I'm not failiar with LRECL. What can I do about it ?

snoopy369
Barite | Level 11

LRECL is the logical record length.  Google LRECL SAS and read about it.  Short answer is that if LRECL is too short (less than your actual maximum column) you won't get what you want.

Tom
Super User Tom
Super User

LRECL is Logical RECord Length.  It is old term form IBM mainframe (JCL).  It basically means how many characters can one line of text hold.  Normal default for SAS is 256 characters.  If you have 157,000 names and the names are an average of 8 characters then with delimiters you need over 1.5 million characters to write the names.  So your LRECL value on the INFILE statement would need to be something like that to prevent truncation when SAS tries to read the names.

Similarly if you are using SAS to create this file you would want to use a large value for LRECL on the FILE statement.

podarum
Quartz | Level 8

The code that Tom suggested kinda worked, but it did not take the first column which is the row's name (date).. Eg. the first column is the dates of each row...

Looks something like this (this just a made up sample)

Date           MAR10000001  MAR10000002   MAR10000003 ......   MAR10156052

2001.01            3                         18                         11                         4535

2001.02            7                          2                          15                          895

2001.03            9                          0                           0                               0                  

...

2013.09            34                         1                         56                          5646

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 23 replies
  • 1659 views
  • 3 likes
  • 6 in conversation