Solved: Re: importing makes data into text and numeric

podarum · Posted 11-20-2013 11:46 AM

I'm trying to import a huge (156,052 columns) dataset in csv format. The rows are small (maybe 100 rows). But the isses here is that once I do my proc import the data splits the column variables into Text and Numerc, when they should all be numeric. The first 2978 columns keeps the right column names but turns it text, the rest of the columns get named as VAR's but stay numeric... I checked the initial source in notepad and partially in excel and it seems all fine there. I don't know what the problem could be. Thanks.

Here is my code (but I tried many different options too):

PROC IMPORT Datafile = 'C:\ABC\Unit_Monthly.csv'

OUT=DATA.CREA_M DBMS = dlm REPLACE; GuessingROWS=2;

Delimiter = ',';

getnames=yes;

run;

art297 · Posted 11-20-2013 06:12 PM

The following code doesn't do the date correction that Tom suggested, but does attempt to correct invalid variable names and assign the actual column headers as variable labels. Of course, it could be easily modified to deal with dates and character variables. However, I was more interested in simply expanding the basic method:

%let maxvar=158000 ;

%let fname='C:\art\Unit_Monthly.csv' ;

filename code2run temp;

data tall(keep=row name value) ;

array _name_ (&maxvar) $32 _temporary_ ;

array _rawnames_ (&maxvar) $255 _temporary_ ;

retain row 0 nvar 0 ;

file code2run;

infile &fname dsd truncover col=cc length=len lrecl=2000000 end=eof;

if _n_=1 then do col=1 by 1 until (cc > len);

length name label $255 ;

length outlabel $300;

input label @;

name=compress(translate(label,'_',' '),,'kn');

if anydigit(name) eq 1 then name=catt("_",substr(name,1));

if length(name) gt 32 then name=substr(name,1,32);

_rawnames_(col)=name;

_j_=0;

if col gt 1 then do;

do _i_=1 to col-1;

if name eq _rawnames_(_i_) then _j_+1;

end;

if _j_ gt 0 then do;

if length(name) gt 30 then name=substr(name,1,30);

name=catt(name,put(_j_,z2.));

end;

_name_(col) = name ;

outlabel=catx(' ','label',name,'=', label,';');

put outlabel;

end;

else do col=1 by 1 until (cc > len);

input value @;

name = _name_(col);

output;

end;

nvar = max(nvar,col);

row+1;

if eof then putlog 'NOTE: ' row 'observations with ' nvar 'variables.' ;

run;

proc transpose data=tall out=wide (drop=_name_) ;

by row ;

id name ;

var value ;

run;

data want;

set wide;

%include code2run;

run;

View solution in original post

snoopy369 · Posted 11-20-2013 12:22 PM

Why do you have guessingrows=2 there? That's probably the first reason it's not working.

I'm also unsure how PROC IMPORT will process a 156k column dataset. I think it defaults to LRECL=32767, which is obviously not long enough for your data. (Also, 156k columns is a really poor structure for your data, but that's neither here nor there.) Look at the code it creates. You can have LRECL higher than 32767, but I don't know that PROC IMPORT Is going to properly read this in.

Are your variable names in a consistent pattern, or is it all over the place?

podarum · Posted 11-20-2013 12:56 PM

Constant pattern.. and I have tried it without guessingrows as well.. same problem

snoopy369 · Posted 11-20-2013 01:04 PM

You need to post more details about your data, then. PROC IMPORT likely won't help, so you'll need to construct a specialized import - either what Tom posted, or perhaps something simpler.

For example.

data want;

infile "blah.csv" dlm=',' lrecl=1e8 firstobs=2;

input

a1-a150000

;

run;

might work if you don't mind your variable names being a1,a2,...a150000. You probably don't want that, but if your pattern is really easy to work out it might be easier to import it this way and then set up some renames.

Tom · Posted 11-20-2013 12:39 PM

If you know they are all numeric then why do you even need to use PROC IMPORT?

podarum · Posted 11-20-2013 12:42 PM

What do you suggest instead ?

Tom · Posted 11-20-2013 12:54 PM

Replace FNAME with the fileref or quoted filename of your input file.

%let maxvar=158000 ;

%let fname='C:\ABC\Unit_Monthly.csv' ;

data tall(keep=row name value) ;

array _name_ (&maxvar) $32 _temporary_ ;

retain row 0 nvar 0 ;

infile &fname dsd truncover col=cc length=len lrecl=2000000 end=eof;

if _n_=1 then do col=1 by 1 until (cc > len);

length name $32 ;

input name @;

_name_(col) = name ;

end;

else do col=1 by 1 until (cc > len);

input value @;

name = _name_(col);

output;

end;

nvar = max(nvar,col);

row+1;

if eof then put 'NOTE: ' row 'observations with ' nvar 'variables.' ;

run;

proc transpose data=tall out=wide (drop=_name_) ;

by row ;

id name ;

var value ;

run;

Tom · Posted 11-20-2013 01:03 PM

Simple method is skip the header line using FIRSTOBS=2 and just read the data, creating your own names.

You could copy the names into your program using a text editor.

Or below is program to read the names from the first line and use them to write the INPUT statement to read the file.

filename code temp;

data _null_;

infile 'myfile.csv' obs=1 dsd truncover lrecl=2000000 ;

length name $32 ;

file code lrecl=80 ;

if _n_=1 then put 'input ' @ ;

input name @;

if name = ' ' then do; put / ';' ; stop; end;

else put name @ ;

run;

data want ;

infile 'myfile.csv' firstobs=2 dsd truncover lrecl=2000000 ;

%inc code / source2 ;

run;

ballardw · Posted 11-20-2013 01:28 PM

podarum wrote:

The first 2978 columns keeps the right column names but turns it text, the rest of the columns get named as VAR's

This is most likely occuring because you have column headers that are 1) either exact duplicates or 2) duplicates up to the number of characters that SAS uses for variable names.

Tom's recommendation may generate more specific warnings and notes about problematic variable names. Tom's code may need an addition to replace spaces in the headers with underscores.

Tom · Posted 11-20-2013 01:36 PM

Another possibility for that type of symptom might be that the header row has been split into two. So there would be too few names for the number of populated columns and the row of data would include names (text strings) for the first few columns because it is reading the column names as data.

Either there is an embedded return character in one of the names, or the program that generated the headers wrapped the line when they got too long. If the former then you might be able to get one of my sample programs to work by adding the INFILE option TERMSTR=CRLF, assuming the file was made on a Windows style machine.

snoopy369 · Posted 11-20-2013 01:44 PM

I'd put money on it being that the header is split because of LRECL.

podarum · Posted 11-20-2013 01:46 PM

What do you mean ? I'm not failiar with LRECL. What can I do about it ?

snoopy369 · Posted 11-20-2013 02:04 PM

LRECL is the logical record length. Google LRECL SAS and read about it. Short answer is that if LRECL is too short (less than your actual maximum column) you won't get what you want.

Tom · Posted 11-20-2013 02:05 PM

LRECL is Logical RECord Length. It is old term form IBM mainframe (JCL). It basically means how many characters can one line of text hold. Normal default for SAS is 256 characters. If you have 157,000 names and the names are an average of 8 characters then with delimiters you need over 1.5 million characters to write the names. So your LRECL value on the INFILE statement would need to be something like that to prevent truncation when SAS tries to read the names.

Similarly if you are using SAS to create this file you would want to use a large value for LRECL on the FILE statement.

podarum · Posted 11-20-2013 03:28 PM

The code that Tom suggested kinda worked, but it did not take the first column which is the row's name (date).. Eg. the first column is the dates of each row...

Looks something like this (this just a made up sample)

Date MAR10000001 MAR10000002 MAR10000003 ...... MAR10156052

2001.01 3 18 11 4535

2001.02 7 2 15 895

2001.03 9 0 0 0

...

2013.09 34 1 56 5646

Registration is open

SAS Training: Just a Click Away