Solved: Re: unable to get all variables data from a .csv in SAS

saraphdnz · Posted 04-10-2022 11:03 PM

Hello there,

I am trying to import a CSV file with 5000 rows and 8000 columns in SAS and the problem is that the SAS data file is showing data only for the first 3062 variables. The rest of the columns are empty and the variables' names changed to var3064 var3065,...

the data set has characters in the first three columns and then the rest of the 8k columns contain daily data with dates as variable names.

Please advise what should I add or change to get all the variables data with their variable names in SAS.

Regards, Sara

PROC IMPORT OUT= WORK.TEST 
            DATAFILE= "C:\Users\Return_Daily.csv" 
            DBMS=CSV REPLACE;
     	    guessingrows=max; 
			delimiter=',';
			GETNAMES=YES;
RUN;

NOTE: WORK.TEST data set was successfully created.
NOTE: The data set WORK.TEST has 4472 observations and 8137 variables.
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           5:40.47
      cpu time            5:37.70

Tom · Posted 04-12-2022 10:42 AM

That link does not go to an EXCEL file. Which is good because this discussion was about reading a CSV file, which is text file format that is totally independent of EXCEL. EXCEL has a binary format for the files it creates and reading that in SAS requires a completely different set of code.

To tell what is in the file just LOOK at it as text, Do NOT open CSV files in EXCEL (at least do not let it automatically open the file) as it will make assumptions and change the data.

You can just look at the file using SAS code if you want.

75    data _null_;
76      infile "c:\downloads\Return_Daily.csv" obs=2 lrecl=1000000 ;
77      input;
78      list;
79    run;

NOTE: A byte-order mark in the file "c:\downloads\Return_Daily.csv" (for fileref "#LN00058") indicates that the data is encoded in
      "utf-8".  This encoding will be used to process the file.
NOTE: The infile "c:\downloads\Return_Daily.csv" is:
      Filename=c:\downloads\Return_Daily.csv,
      RECFM=V,LRECL=4000000,
      File Size (bytes)=77178111,
      Last Modified=12Apr2022:10:25:43,
      Create Time=12Apr2022:10:25:17

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1         ID,1/01/2000,2/01/2000,3/01/2000,4/01/2000,5/01/2000,6/01/2000,7/01/2000,8/01/2000,9/01/2000,10/01/2
     101  000,11/01/2000,12/01/2000,13/01/2000,14/01/2000,15/01/2000,16/01/2000,17/01/2000,18/01/2000,19/01/20
     201  00,20/01/2000,21/01/2000,22/01/2000,23/01/2000,24/01/2000,25/01/2000,26/01/2000,27/01/2000,28/01/200
     301  0,29/01/2000,30/01/2000,31/01/2000,1/02/2000,2/02/2000,3/02/2000,4/02/2000,5/02/2000,6/02/2000,7/02/
...

So it does look like the dates is represented in the file in DMY order (there is no month 17) so the DDMMYY informat is the proper one to read those strings. Make sure to save the value into the array after reading it and retrieve the value for the current column before writing an observation.

Make the ARRAY large enough to hold the dates for all of the columns in your file. Your example file has 8132 variables. So make the array have at least that many elements.

88    data names;
89      infile "c:\downloads\Return_Daily.csv" dsd obs=1 lrecl=1000000 ;
90      input name :$32. @@;
91    run;

NOTE: The data set WORK.NAMES has 8132 observations and 1 variables.

And the values in some of the cells later in the file, line 53 for example have commas in them. So use the COMMA informat when reading the values.

data want;
  length id $20 col date value 8 ;
  format date yymmdd10.;
  infile "c:\downloads\Return_Daily.csv"  DSD truncover lrecl=1000000 ;
  array dates [10000] _temporary_;
  input id @;
  if _n_=1 then do;
     do ncol=1 by 1 until(missing(date));
      input date :ddmmyy. @ ;
      dates[ncol]=date;
     end;
     ncol=ncol-1;
     retain ncol;
  end;
  else do col=1 to ncol;
    input value :comma. @;
    date = dates[col];
    output;
  end;
run;

NOTE: 4473 records were read from the infile "c:\downloads\Return_Daily.csv".
      The minimum record length was 8137.
      The maximum record length was 87035.
NOTE: The data set WORK.WANT has 36361832 observations and 5 variables.

Note using either DMY or MDY order to display date strings will confuse half of your audience. Use DATE or YYMMDD format instead.

proc print data=want(obs=10) ;
  where not missing(value);
run;

Results

    Obs      id      col          date    value    ncol

       1    FE0001      1    2000-01-01     9.44    8131
       2    FE0001      2    2000-01-02     9.44    8131
       3    FE0001      3    2000-01-03     9.56    8131
       4    FE0001      4    2000-01-04     9.60    8131
       5    FE0001      5    2000-01-05     9.64    8131
       6    FE0001      6    2000-01-06     9.65    8131
       7    FE0001      7    2000-01-07     9.84    8131
       8    FE0001      8    2000-01-08     9.84    8131
       9    FE0001      9    2000-01-09     9.84    8131
      10    FE0001     10    2000-01-10     9.92    8131

You can probably drop the NCOL variable as it has a constant value.

You might not need to COL variable either.

The maximum length of your ID values in THIS file is only 6. So if that is representative of all future files of this type then you might be able to define the ID variable as shorter than the 20 byte length this code used.

View solution in original post

Patrick · Posted 04-11-2022 02:32 AM

Cause and resolution likely similar to the discussion here.

Kurt_Bremser · Posted 04-11-2022 04:15 AM

Do NOT use PROC IMPORT for CSV files. During its evaluation phase, it is limited by the usual limits of SAS language elements, in particular the 32K limit for character variables. If the list of variable names exceeds this, additional variables will not be read.

have the data source transpose the data (which you have to do anyway as soon as you have the data in SAS) before sending it to you.
write your own code that creates the DATA step by reading the variable names piece by piece from the first line of the CSV, and PUTting the code to a file which you later include with %INCLUDE. Thankfully, there are no limits for the size of a statement.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ksharp · Posted 04-11-2022 07:54 AM

Could provide a sample CSV file ?

saraphdnz · Posted 04-12-2022 07:23 AM

here is the link to access my sample CSV file

https://drive.google.com/file/d/1T_Xo9bVHDuRq0lUNtVUwxwD6juRfJitX/view?usp=sharing

Regards, Sara

Kurt_Bremser · Posted 04-12-2022 08:47 AM

@saraphdnz wrote:

here is the link to access my sample CSV file

https://drive.google.com/file/d/1T_Xo9bVHDuRq0lUNtVUwxwD6juRfJitX/view?usp=sharing

Regards, Sara

I uploaded this file to my SAS On Demand, and then ran this code:

%let infile=~/Return_Daily.csv;

data varnames;
infile "&infile." dlm="," obs=1 truncover;
length n name $32;
do until (n = "");
  input n @;
  if n = "ID"
  then name = n;
  else name = "_" !! translate(n,"_","/");
  output;
end;
run;

filename my_step "~/manyvars.sas";

data _null_;
set varnames end=done;
file my_step;
if _n_ = 1
then put "
  data wide;
  infile '&infile.' dlm=',' truncover dsd firstobs=2;
  input
";
if name = "ID"
then put "ID $";
else put name;
if done then put "
  ;
  run;
";
run;

%include my_step;

proc transpose data=wide out=long (rename=(col1=return));
by id;
var _:;
run;

data want;
retain id date return; /* for nice column order only */
set long;
date = input(translate(substr(_name_,2),"/","_"),ddmmyy10.);
format date yymmdd10.;
drop _name_;
run;

The first step reads the variable names (mostly dates) from the first line, translates them to valid SAS names, and writes them to a dataset.

The second step uses this file to create a program file which will read the data from the file, then this program file is included, running the DATA step contained in it.

Next I transpose the wide to a long dataset, and finally the column names are converted to dates.

The resulting dataset has 13697736 observations with variables ID, date and return.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

saraphdnz · Posted 04-13-2022 02:46 AM

Hello Kurt,

Thanks for the macro; it worked for my CSV file.

Much appreciated.

Regards,

Sara

Tom · Posted 04-11-2022 09:44 AM

DATES do not make good variable names. DATE values are NUMERIC. Variable names are CHARACTER.

You probably will want to write your own data step to read that file. Then you can read the dates into variables instead.

So let's assume you have one ID variable and then an unknown number of numeric variables with DATE values as the column header. So read the dates from the first row and then attach them to the values read from the other rows. For example something like this

data want;
  length id $20 col date value 8 ;
  format date yymmdd10.;
  infile "C:\Users\Return_Daily.csv" dsd truncover lrecl=1000000 ;
  array dates [1000] _temporary_;
  input id @;
  if _n_=1 then do;
     do ncol=1 by 1 until(missing(date));
      input date :date9. @ ;
     end;
     ncol=ncol-1;
      retain ncol;
  end;
  else do col=1 to ncol;
      input value @;
      output;
   end;
run;

saraphdnz · Posted 04-12-2022 07:21 AM

Hello Tom,

Thanks for the code.

I use the code and can see data in SAS now but somehow the date column is blank. Please can you check my excel file via below link and advise.

https://drive.google.com/file/d/1T_Xo9bVHDuRq0lUNtVUwxwD6juRfJitX/view?usp=sharing

Regards, Sara

data want;
  length id $20 col date value 8 ;
  format date DDMMYY10.;
  infile "C:\Users\Return_Daily.csv" DLM=',' DSD truncover lrecl=1000000 ;
  array dates [1000] _temporary_;
  input id @;
  if _n_=1 then do;
     do ncol=1 by 1 until(missing(date));
      input date :DDMMYY10. @ ; /*when I use date9. WORK.WANT has 0 observations and 5 variables*/
     end;
     ncol=ncol-1;
      retain ncol;
  end;
  else do col=1 to ncol;
      input value @;
      output;
   end;
run;

NOTE: 4473 records were read from the infile "C:\Users\Return_Daily.csv".
The minimum record length was 8141.
The maximum record length was 87038.
NOTE: The data set WORK.WANT has 36361832 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 57.30 seconds
cpu time 41.76 seconds

@Tom wrote:

DATES do not make good variable names. DATE values are NUMERIC. Variable names are CHARACTER.

You probably will want to write your own data step to read that file. Then you can read the dates into variables instead.

So let's assume you have one ID variable and then an unknown number of numeric variables with DATE values as the column header. So read the dates from the first row and then attach them to the values read from the other rows. For example something like this
data want;
  length id $20 col date value 8 ;
  format date yymmdd10.;
  infile "C:\Users\Return_Daily.csv" dsd truncover lrecl=1000000 ;
  array dates [1000] _temporary_;
  input id @;
  if _n_=1 then do;
     do ncol=1 by 1 until(missing(date));
      input date :date9. @ ;
     end;
     ncol=ncol-1;
      retain ncol;
  end;
  else do col=1 to ncol;
      input value @;
      output;
   end;
run;

Tom · Posted 04-12-2022 10:42 AM

That link does not go to an EXCEL file. Which is good because this discussion was about reading a CSV file, which is text file format that is totally independent of EXCEL. EXCEL has a binary format for the files it creates and reading that in SAS requires a completely different set of code.

To tell what is in the file just LOOK at it as text, Do NOT open CSV files in EXCEL (at least do not let it automatically open the file) as it will make assumptions and change the data.

You can just look at the file using SAS code if you want.

75    data _null_;
76      infile "c:\downloads\Return_Daily.csv" obs=2 lrecl=1000000 ;
77      input;
78      list;
79    run;

NOTE: A byte-order mark in the file "c:\downloads\Return_Daily.csv" (for fileref "#LN00058") indicates that the data is encoded in
      "utf-8".  This encoding will be used to process the file.
NOTE: The infile "c:\downloads\Return_Daily.csv" is:
      Filename=c:\downloads\Return_Daily.csv,
      RECFM=V,LRECL=4000000,
      File Size (bytes)=77178111,
      Last Modified=12Apr2022:10:25:43,
      Create Time=12Apr2022:10:25:17

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1         ID,1/01/2000,2/01/2000,3/01/2000,4/01/2000,5/01/2000,6/01/2000,7/01/2000,8/01/2000,9/01/2000,10/01/2
     101  000,11/01/2000,12/01/2000,13/01/2000,14/01/2000,15/01/2000,16/01/2000,17/01/2000,18/01/2000,19/01/20
     201  00,20/01/2000,21/01/2000,22/01/2000,23/01/2000,24/01/2000,25/01/2000,26/01/2000,27/01/2000,28/01/200
     301  0,29/01/2000,30/01/2000,31/01/2000,1/02/2000,2/02/2000,3/02/2000,4/02/2000,5/02/2000,6/02/2000,7/02/
...

So it does look like the dates is represented in the file in DMY order (there is no month 17) so the DDMMYY informat is the proper one to read those strings. Make sure to save the value into the array after reading it and retrieve the value for the current column before writing an observation.

Make the ARRAY large enough to hold the dates for all of the columns in your file. Your example file has 8132 variables. So make the array have at least that many elements.

88    data names;
89      infile "c:\downloads\Return_Daily.csv" dsd obs=1 lrecl=1000000 ;
90      input name :$32. @@;
91    run;

NOTE: The data set WORK.NAMES has 8132 observations and 1 variables.

And the values in some of the cells later in the file, line 53 for example have commas in them. So use the COMMA informat when reading the values.

data want;
  length id $20 col date value 8 ;
  format date yymmdd10.;
  infile "c:\downloads\Return_Daily.csv"  DSD truncover lrecl=1000000 ;
  array dates [10000] _temporary_;
  input id @;
  if _n_=1 then do;
     do ncol=1 by 1 until(missing(date));
      input date :ddmmyy. @ ;
      dates[ncol]=date;
     end;
     ncol=ncol-1;
     retain ncol;
  end;
  else do col=1 to ncol;
    input value :comma. @;
    date = dates[col];
    output;
  end;
run;

NOTE: 4473 records were read from the infile "c:\downloads\Return_Daily.csv".
      The minimum record length was 8137.
      The maximum record length was 87035.
NOTE: The data set WORK.WANT has 36361832 observations and 5 variables.

Note using either DMY or MDY order to display date strings will confuse half of your audience. Use DATE or YYMMDD format instead.

proc print data=want(obs=10) ;
  where not missing(value);
run;

Results

    Obs      id      col          date    value    ncol

       1    FE0001      1    2000-01-01     9.44    8131
       2    FE0001      2    2000-01-02     9.44    8131
       3    FE0001      3    2000-01-03     9.56    8131
       4    FE0001      4    2000-01-04     9.60    8131
       5    FE0001      5    2000-01-05     9.64    8131
       6    FE0001      6    2000-01-06     9.65    8131
       7    FE0001      7    2000-01-07     9.84    8131
       8    FE0001      8    2000-01-08     9.84    8131
       9    FE0001      9    2000-01-09     9.84    8131
      10    FE0001     10    2000-01-10     9.92    8131

You can probably drop the NCOL variable as it has a constant value.

You might not need to COL variable either.

The maximum length of your ID values in THIS file is only 6. So if that is representative of all future files of this type then you might be able to define the ID variable as shorter than the 20 byte length this code used.

saraphdnz · Posted 04-12-2022 10:13 PM

Thanks, Tom - you are correct, it was a typo this is about CSV.

saraphdnz · Posted 04-13-2022 02:48 AM

Thank heaps for the code and explanation. I am now able to import the data in SAS successfully.

Regards,

Sara

SAS Training: Just a Click Away