Morning all,
I have a daily .txt file being sent to a network drive and need to start interrogating it. The file is very wide - over 15k variables - and about 20k rows long.
Of the ridiculously large set of variables available, I only need to work with up to maybe 500 or so. And I will know what they are before I start working on the file.
Is there a way to import a file of this size with drop/keep-type restrictions applied while the file is being read in, so it speeds up the process? I know I can use a DROP= on the OUT= of a PROC IMPORT, but this takes about an hour to run for each day's file on our machines, so I was looking for any suggestions for speeding it up.
For reference, I am on SAS EG 9.4
Thanks
It all starts with proper documentation. With a good description of the infile structure, you can write the data step from that, or read the documentation file itself and generate the code automatically. In this phase, you can use a reference dataset to include only those variables that need to be kept.
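A minimal sketch of the reference-dataset idea (all names here are hypothetical): store the wanted variable names in a small dataset, pull them into a macro variable, and use that to drive a KEEP= option or a generated INPUT statement.

* hypothetical reference dataset of wanted variable names;
data keepvars;
  input name :$32.;
  datalines;
id
x1000
x1500
;

* turn the list into a macro variable for use in a KEEP= option etc.;
proc sql noprint;
  select name into :keeplist separated by ' '
  from keepvars;
quit;

%put Keeping: &keeplist;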
15k variables points to some kind of transposition. Can't the source deliver the untransposed data instead?
Thanks.
The file comes from an external third party, so any adaptation of the file comes at a hefty cost from them - which apparently isn't in the budget.
You don't actually need to "drop" or "keep" hundreds of variables if you write your input data step correctly.
You can create one variable (I usually name it dummy) of character type and length 1, then read all the variables you do not need into that one variable. Then you only need to drop a single variable.
Your INPUT statement can also stop at the last variable you need; there is no need to read in variables past that point. Both ideas are sketched below.
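A minimal sketch of the dummy-variable trick, with made-up file and field names: keep the 1st, 3rd and 5th fields of a comma-delimited record, read the two in between into the throwaway variable, and stop reading at the last field you want.

data want;
  infile 'daily.txt' dlm=',' firstobs=2 missover;
  length dummy $1;
  * fields 2 and 4 land in dummy (truncated to 1 byte); nothing after field 5 is read;
  input id $ dummy $ age dummy $ income;
  drop dummy;
run;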
The one thing that you do not want to do is rely on PROC IMPORT for routine reading of files over time. Sooner or later the result will have different lengths or variable types, because of the way PROC IMPORT has to "guess" at variable properties each time the file is read.
There are a couple of issues that can lead to slowness here. If the data file is highly structured but slightly variable in terms of columns, you can write code that generates a reader program (similar to EFI, the External File Interface, but much faster due to the expected regularity of the data file).
Example:

Code to create a wide test data file:

data _null_;
  file 'c:\temp\wide.txt' lrecl=2000000;
  * header row: id,x1,...,x15000;
  put 'id' @;
  do _n_ = 1 to 15000;
    name = cats('x', _n_);
    put ',' name +(-1) @;
  end;
  put;
  * 200 data rows of sequential numeric values;
  k = 1e5;
  do id = 1 to 200;
    put id +(-1) @;
    array x(15000);
    do _n_ = 1 to dim(x);
      k + 1;
      x(_n_) = k;
      put ',' x(_n_) +(-1) @;
    end;
    put;
  end;
  drop k;
run;
Code to write a reader:

filename wide 'c:\temp\wide.txt' lrecl=200000;

* takes a LONG time (say 1 minute);
/*
proc import datafile=wide replace out=want(KEEP=id x1000-x1500) dbms=dlm;
  delimiter=',';
run;
*/

* takes way shorter time (say .25 seconds or 300x faster);
filename reader temp;

data _null_;
  file reader;
  put 'DATA want(KEEP=id x1000-x1500);'
    / '  INFILE wide firstobs=2 dlm="," missover;'
    / '  INPUT';
  * read header line with field names;
  infile wide dlm=',' missover;
  length name $32;
  do until (missing(name));
    input name @;
    put name; * write to reader program being dynamically generated;
  end;
  put '; run;';
  stop;
run;

%include reader;
filename reader clear;
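If you also want the generated program to skip the unwanted columns entirely, the two ideas from this thread can be combined. A hedged sketch, assuming a hypothetical reference dataset work.keepvars with one $32 character variable name listing the wanted columns (spelled exactly as in the header): the generator emits each wanted name as-is and reads everything else into a single dummy variable.

filename reader2 temp;

data _null_;
  length name $32;
  * hypothetical keep list: one $32 NAME per wanted column, case matching the header;
  declare hash h(dataset: 'work.keepvars');
  h.definekey('name');
  h.definedone();
  file reader2;
  put 'data want(drop=dummy);'
    / '  infile wide firstobs=2 dlm="," missover;'
    / '  length dummy $1;'
    / '  input';
  infile wide dlm=',' missover;
  do until (missing(name));
    input name @;
    if missing(name) then leave;
    if h.check() = 0 then put '    ' name; * wanted: read under its own name (numeric by default, as in the example data);
    else put '    dummy $';                * unwanted: read into the throwaway;
  end;
  put '  ;' / 'run;';
  stop;
run;

%include reader2;
filename reader2 clear;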