Jamie_H
Fluorite | Level 6

Morning all, 

 

I have a daily .txt file being sent to a network drive and need to start interrogating it.  The file is very wide - over 15k variables - and about 20k rows long.

 

Of the ridiculously large set of variables available, I only need to work with up to maybe 500 or so.  And I will know what they are before I start working on the file. 

 

Is there a way to import a file of this size with drop/keep-type restrictions applied while the file is being read in, to speed up the process?  I know I can use a DROP on the OUT= of a PROC IMPORT, but this takes about an hour to run for each day's file on our machines, so I was looking for any suggestions for speeding it up.

 

For reference, I am on SAS EG 9.4

 

Thanks

4 REPLIES
Kurt_Bremser
Super User

It all starts with proper documentation. With proper documentation of the infile structure, you can write the data step from that, or read the documentation file itself and create the code automatically. In this phase, you can use a reference dataset to include only those variables that need to be kept.
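For instance (a sketch only; the dataset and variable names here are assumptions, and the $50. informat is a placeholder you would take from the documentation): if the layout documentation has been read into WORK.LAYOUT with NAME (field name, in file order) and WANTED (1 = keep), the reading step can be generated automatically:

filename gen temp;

data _null_;
  file gen;
  set layout end=last;
  if _n_ = 1 then put
      'data want(drop=dummy);'
    / '  infile "daily.txt" dlm="," firstobs=2 missover;'
    / '  length dummy $1;'
    / '  input';
  /* wanted fields get a real name; everything else lands in DUMMY */
  if wanted then put '    ' name ' : $50.';
  else           put '    dummy $';
  if last then put '  ;' / 'run;';
run;

%include gen;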

 

15k variables points to some kind of transposition. Can't the source deliver the untransposed data instead?

Jamie_H
Fluorite | Level 6

Thanks. 

 

The file comes from an external 3rd party, so any adaptation of the file comes at a hefty cost from them - which apparently isn't in the budget.

ballardw
Super User

You don't actually need to "drop" or "keep" hundreds of variables if you write your input data step correctly.

You can create one variable (I usually name it DUMMY, character type, length 1) and read all the variables you do not need into that one variable. Then you only need to drop one variable.

Your input statement can stop at the last variable you need, no need to read in variables that you do not want.
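A minimal sketch of the idea, assuming a comma-delimited file and hypothetical field names, where only the 2nd and 5th fields are wanted:

data want;
  infile 'daily.txt' dlm=',' firstobs=2 missover;
  length dummy $1;
  /* fields 1, 3 and 4 all land in DUMMY; INPUT stops at field 5 */
  input dummy $ cust_id $ dummy $ dummy $ balance;
  drop dummy;
run;

Reading the same variable repeatedly in one INPUT statement is legal; each read simply overwrites the last, so one dropped variable absorbs every unwanted field.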

 

The one thing that you do not want to do is rely on Proc Import for routine reading of files over time. There will come a time when the result has different lengths or variable types, because of the way that Proc Import has to "guess" at variable properties each time the file is read.

 

 

 

RichardDeVen
Barite | Level 11

There are several issues that can lead to slowness:

 

  • File size
    • What is it? For 8 characters per field plus a delimiter, you have a ~2.7 GB file.
  • Network file
    • Resource contention or administrative throttling 
  • Proc IMPORT 
    • What code are you using to import?
    • Some secret sauce of Proc IMPORT is that it runs EFI code that 'discovers' the data file structure and the discovery is likely the highest consumer of your precious time
  • Faster? INPUT statement
    • A 'delivered' data file is likely highly structured or has a fixed structure.
    • Fixed structure - code a DATA step with appropriate INPUT statement to read the data fields.  This will run astonishingly faster than Proc IMPORT.

 

If the data file is highly structured but slightly variable in terms of columns, you can write code to generate a reader program (similar to EFI, but much faster due to the expected regularity of the data file).

 

Example:

(code to create a wide data file)

data _null_;
  file 'c:\temp\wide.txt' lrecl=2000000;

  put 'id' @;
  do _n_ = 1 to 15000;
    name=cats('x',_n_);
    put ',' name +(-1) @;
  end;
  put;

  k = 1e5;

  do id = 1 to 200;
    put id +(-1) @;
    array x(15000);
    do _n_ = 1 to dim(x);
      k + 1;
      x(_n_) = k;
      put ',' x(_n_) +(-1) @;
    end;
    put;
  end;
  drop k;
run;

Code to write a reader

filename wide 'c:\temp\wide.txt' lrecl=2000000;

* takes a LONG time (say 1 minute);
/*
proc import datafile=wide replace out=want(KEEP=id x1000-x1500) dbms=dlm;
  delimiter = ',';
run;
*/

* takes way shorter time (say .25 seconds or 300x faster);

filename reader temp;

data _null_;
  file reader;
  put 
    'DATA want(KEEP=id x1000-x1500);'
  / '  INFILE wide firstobs=2 dlm="," missover;'
  / '  INPUT'
  ;

  * read header line with field names;
  infile wide dlm=',' missover;
  length name $32;
  do until (missing(name));
    input name @;
    put name;   * write to reader program being dynamically generated;
  end;

  put '; run;';

  stop;
run;

%include reader;

filename reader clear;

