Morning all,
I have a daily .txt file being sent to a network drive and need to start interrogating it. The file is very wide - over 15k variables - and about 20k rows long.
Of the ridiculously large set of variables available, I only need to work with up to maybe 500 or so. And I will know what they are before I start working on the file.
Is there a way to import a file of this size with drop/keep-type restrictions applied while the file is being read in, so it speeds up the process? I know I can use a DROP= on the OUT= of a PROC IMPORT, but this takes about an hour to run for each day's file on our machines, so I was looking for any suggestions for speeding it up.
For reference, I am on SAS EG 9.4
Thanks
It all starts with proper documentation. With a good description of the infile structure, you can write the data step from that, or read the documentation file itself and generate the code automatically. In this phase, you can use a reference dataset to include only those variables that need to be kept.
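A minimal sketch of the reference-dataset idea (all names here are hypothetical): store the wanted variable names in a small dataset, pull them into a macro variable, and use that to drive a KEEP= option or a generated INPUT statement.

* hypothetical reference dataset of wanted variable names;
data keepvars;
  input name :$32.;
  datalines;
id
x1000
x1500
;

* turn the list into a macro variable for use in a KEEP= option etc.;
proc sql noprint;
  select name into :keeplist separated by ' '
  from keepvars;
quit;

%put Keeping: &keeplist;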
15k variables points to some kind of transposition. Can't the source deliver the untransposed data instead?
Thanks.
The file comes from an external third party, so any adaptation of the file comes at a hefty cost from them - which apparently isn't in the budget.
You don't actually need to "drop" or "keep" hundreds of variables if you write your input data step correctly.
You can create one variable (I usually name it dummy) of character type and length 1, then read all the variables you do not need into that one variable. Then you only need to drop a single variable.
Your INPUT statement can also stop at the last variable you need; there is no need to read in variables past that point. Both ideas are sketched below.
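A minimal sketch of the dummy-variable trick, with made-up file and field names: keep the 1st, 3rd and 5th fields of a comma-delimited record, read the two in between into the throwaway variable, and stop reading at the last field you want.

data want;
  infile 'daily.txt' dlm=',' firstobs=2 missover;
  length dummy $1;
  * fields 2 and 4 land in dummy (truncated to 1 byte); nothing after field 5 is read;
  input id $ dummy $ age dummy $ income;
  drop dummy;
run;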
The one thing that you do not want to do is rely on PROC IMPORT for routine reading of files over time. Sooner or later the result will have different lengths or variable types, because of the way PROC IMPORT has to "guess" at variable properties each time the file is read.
There are a couple of issues that can lead to slowness here. If the data file is highly structured but slightly variable in terms of columns, you can write code that generates a reader program (similar to EFI, the External File Interface, but much faster due to the expected regularity of the data file).
Example:

Code to create a wide test data file:

data _null_;
  file 'c:\temp\wide.txt' lrecl=2000000;
  * header row: id,x1,...,x15000;
  put 'id' @;
  do _n_ = 1 to 15000;
    name = cats('x', _n_);
    put ',' name +(-1) @;
  end;
  put;
  * 200 data rows of sequential numeric values;
  k = 1e5;
  do id = 1 to 200;
    put id +(-1) @;
    array x(15000);
    do _n_ = 1 to dim(x);
      k + 1;
      x(_n_) = k;
      put ',' x(_n_) +(-1) @;
    end;
    put;
  end;
  drop k;
run;
Code to write a reader:

filename wide 'c:\temp\wide.txt' lrecl=200000;

* takes a LONG time (say 1 minute);
/*
proc import datafile=wide replace out=want(KEEP=id x1000-x1500) dbms=dlm;
  delimiter=',';
run;
*/

* takes way shorter time (say .25 seconds or 300x faster);
filename reader temp;

data _null_;
  file reader;
  put 'DATA want(KEEP=id x1000-x1500);'
    / '  INFILE wide firstobs=2 dlm="," missover;'
    / '  INPUT';
  * read header line with field names;
  infile wide dlm=',' missover;
  length name $32;
  do until (missing(name));
    input name @;
    put name; * write to reader program being dynamically generated;
  end;
  put '; run;';
  stop;
run;

%include reader;
filename reader clear;
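If you also want the generated program to skip the unwanted columns entirely, the two ideas from this thread can be combined. A hedged sketch, assuming a hypothetical reference dataset work.keepvars with one $32 character variable name listing the wanted columns (spelled exactly as in the header): the generator emits each wanted name as-is and reads everything else into a single dummy variable.

filename reader2 temp;

data _null_;
  length name $32;
  * hypothetical keep list: one $32 NAME per wanted column, case matching the header;
  declare hash h(dataset: 'work.keepvars');
  h.definekey('name');
  h.definedone();
  file reader2;
  put 'data want(drop=dummy);'
    / '  infile wide firstobs=2 dlm="," missover;'
    / '  length dummy $1;'
    / '  input';
  infile wide dlm=',' missover;
  do until (missing(name));
    input name @;
    if missing(name) then leave;
    if h.check() = 0 then put '    ' name; * wanted: read under its own name (numeric by default, as in the example data);
    else put '    dummy $';                * unwanted: read into the throwaway;
  end;
  put '  ;' / 'run;';
  stop;
run;

%include reader2;
filename reader2 clear;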