☑ This topic is solved.
InspectahDex
Obsidian | Level 7

I am importing two pipe-delimited txt files that are about 10 GB and 20 GB each. The files and the SAS software are located on my C drive. I also have the GUESSINGROWS option set to 1000. Is there a better way to import these files? The import has been running for close to 24 hours now.

 

I've already run the same code on much smaller files (less than 10 GB) and it worked fine, but it took about 10-15 hours.

 

Here's my code snippet for importing one of the files:

 

proc import datafile='C:\Users\Refresh_2023\Data Warehouse 20230830.txt'
    out=mm.DWH
    dbms=dlm
    replace;
  guessingrows=1000;
  delimiter='|';
run;


9 Replies
Tom
Super User

Is there a better way to import these files? 

Yes.  Don't use PROC IMPORT.

 

Write the step to read the file yourself so you have complete control over how the variables are named and defined and how much of the data from the text file you actually need to output to a SAS dataset.

 

InspectahDex
Obsidian | Level 7

Sorry, I just want to make sure I understand correctly.

 

Does this mean I should use an infile statement under the data step?

 

So it would be something like....

 

data test;
length var1 $ 30 var2 $ 5 ... ;
infile "C:....filename.txt" delimiter='|' firstobs=2 dsd;
input var1 $ var2 $ var3 var4....;
format var3 dollar20.2 var4 dollar20.2;
run;

Tom
Super User

@InspectahDex wrote:

Sorry, I just want to make sure I understand correctly.

 

Does this mean I should use an infile statement under the data step?

 

So it would be something like....

 

data test;
length var1 $ 30 var2 $ 5 ... ;
infile "C:....filename.txt" delimiter='|' firstobs=2 dsd;
input var1 $ var2 $ var3 var4....;
format var3 dollar20.2 var4 dollar20.2;
run;


Yes. Exactly.  But I would clean up the data step code.

 

Use the TRUNCOVER option on the INFILE statement in case the last value on the line is missing.

 

Include all of the variables in the LENGTH statement, not just the character ones, so they are defined in the order you want them in the dataset.

 

You do not need to put the $ in the INPUT statement if the variables have already been defined, since SAS does not need to be told twice that a variable is character.

 

If you define the variables in the same order as they appear in the file, the INPUT statement can use a positional variable list.

data test;
  infile "C:....filename.txt" dlm='|' firstobs=2 dsd truncover;
  length var1 $30 var2 $5 var3 var4 8 ;
  input var1 -- var4;
  informat var3 var4 comma. ;
  format var3 var4 dollar20.2;
run;

Or you could skip the LENGTH statement and take advantage of the fact that SAS will set the variable's length to match the width of the informat used where the variable first appears in an INPUT statement.  Make sure to add the colon modifier so the value is read up to the next delimiter instead of as a fixed-width field.

data test;
  infile "C:....filename.txt" dlm='|' firstobs=2 dsd truncover;
  input var1 :$30. var2 :$5. (var3 var4) (:comma.) ;
  format var3 var4 dollar20.2;
run;

 

InspectahDex
Obsidian | Level 7
Wow, thank you! I tried the last data step you shared (the one without the LENGTH statement); it worked and probably also solved some other issues I wasn't aware of.
Reeza
Super User

Adding on to Tom's response: use PROC IMPORT on a smaller subset to generate the data step code, make sure that code is correct, and then run it against the main file.

Use the OBS= system option to limit the number of observations processed to a smaller subset of the file for the initial code generation.

 

option obs=1000000;

proc import code......run;

option obs=max;
*copy code from log to here;
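Putting that sketch together, the whole workflow might look like the following (the output dataset name here is a placeholder, and the path is taken from the original question):

```sas
/* 1. Generate starter code from a 1,000,000-row sample */
options obs=1000000;

proc import datafile='C:\Users\Refresh_2023\Data Warehouse 20230830.txt'
    out=work.dwh_sample
    dbms=dlm
    replace;
  delimiter='|';
  guessingrows=1000;
run;

/* 2. Restore full processing */
options obs=max;

/* 3. Paste the DATA step that PROC IMPORT wrote to the log here,
      review its informats and lengths, then run it on the full file */
```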
ballardw
Super User

If there is any description of what the data contains, use it to write the data step. Proc import is notorious for turning account id "numbers" into actual numeric values, which loses significant leading zeros, such as those in a bank routing number.

If the "document" describing the data is nice enough you can use it to add variable LABELS so people don't have to guess what a variable like Cli_0_DED_rep_pdq means.

And use proper informats for date and datetime values.  Proc Import may well guess that 10/05/03 means 5 Oct 2003 when in reality it should be 3 May 2010. If a field is a Julian date value, Proc Import is almost always going to treat it as a simple numeric field, and you'll have to fix it later; using the proper informat avoids that.

If your data has a long text field, the longest values may not appear among the rows that GUESSINGROWS samples. So if the documentation says the field can hold 512 characters, you can provide that length in your program and not worry about truncation.
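A hand-coded DATA step that applies all three of these points might look like the following sketch (the variable names, lengths, and informats are illustrative guesses, not taken from the actual file layout):

```sas
data mm.dwh;
  infile 'C:\Users\Refresh_2023\Data Warehouse 20230830.txt'
         dlm='|' dsd firstobs=2 truncover;
  input account_id :$17.      /* read as character so leading zeros survive */
        trans_date :yymmdd8.  /* explicit informat: 10/05/03 is read as
                                 3 May 2010, not 5 Oct 2003 */
        comment    :$512. ;   /* full documented width, no truncation */
  format trans_date date9.;
  label account_id = 'Account identifier (leading zeros preserved)';
run;
```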

Tom
Super User

It should not take that long to read a file of that size.  Make sure you are using local disk drives and are not trying to read and/or write to network disks.

 

If you really have no idea what fields are in the file you can use a more efficient tool like https://github.com/sasutils/macros/blob/master/csv2ds.sas  to read the file.

 

If that still takes too long, you can use the PERCENT= parameter to tell it to sample only a small percentage of the records when making its guesses about what types and lengths to use for the variables.

