☑ This topic is solved.
InspectahDex
Obsidian | Level 7

I am importing two pipe-delimited txt files that are about 10 GB and 20 GB each. The files and the SAS software are located on my C drive. I also have the GUESSINGROWS option set to 1000. Is there a better way to import these files? The import has been running for close to 24 hours now.

 

I've already run the same code on much smaller files (less than 10 GB) and it worked fine, but it took about 10-15 hours.

 

Here's my code snippet for importing one of the files:

 

proc import datafile='C:\Users\Refresh_2023\Data Warehouse 20230830.txt'
    out=mm.DWH
    dbms=dlm
    replace;
  guessingrows=1000;
  delimiter='|';
run;


9 Replies
Tom
Super User

Is there a better way to import these files? 

Yes.  Don't use PROC IMPORT.

 

Write the step to read the file yourself so you have complete control over how the variables are named and defined and how much of the data from the text file you actually need to output to a SAS dataset.

 

InspectahDex
Obsidian | Level 7

Sorry, I just want to make sure I understand correctly.

 

Does this mean I should use an infile statement under the data step?

 

So it would be something like....

 

data test;
length var1 $ 30 var2 $ 5 ... ;
infile "C:....filename.txt" delimiter='|' firstobs=2 dsd;
input var1 $ var2 $ var3 var4....;
format var3 dollar20.2 var4 dollar20.2;
run;

Tom
Super User

@InspectahDex wrote:

Sorry, I just want to make sure I understand correctly.

 

Does this mean I should use an infile statement under the data step?

 

So it would be something like....

 

data test;
length var1 $ 30 var2 $ 5 ... ;
infile "C:....filename.txt" delimiter='|' firstobs=2 dsd;
input var1 $ var2 $ var3 var4....;
format var3 dollar20.2 var4 dollar20.2;
run;


Yes. Exactly.  But I would clean up the data step code.

 

Use the TRUNCOVER option on the INFILE statement in case the last value on the line is missing.

 

Include all of the variables in the LENGTH statement, not just the character ones, so they are defined in the order you want them in the dataset.

 

You do not need to put the $ in the INPUT statement if the variables have already been defined, since SAS does not need to be told twice that a variable is character.

 

If you define the variables in the same order as they appear in the file, the INPUT statement can use a positional variable list.

data test;
  infile "C:....filename.txt" dlm='|' firstobs=2 dsd truncover;
  length var1 $30 var2 $5 var3 var4 8 ;
  input var1 -- var4;
  informat var3 var4 comma. ;
  format var3 var4 dollar20.2;
run;

Or you could skip the LENGTH statement and take advantage of the fact that SAS will set the variable's length to match the width of the informat used where the variable first appears in an INPUT statement.  Make sure to add the colon modifier so the value is read up to the next delimiter instead of as a fixed-width field.

data test;
  infile "C:....filename.txt" dlm='|' firstobs=2 dsd truncover;
  input var1 :$30. var2 :$5. (var3 var4) (:comma.) ;
  format var3 var4 dollar20.2;
run;

 

InspectahDex
Obsidian | Level 7
Wow, thank you! I tried the last data step you shared (the one without the LENGTH statement); it worked and probably also solved some other issues I wasn't aware of.
Reeza
Super User

Adding on to Tom's response: use PROC IMPORT on a smaller subset to generate the data step code, make sure that code is correct, and then run it against the main file.

Use the OBS= system option to limit the number of observations processed to a smaller subset of the file for the initial code generation.

 

option obs=1000000;

proc import code......run;

option obs=max;
*copy code from log to here;
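Putting that sketch together, the whole workflow might look like the following (the output dataset name here is a placeholder, and the path is taken from the original question):

```sas
/* 1. Generate starter code from a 1,000,000-row sample */
options obs=1000000;

proc import datafile='C:\Users\Refresh_2023\Data Warehouse 20230830.txt'
    out=work.dwh_sample
    dbms=dlm
    replace;
  delimiter='|';
  guessingrows=1000;
run;

/* 2. Restore full processing */
options obs=max;

/* 3. Paste the DATA step that PROC IMPORT wrote to the log here,
      review its informats and lengths, then run it on the full file */
```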
ballardw
Super User

If there is any description of what the data contains, use it to write the data step. Proc import is notorious for turning account id "numbers" into actual numeric values, which loses significant leading zeros, such as those in a bank routing number.

If the "document" describing the data is nice enough you can use it to add variable LABELS so people don't have to guess what a variable like Cli_0_DED_rep_pdq means.

And use proper informats for date and datetime values.  Proc Import may well guess that 10/05/03 means 5 Oct 2003 when in reality it should be 3 May 2010. If a field is a Julian date value, Proc Import is almost always going to treat it as a simple numeric field, and you'll have to fix it later; using the proper informat avoids that.

If your data has a long text field, the longest values may not appear among the rows that GUESSINGROWS samples. So if the documentation says the field can hold 512 characters, you can provide that length in your program and not worry about truncation.
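A hand-coded DATA step that applies all three of these points might look like the following sketch (the variable names, lengths, and informats are illustrative guesses, not taken from the actual file layout):

```sas
data mm.dwh;
  infile 'C:\Users\Refresh_2023\Data Warehouse 20230830.txt'
         dlm='|' dsd firstobs=2 truncover;
  input account_id :$17.      /* read as character so leading zeros survive */
        trans_date :yymmdd8.  /* explicit informat: 10/05/03 is read as
                                 3 May 2010, not 5 Oct 2003 */
        comment    :$512. ;   /* full documented width, no truncation */
  format trans_date date9.;
  label account_id = 'Account identifier (leading zeros preserved)';
run;
```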

Tom
Super User

It should not take that long to read a file of that size.  Make sure you are using local disk drives and are not trying to read and/or write to network disks.

 

If you really have no idea what fields are in the file you can use a more efficient tool like https://github.com/sasutils/macros/blob/master/csv2ds.sas  to read the file.

 

If that still takes too long, you can use the PERCENT= parameter to tell it to sample only a small percentage of the records when making its guesses about what types and lengths to use for the variables.

