Exclude variable reading in infile statement

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 7
Accepted Solution

Exclude variable reading in infile statement

Hi,

I have an extra-large table (more than 1200 variables and more than 1.7 million records) in CSV format to bring into SAS.

I only need 2 variables out of this table.  I currently have a data step that will bring in everything and I keep only the 2 required variables.  However, it is highly inefficient to read and write 100% of the content when in fact I only need less than 1% of it.

Is there a way for me to read/write only the required fields and ignore all the other ones?

Thanks,


Accepted Solutions
Solution
‎02-20-2014 09:02 AM
Respected Advisor
Posts: 3,799

Re: Exclude variable reading in infile statement

One method you might consider is to read fields you are not interested in into a dummy variable.  This will "move" the input to the field of interest.  Any fields after the last field of interest are of no concern and can be ignored.  Consider this example.

data fields2;
   infile cards dsd;
  
length d $1;
  
input 3*d field1 :$16. 2*d field2;
   cards;
one,two,three-three-three,this one,five five,sixsixsix,100,these are,fields,that will,be ignored
;;;;
   run;

View solution in original post


All Replies
Solution
‎02-20-2014 09:02 AM
Respected Advisor
Posts: 3,799

Re: Exclude variable reading in infile statement

One method you might consider is to read fields you are not interested in into a dummy variable.  This will "move" the input to the field of interest.  Any fields after the last field of interest are of no concern and can be ignored.  Consider this example.

data fields2;
   infile cards dsd;
  
length d $1;
  
input 3*d field1 :$16. 2*d field2;
   cards;
one,two,three-three-three,this one,five five,sixsixsix,100,these are,fields,that will,be ignored
;;;;
   run;
Occasional Contributor
Posts: 7

Re: Exclude variable reading in infile statement

Posted in reply to data_null__

Thanks for your quick reply,

I would still be reading and writing 99.9% of a content I do not need.  I'd rather have a solution that would skip reading/writing for these variables.  I figure out if I read/write only 0.1% of the file, my processing should be way faster.

Smiley Happy

Respected Advisor
Posts: 3,799

Re: Exclude variable reading in infile statement

Do you even look a the program I wrote?  It will read very little of the file. And for writing that is only the variables you keep.

Otherwise you will need to give more details.  because just saying you only need to read .1% aint' helping.

You need to tell what you know about the fields you want to read.

Occasional Contributor
Posts: 7

Re: Exclude variable reading in infile statement

Data _null_,

Thank you for your solution.  It trimmed the importing process from more than 3 hours to less than 1.

Smiley Happy

Respected Advisor
Posts: 3,799

Re: Exclude variable reading in infile statement

That still sounds like a very long time or you must have a very slow computer.  Can you show your program?

Occasional Contributor
Posts: 7

Re: Exclude variable reading in infile statement

Posted in reply to data_null__

data fields2;     infile 'O:\RISK.CSV' delimiter = '|' MISSOVER DSD lrecl=32767 firstobs=2;   length d $1;   input 4*d RECORD_NB :$10. 5*d FICCLBV8_SCORE : $5.;   cards; ;;;   run;

Super User
Posts: 5,509

Re: Exclude variable reading in infile statement

When you are reading from a separate file, you don't need a CARDS; statement.  That can be removed.

Since you are reading the 5th and 11th variable, you won't need to expand LRECL to 32767.  You can use the default which, I believe, reads from the first 256 characters on the line.  Just remove lrecl=32767 and see what difference that makes.

Good luck.

Occasional Contributor
Posts: 7

Re: Exclude variable reading in infile statement

Posted in reply to Astounding

Hi Astounding,

It did not change much in terms of processing time.  I still have around an hour (give or take 5 minutes) to bring the finle in.

Thanks for the pointers,

Super User
Posts: 5,509

Re: Exclude variable reading in infile statement

The last attempt I would suggest is a combination of what I originally suggested plus data_null_'s enhancement.  Given that you are using pipe-delimited data, the program would be:

data fields2;

infile 'O:\RISK.CSV' firstobs=2;

length record_nb $10 ficclbv8_score $ 5;

input @;

record_nb = scan(_infile_, 5, '|', 'MQ');

ficclbv8_score = scan(_infile_, 11, '|', 'MQ');

run;

With fingers suitably crossed ...


Occasional Contributor
Posts: 7

Re: Exclude variable reading in infile statement

Posted in reply to Astounding

Hi Astounding,

Processing time is still slightly higher than the original solution.

Thank you for your suggestion. Smiley Happy

Super User
Posts: 5,509

Re: Exclude variable reading in infile statement

Here's an untested approach that may be faster.  Only testing will tell for sure.

data want;

  infile csvfile lrecl=5000;

  input @;

  charvar = scan(_infile_, 97, ',', 'M');

  numvar = input( scan(_infile_, 1104, ',' 'M'), 8.);

run;

The example assumes a few things.  CHARVAR is the 97th variable reading from left to right, and NUMVAR is the 1104th variable.  Also, ALL commas get treated as delimiters.  If you have any other commas in the raw data (such as within text strings, for example), this approach won't work.

Good luck.

Respected Advisor
Posts: 3,799

Re: Exclude variable reading in infile statement

Posted in reply to Astounding
q or Q

ignores delimiters that are inside of substrings that are enclosed in quotation
marks.
If the value of the string argument contains unmatched quotation marks, then scanning
from left to right will produce different words than scanning from right to left.
Super User
Posts: 11,343

Re: Exclude variable reading in infile statement

I don't know quite what you mean about "write" 100% of the data. If you use something like:

data want (keep= var1 var2);

     infile ...

     input ...

run;

then only the two variables you want will be written to the dataset as read. Yes the temporary buffer as each line is read contains the variables up to the second one you want (the input statement shouldn't include any variables past the ones you want to reduce the size of the input buffer), but that is ONE record in memory at a time not written to the disk.

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 13 replies
  • 716 views
  • 0 likes
  • 4 in conversation