BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Niala
Calcite | Level 5

Hi,

I have an extra-large table (more than 1200 variables and more than 1.7 million records) in CSV format to bring into SAS.

I only need 2 variables out of this table.  I currently have a data step that will bring in everything and I keep only the 2 required variables.  However, it is highly inefficient to read and write 100% of the content when in fact I only need less than 1% of it.

Is there a way for me to read/write only the required fields and ignore all the other ones?

Thanks,

1 ACCEPTED SOLUTION

Accepted Solutions
data_null__
Jade | Level 19

One method you might consider is to read fields you are not interested in into a dummy variable.  This will "move" the input to the field of interest.  Any fields after the last field of interest are of no concern and can be ignored.  Consider this example.

data fields2;
   infile cards dsd;
  
length d $1;
  
input 3*d field1 :$16. 2*d field2;
   cards;
one,two,three-three-three,this one,five five,sixsixsix,100,these are,fields,that will,be ignored
;;;;
   run;

View solution in original post

13 REPLIES 13
data_null__
Jade | Level 19

One method you might consider is to read fields you are not interested in into a dummy variable.  This will "move" the input to the field of interest.  Any fields after the last field of interest are of no concern and can be ignored.  Consider this example.

data fields2;
   infile cards dsd;
  
length d $1;
  
input 3*d field1 :$16. 2*d field2;
   cards;
one,two,three-three-three,this one,five five,sixsixsix,100,these are,fields,that will,be ignored
;;;;
   run;
Niala
Calcite | Level 5

Thanks for your quick reply,

I would still be reading and writing 99.9% of a content I do not need.  I'd rather have a solution that would skip reading/writing for these variables.  I figure out if I read/write only 0.1% of the file, my processing should be way faster.

Smiley Happy

data_null__
Jade | Level 19

Do you even look a the program I wrote?  It will read very little of the file. And for writing that is only the variables you keep.

Otherwise you will need to give more details.  because just saying you only need to read .1% aint' helping.

You need to tell what you know about the fields you want to read.

Niala
Calcite | Level 5

Data _null_,

Thank you for your solution.  It trimmed the importing process from more than 3 hours to less than 1.

Smiley Happy

data_null__
Jade | Level 19

That still sounds like a very long time or you must have a very slow computer.  Can you show your program?

Niala
Calcite | Level 5

data fields2;     infile 'O:\RISK.CSV' delimiter = '|' MISSOVER DSD lrecl=32767 firstobs=2;   length d $1;   input 4*d RECORD_NB :$10. 5*d FICCLBV8_SCORE : $5.;   cards; ;;;   run;

Astounding
PROC Star

When you are reading from a separate file, you don't need a CARDS; statement.  That can be removed.

Since you are reading the 5th and 11th variable, you won't need to expand LRECL to 32767.  You can use the default which, I believe, reads from the first 256 characters on the line.  Just remove lrecl=32767 and see what difference that makes.

Good luck.

Niala
Calcite | Level 5

Hi Astounding,

It did not change much in terms of processing time.  I still have around an hour (give or take 5 minutes) to bring the finle in.

Thanks for the pointers,

Astounding
PROC Star

The last attempt I would suggest is a combination of what I originally suggested plus data_null_'s enhancement.  Given that you are using pipe-delimited data, the program would be:

data fields2;

infile 'O:\RISK.CSV' firstobs=2;

length record_nb $10 ficclbv8_score $ 5;

input @;

record_nb = scan(_infile_, 5, '|', 'MQ');

ficclbv8_score = scan(_infile_, 11, '|', 'MQ');

run;

With fingers suitably crossed ...


Niala
Calcite | Level 5

Hi Astounding,

Processing time is still slightly higher than the original solution.

Thank you for your suggestion. Smiley Happy

Astounding
PROC Star

Here's an untested approach that may be faster.  Only testing will tell for sure.

data want;

  infile csvfile lrecl=5000;

  input @;

  charvar = scan(_infile_, 97, ',', 'M');

  numvar = input( scan(_infile_, 1104, ',' 'M'), 8.);

run;

The example assumes a few things.  CHARVAR is the 97th variable reading from left to right, and NUMVAR is the 1104th variable.  Also, ALL commas get treated as delimiters.  If you have any other commas in the raw data (such as within text strings, for example), this approach won't work.

Good luck.

data_null__
Jade | Level 19
q or Q

ignores delimiters that are inside of substrings that are enclosed in quotation
marks.
If the value of the string argument contains unmatched quotation marks, then scanning
from left to right will produce different words than scanning from right to left.
ballardw
Super User

I don't know quite what you mean about "write" 100% of the data. If you use something like:

data want (keep= var1 var2);

     infile ...

     input ...

run;

then only the two variables you want will be written to the dataset as read. Yes the temporary buffer as each line is read contains the variables up to the second one you want (the input statement shouldn't include any variables past the ones you want to reduce the size of the input buffer), but that is ONE record in memory at a time not written to the disk.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 5368 views
  • 0 likes
  • 4 in conversation