Hi,
I have an extra-large table (more than 1200 variables and more than 1.7 million records) in CSV format to bring into SAS.
I only need 2 variables out of this table. I currently have a data step that will bring in everything and I keep only the 2 required variables. However, it is highly inefficient to read and write 100% of the content when in fact I only need less than 1% of it.
Is there a way for me to read/write only the required fields and ignore all the other ones?
Thanks,
One method you might consider is to read fields you are not interested in into a dummy variable. This will "move" the input to the field of interest. Any fields after the last field of interest are of no concern and can be ignored. Consider this example.
One method you might consider is to read fields you are not interested in into a dummy variable. This will "move" the input to the field of interest. Any fields after the last field of interest are of no concern and can be ignored. Consider this example.
Thanks for your quick reply,
I would still be reading and writing 99.9% of a content I do not need. I'd rather have a solution that would skip reading/writing for these variables. I figure out if I read/write only 0.1% of the file, my processing should be way faster.
Do you even look a the program I wrote? It will read very little of the file. And for writing that is only the variables you keep.
Otherwise you will need to give more details. because just saying you only need to read .1% aint' helping.
You need to tell what you know about the fields you want to read.
Data _null_,
Thank you for your solution. It trimmed the importing process from more than 3 hours to less than 1.
That still sounds like a very long time or you must have a very slow computer. Can you show your program?
data fields2; infile 'O:\RISK.CSV' delimiter = '|' MISSOVER DSD lrecl=32767 firstobs=2; length d $1; input 4*d RECORD_NB :$10. 5*d FICCLBV8_SCORE : $5.; cards; ;;; run;
When you are reading from a separate file, you don't need a CARDS; statement. That can be removed.
Since you are reading the 5th and 11th variable, you won't need to expand LRECL to 32767. You can use the default which, I believe, reads from the first 256 characters on the line. Just remove lrecl=32767 and see what difference that makes.
Good luck.
Hi Astounding,
It did not change much in terms of processing time. I still have around an hour (give or take 5 minutes) to bring the finle in.
Thanks for the pointers,
The last attempt I would suggest is a combination of what I originally suggested plus data_null_'s enhancement. Given that you are using pipe-delimited data, the program would be:
data fields2;
infile 'O:\RISK.CSV' firstobs=2;
length record_nb $10 ficclbv8_score $ 5;
input @;
record_nb = scan(_infile_, 5, '|', 'MQ');
ficclbv8_score = scan(_infile_, 11, '|', 'MQ');
run;
With fingers suitably crossed ...
Hi Astounding,
Processing time is still slightly higher than the original solution.
Thank you for your suggestion.
Here's an untested approach that may be faster. Only testing will tell for sure.
data want;
infile csvfile lrecl=5000;
input @;
charvar = scan(_infile_, 97, ',', 'M');
numvar = input( scan(_infile_, 1104, ',' 'M'), 8.);
run;
The example assumes a few things. CHARVAR is the 97th variable reading from left to right, and NUMVAR is the 1104th variable. Also, ALL commas get treated as delimiters. If you have any other commas in the raw data (such as within text strings, for example), this approach won't work.
Good luck.
I don't know quite what you mean about "write" 100% of the data. If you use something like:
data want (keep= var1 var2);
infile ...
input ...
run;
then only the two variables you want will be written to the dataset as read. Yes the temporary buffer as each line is read contains the variables up to the second one you want (the input statement shouldn't include any variables past the ones you want to reduce the size of the input buffer), but that is ONE record in memory at a time not written to the disk.
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.