My question: I have a flat file of 247 GB from which I will be creating 8 datasets, and the job takes around 23 hours to complete. Since the file is located on a Unix server I could use the split command to break it up, but the problem there is that I can't have that much extra space occupied by the split files. I would really appreciate any thoughts on this. I am thinking of a solution like parallelism, where the 8 datasets are processed at the same time.
Thanks
Mvk
Remove one of the duplicates of this question.
How about getting the number of lines first with wc -l (or simply dividing the file size by the record size if you have fixed-length records), then splitting that into 8 parts and doing 8 data steps in succession, supplying firstobs= and obs= on the infile statement?
I would not recommend running 8 concurrent jobs against the same file and physical devices; it might cause delays.
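A minimal sketch of the firstobs=/obs= approach described above. Everything here is hypothetical: the path, delimiter, variables, and slice boundaries are placeholders, and the boundaries would come from your own wc -l count (obs= on infile names the last record to read, so each call reads records &first through &last):

```sas
/* Sketch only: path, layout, and boundaries are hypothetical examples. */
/* Suppose wc -l reported 80,000,000 lines; 8 slices of 10M lines each. */
%macro read_slice(n, first, last);
  data slice&n.;
    infile '/data/bigfile.txt' firstobs=&first. obs=&last. dlm='|' truncover;
    input id $ value;
  run;
%mend;

%read_slice(1, 1, 10000000)
%read_slice(2, 10000001, 20000000)
/* ... and so on through slice 8 ... */
```

Run in succession as suggested, this avoids both the extra disk space for split files and the contention of 8 concurrent readers.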
A 247GB file. That's just crazy! I don't have the final answer, but here are some observations/tests and some thoughts.
I believe the likely bottlenecks here are the network (eventually) and writing to disk. It's of course true that reading from a text file is sequential and single-threaded, so having 8 files and 8 processes would speed things up. But splitting your file into 8 slices also requires write operations, so maybe you wouldn't gain that much overall.
I've done some testing with regard to writing the files, on my laptop with SAS 9.4 under Win7, 16GB memory, and a secondary internal e-SATA drive (in the slot for the DVD drive).
What I found, in my environment, is:
- compression made a huge difference (so: writing IS the bottleneck for me)
- using the SPD Engine made a significant difference (as then writing happened multithreaded to both of my disks in parallel)
- there was a weird threshold of around 2GB: when writing more data than that, real time increased disproportionately. The strange thing was that this also happened when I split up the files written, either by using the SPD Engine or by writing macro code which created a new SAS table per GB. I have absolutely no explanation for this; I would have expected a linear increase in real time. I hope someone can shed some light on it.
I've also played around with "bufsize" and "bufno", but for my sample data I didn't really manage to get much better performance (though I did manage to get much worse performance).
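For reference, a hedged sketch of the kind of tuning I mean. The values are examples only, not recommendations; whether any setting helps depends entirely on your I/O subsystem, so benchmark with fullstimer:

```sas
/* Example values only - tune (and measure!) in your own environment. */
options fullstimer;
data out.big(bufsize=128k bufno=10);  /* larger pages, more buffers   */
  set work.staging;
run;
```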
Some modern storage systems have hardware compression which is much better than what "compress=yes" does. So in such an environment turning on compression would only affect network traffic.
You will certainly need to "play around" with coding options in your specific environment because what will work best depends very much on the underlying hardware and the network.
If you split up your data into 8 SAS data sets, each of them will still be around 31GB (uncompressed). I would assume that using the SPD Engine with multiple disks would also be beneficial for querying the data later on.
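For an SPD Engine library spread over several disks, the LIBNAME would look something like this sketch (the paths are hypothetical; DATAPATH= should point at physically separate devices to get the multithreaded write benefit):

```sas
/* Hypothetical paths - metadata on one device, data partitions striped */
/* across two physically separate disks, 1GB per partition file.       */
libname out spde '/sas/meta'
  datapath=('/disk1/data' '/disk2/data')
  partsize=1g;
```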
Some of my test results:
for 1 GB, base engine and compress=no
real time 1.98 seconds
user cpu time 0.04 seconds
system cpu time 1.45 seconds
memory 617.12k
OS Memory 15444.00k
Timestamp 13/03/2014 08:23:57 PM
Step Count 21 Switch Count 59
for 10 GB, base engine and compress=no
real time 2:33.14
user cpu time 2.87 seconds
system cpu time 8.01 seconds
memory 617.71k
OS Memory 16468.00k
Timestamp 13/03/2014 10:28:15 PM
Step Count 134 Switch Count 59
for 10 GB, base engine and compress=yes
real time 9.38 seconds
user cpu time 8.20 seconds
system cpu time 1.12 seconds
memory 605.00k
OS Memory 16468.00k
Timestamp 13/03/2014 10:31:56 PM
Step Count 136 Switch Count 63
for 10 GB with SPDE, 2 disks and compress=no
NOTE: DATA statement used (Total process time):
real time 1:21.82
user cpu time 4.13 seconds
system cpu time 7.33 seconds
memory 1589.87k
OS Memory 18264.00k
Timestamp 13/03/2014 09:04:19 PM
Step Count 65 Switch Count 18251
for 20 GB with SPDE, 2 disks and compress=no
NOTE: DATA statement used (Total process time):
real time 3:37.94
user cpu time 7.56 seconds
system cpu time 15.55 seconds
memory 1644.53k
OS Memory 18520.00k
Timestamp 13/03/2014 09:12:49 PM
Step Count 67 Switch Count 38084
for 30 GB with SPDE, 2 disks and compress=no
NOTE: DATA statement used (Total process time):
real time 7:44.74
user cpu time 12.18 seconds
system cpu time 25.84 seconds
memory 1715.46k
OS Memory 18520.00k
Timestamp 13/03/2014 09:26:33 PM
Step Count 69 Switch Count 55626
for 20 GB with SPDE, 2 disks and compress=yes
real time 18.35 seconds
user cpu time 17.23 seconds
system cpu time 4.07 seconds
memory 1709.59k
OS Memory 18520.00k
Timestamp 13/03/2014 09:32:23 PM
Step Count 71 Switch Count 40590
Here's the code I used:
options fullstimer compress=yes;
data sample;
format a $1024.;
a=repeat('a',500);
run;
/*libname out spde 'c:\temp\tests'*/
/* DATAPATH=('c:\temp\tests' 'd:\temp\tests') PARTSIZE=1G ;*/
libname out 'c:\temp\tests';
data out.crazy(drop=_:);
  set sample;
  /* each row is ~1KB, so 1,048,576 rows per _gig loop = ~1GB */
  do _gig=1 to 10;
    do _i=1 to 1048576;
      output;
    end;
  end;
run;
And here's the version that writes a separate table per GB of data:
options fullstimer compress=no mprint;
data sample;
format a $1024.;
a=repeat('a',500);
run;
libname out 'd:\temp\tests';
%macro test(gig);
  %do i=1 %to &gig;
    data out.crazy_&i.(drop=_:);
      set sample;
      /* 1,048,576 rows of ~1KB each = ~1GB per table */
      do _i=1 to 1048576;
        output;
      end;
    run;
  %end;
%mend;
%test(10);