MVK
Calcite | Level 5

My question: I have a flat file of 247 GB from which I will be creating 8 datasets, and this job takes around 23 hours to complete. Since the file is located on a Unix server, I could use the split command to break it up, but I can't have that much additional space occupied by the split files. I would really appreciate any thoughts on this. I am thinking of a solution like parallelism, where the 8 datasets are processed at the same time.

Thanks

Mvk

3 REPLIES
Tom
Super User

Remove one of the duplicates of this question.

Kurt_Bremser
Super User

How about getting the number of lines first with wc -l (or simply dividing the file size by the record size if you have fixed-length records), then splitting that into 8 parts and doing 8 data steps in succession, where you supply firstobs= and obs= on the infile statement?
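
A minimal sketch of that idea; the file path, line count, and record layout below are hypothetical placeholders, and the steps run one after the other rather than concurrently:

%* suppose wc -l reported 800,000,000 lines for the flat file ;
%let total_lines = 800000000;
%let slices      = 8;
%let per_slice   = %sysevalf(&total_lines / &slices, ceil);

%macro read_slices;
  %do s = 1 %to &slices;
    %let first = %eval((&s - 1) * &per_slice + 1);
    %let last  = %eval(&s * &per_slice);

    /* read only records &first through &last of the same flat file */
    data work.part&s;
      infile '/data/bigfile.txt' firstobs=&first obs=&last lrecl=200 truncover;
      input id 1-10 amount 11-20 name $ 21-50;   /* hypothetical record layout */
    run;
  %end;
%mend read_slices;

%read_slices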

I would not recommend running 8 concurrent jobs against the same file and physical devices, as it might cause delays.

Patrick
Opal | Level 21

A 247GB file. That's just crazy! I don't have the final answer, but here are some observations, tests, and thoughts.

I believe the likely bottlenecks here are the network (possibly) and writing to disk. It's of course true that reading from a text file is sequential and single-threaded, so having 8 files and 8 processes would speed things up. But splitting your file into 8 slices also requires write operations, so maybe you wouldn't gain that much overall.

I've done some testing regarding writing the files on my laptop with SAS 9.4 under Win7, 16GB of memory, and a secondary internal e-SATA drive (in the slot for the DVD drive).

What I found (in my environment):

- compression made a huge difference (so: writing IS the bottleneck for me)

- using the SPD Engine made a significant difference (as then writing happened multithreaded to both of my disks in parallel)

- there was a weird threshold of around 2GB: when writing more data than that, real time grew disproportionately. The strange thing was that this also happened when I split up the files written, either by using the SPD Engine or by writing some macro code that created a new SAS table per GB. I have absolutely no explanation for this; I would have expected a linear increase in real time. I hope someone can shed some light on this.

I've also played around with bufsize= and bufno=, but for my sample data I didn't really manage to get much better performance (though I did manage to get much worse performance).
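
For illustration, the options involved look roughly like this; the paths, sizes, and values are placeholders, not the exact settings from my tests:

/* Illustrative only - paths and values are placeholders */
options compress=yes   /* compress observations in output data sets      */
        bufsize=128k   /* page size for newly created Base engine tables */
        bufno=10;      /* number of I/O buffers allocated per data set   */

/* SPD Engine library with data partitions spread across two physical disks */
libname spdout spde 'c:\temp\meta'
        datapath=('c:\temp\data1' 'd:\temp\data2')
        partsize=1g;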

Some modern storage systems have hardware compression which is much better than what "compress=yes" does. So in such an environment turning on compression would only affect network traffic.

You will certainly need to "play around" with coding options in your specific environment because what will work best depends very much on the underlying hardware and the network.

If you split up your data into 8 SAS data sets, these files will still be around 31GB each (uncompressed). I would assume that using the SPD Engine with multiple disks would also be beneficial for querying the data later on.
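
As a rough illustration of that querying benefit (table and variable names are hypothetical, reusing the placeholder spdout library from above), the SPD Engine can evaluate a WHERE clause with multiple threads across the partitions on both disks:

/* Hypothetical downstream query against an SPDE table */
proc sql;
  select count(*) as n_big_amounts
  from spdout.big
  where amount > 1000;
quit;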

Some of my test results:

for 1 GB, base engine and compress=no
      real time           1.98 seconds
      user cpu time       0.04 seconds
      system cpu time     1.45 seconds
      memory              617.12k
      OS Memory           15444.00k
      Timestamp           13/03/2014 08:23:57 PM
      Step Count                        21  Switch Count  59

for 10 GB, base engine and compress=no
      real time           2:33.14
      user cpu time       2.87 seconds
      system cpu time     8.01 seconds
      memory              617.71k
      OS Memory           16468.00k
      Timestamp           13/03/2014 10:28:15 PM
      Step Count                        134  Switch Count  59

for 10 GB, base engine and compress=yes
      real time           9.38 seconds
      user cpu time       8.20 seconds
      system cpu time     1.12 seconds
      memory              605.00k
      OS Memory           16468.00k
      Timestamp           13/03/2014 10:31:56 PM
      Step Count                        136  Switch Count  63

for 10 GB with SPDE, 2 disks and compress=no
NOTE: DATA statement used (Total process time):
      real time           1:21.82
      user cpu time       4.13 seconds
      system cpu time     7.33 seconds
      memory              1589.87k
      OS Memory           18264.00k
      Timestamp           13/03/2014 09:04:19 PM
      Step Count                        65  Switch Count  18251

for 20 GB with SPDE, 2 disks and compress=no
NOTE: DATA statement used (Total process time):
      real time           3:37.94
      user cpu time       7.56 seconds
      system cpu time     15.55 seconds
      memory              1644.53k
      OS Memory           18520.00k
      Timestamp           13/03/2014 09:12:49 PM
      Step Count                        67  Switch Count  38084

for 30 GB with SPDE, 2 disks and compress=no
NOTE: DATA statement used (Total process time):
      real time           7:44.74
      user cpu time       12.18 seconds
      system cpu time     25.84 seconds
      memory              1715.46k
      OS Memory           18520.00k
      Timestamp           13/03/2014 09:26:33 PM
      Step Count                        69  Switch Count  55626

for 20 GB with SPDE, 2 disks and compress=yes
      real time           18.35 seconds
      user cpu time       17.23 seconds
      system cpu time     4.07 seconds
      memory              1709.59k
      OS Memory           18520.00k
      Timestamp           13/03/2014 09:32:23 PM
      Step Count                        71  Switch Count  40590

Here is the code I've used:

options fullstimer compress=yes;

/* one wide character variable: 1,024 bytes per row before compression */
data sample;
  format a $1024.;
  a=repeat('a',500);
run;

/* SPDE variant used for the multi-disk tests:
libname out spde 'c:\temp\tests'
  datapath=('c:\temp\tests' 'd:\temp\tests') partsize=1g;
*/

libname out 'c:\temp\tests';

/* 1,048,576 rows of 1,024 bytes per _gig loop, i.e. roughly 1 GB per iteration */
data out.crazy(drop=_: );
  set sample;

  do _gig=1 to 10;
    do _i=1 to 1048576;
      output;
    end;
  end;
run;

And here is the one where I wrote a separate table per GB of data:

options fullstimer compress=no mprint;

data sample;
  format a $1024.;
  a=repeat('a',500);
run;

libname out 'd:\temp\tests';

%macro test(gig);
  %do i=1 %to &gig;
    data out.crazy_&i. (drop=_: );
      set sample;
      do _i=1 to 1048576;
        output;
      end;
    run;
  %end;
%mend;

%test(10);


