Help using Base SAS procedures

reading data from huge flatfile

New Contributor MVK
Posts: 2

reading data from huge flatfile

My question: I have a flat file of size 247 GB, from which I will be creating 8 datasets, and the job takes around 23 hours to complete. As the file is located on a Unix server I could use the split command to break the file up, but the problem is that we can't spare the disk space for the split files. I would really appreciate any thoughts on this. I am thinking of a solution like parallelism, where the 8 datasets are processed at the same time.

Thanks

Mvk

Super User
Posts: 6,499

Re: reading data from huge flatfile

Remove one of the duplicates of this question.

Super User
Posts: 6,936

Re: reading data from huge flatfile

How about getting the number of lines first with wc -l (or simply dividing the file size by the record size if you have fixed-length records), then splitting that into 8 parts and running 8 data steps in succession, where you supply firstobs= and obs= on the INFILE statement?

I would not recommend running 8 concurrent jobs against the same file and physical devices; it might cause delays.
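The approach above might be sketched like this; the path, total line count, and INPUT statement are placeholders for your actual file (a wc -l on 247 GB will itself take a while, so the fixed-record-length division is preferable if it applies):

```sas
/* Suppose wc -l reported 800,000,000 lines; make 8 slices of 100,000,000. */
%macro read_slices(total=800000000, slices=8);
  %let per = %eval(&total / &slices);
  %do s = 1 %to &slices;
    %let first = %eval((&s - 1) * &per + 1);
    %let last  = %eval(&s * &per);      /* last slice: any remainder needs handling */
    data out.slice&s;
      infile '/path/to/hugefile.txt' firstobs=&first obs=&last;
      input myvar $100.;                /* replace with your real INPUT statement */
    run;
  %end;
%mend;
%read_slices();
```

Note that even run in succession, each data step still has to scan the file sequentially up to its firstobs= record, so the later slices pay an increasing skip-ahead cost.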

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Respected Advisor
Posts: 3,887

Re: reading data from huge flatfile

A 247 GB file. That's just crazy! I don't have the final answer, just some observations, tests, and thoughts.

I believe the likely bottlenecks here are the network (eventually) and writing to disk. It's true that reading from a text file is sequential and single-threaded, so having 8 files and 8 processes would speed things up. But splitting your file into 8 slices also requires write operations, so maybe you wouldn't gain that much overall.

I've done some testing with regard to writing the files on my laptop with SAS 9.4 under Win7, 16 GB of memory and a secondary internal e-SATA drive (in the slot for the DVD drive).

What I found (in my environment):

- compression made a huge difference (so: writing IS the bottleneck for me)

- using the SPD Engine made a significant difference (as then writing happened multithreaded to both of my disks in parallel)

- there was a weird threshold around 2 GB: when writing more data than that, performance dropped significantly, with real time growing much faster than linearly. The strange thing was that this also happened when I split up the files being written, either by using the SPD Engine or by writing some macro code which created a new SAS table per GB. I have absolutely no explanation for this; I would have expected a linear increase in real time. I hope someone can shed some light on it.

I've also played around with BUFSIZE= and BUFNO=, but for my sample data I didn't manage to get much better performance (though I did manage to get much worse performance).
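For reference, those buffer options can be set per output data set; the values below are arbitrary examples to experiment with, not recommendations:

```sas
/* BUFSIZE= sets the page size of the output data set,         */
/* BUFNO= the number of buffers allocated for processing it.   */
data out.crazy(bufsize=128k bufno=10 compress=yes);
  set sample;
run;
```

Larger pages and more buffers can reduce I/O calls, but as noted above the effect is very environment-dependent and can easily go the wrong way.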

Some modern storage systems have hardware compression which is much better than what COMPRESS=YES does. In such an environment, turning on compression would mainly reduce network traffic.

You will certainly need to play around with coding options in your specific environment, because what works best depends very much on the underlying hardware and the network.

If you split up your data into 8 SAS data sets, these files will still be around 31 GB each (uncompressed). I would assume that using the SPD Engine with multiple disks would also be beneficial for querying the data later on.
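An SPD Engine libname spreading the data over two disks might look like this (the paths and PARTSIZE value are placeholders for your environment):

```sas
libname spd spde 'c:\temp\meta'                /* metadata path                 */
  datapath=('c:\temp\data1' 'd:\temp\data2')   /* data striped over two disks   */
  partsize=1g                                  /* start a new partition per 1 GB */
  compress=yes;                                /* compress the data partitions  */
```

SPDE then reads and writes the data partitions multithreaded across the DATAPATH= locations, which is where the parallelism came from in my tests below.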

Some of my test results:

for 1 GB, base engine and compress=no

      real time           1.98 seconds

      user cpu time       0.04 seconds

      system cpu time     1.45 seconds

      memory              617.12k

      OS Memory           15444.00k

      Timestamp           13/03/2014 08:23:57 PM

      Step Count                        21  Switch Count  59

  

for 10 GB, base engine and compress=no

      real time           2:33.14

      user cpu time       2.87 seconds

      system cpu time     8.01 seconds

      memory              617.71k

      OS Memory           16468.00k

      Timestamp           13/03/2014 10:28:15 PM

      Step Count                        134  Switch Count  59 

  

for 10 GB, base engine and compress=yes

      real time           9.38 seconds

      user cpu time       8.20 seconds

      system cpu time     1.12 seconds

      memory              605.00k

      OS Memory           16468.00k

      Timestamp           13/03/2014 10:31:56 PM

      Step Count                        136  Switch Count  63

  

  

for 10 GB with SPDE, 2 disks and compress=no

NOTE: DATA statement used (Total process time):

      real time           1:21.82

      user cpu time       4.13 seconds

      system cpu time     7.33 seconds

      memory              1589.87k

      OS Memory           18264.00k

      Timestamp           13/03/2014 09:04:19 PM

      Step Count                        65  Switch Count  18251

  

for 20 GB with SPDE, 2 disks and compress=no

   NOTE: DATA statement used (Total process time):

      real time           3:37.94

      user cpu time       7.56 seconds

      system cpu time     15.55 seconds

      memory              1644.53k

      OS Memory           18520.00k

      Timestamp           13/03/2014 09:12:49 PM

      Step Count                        67  Switch Count  38084

for 30 GB with SPDE, 2 disks and compress=no

NOTE: DATA statement used (Total process time):

      real time           7:44.74

      user cpu time       12.18 seconds

      system cpu time     25.84 seconds

      memory              1715.46k

      OS Memory           18520.00k

      Timestamp           13/03/2014 09:26:33 PM

      Step Count                        69  Switch Count  55626

  

for 20 GB with SPDE, 2 disks and compress=yes

      real time           18.35 seconds

      user cpu time       17.23 seconds

      system cpu time     4.07 seconds

      memory              1709.59k

      OS Memory           18520.00k

      Timestamp           13/03/2014 09:32:23 PM

      Step Count                        71  Switch Count  40590  

Here is the code I used:

options fullstimer compress=yes;

data sample;
  format a $1024.;     /* one variable, 1 KB wide */
  a=repeat('a',500);
run;

/*
libname out spde 'c:\temp\tests'
  DATAPATH=('c:\temp\tests' 'd:\temp\tests') PARTSIZE=1G;
*/

libname out 'c:\temp\tests';

data out.crazy(drop=_: );
  set sample;

  /* each inner loop writes 1,048,576 rows of ~1 KB, i.e. ~1 GB per _gig */
  do _gig=1 to 10;
    do _i=1 to 1048576;
      output;
    end;
  end;
run;

And here is the version where I wrote a separate table per GB of data:

options fullstimer compress=no mprint;

data sample;
  format a $1024.;
  a=repeat('a',500);
run;

libname out 'd:\temp\tests';

%macro test(gig);
  %do i=1 %to &gig;
    data out.crazy_&i. (drop=_: );
      set sample;
      do _i=1 to 1048576;
        output;
      end;
    run;
  %end;
%mend;

%test(10);
