WORK disk space: DWH fact/master table joining 35 tables 150m rows

Contributor
Posts: 71

WORK disk space: DWH fact/master table joining 35 tables 150m rows


Dear SAS Experts,

I am struggling to build a DWH fact table. We can't use our Oracle DB; it's too weak to do the joins.
I pulled all the transactional data about trials and trial results into the SAS environment, about 36 GB of data.
I also pulled all ~35 dimension tables; each has about 10,000 rows and is just a few MB in size.

The error I get is the following:

ERROR: There is not enough WORK disk space to store the results of an internal sorting phase.

Using Google I found the advice to set SORTSIZE and MEMSIZE to MAX. That did not help at all:
the same error appeared, and I lost 450 GB of space on my Unix machine in 30 minutes.
So that was a no-go as well, and I have no clue what to do next. In the past I built a few DWHs in Oracle, and the
data load I have does not seem like a crazy amount to me. We have a brand new server with 4 CPUs and 45 GB of memory, but I can't use up
450 GB on generated data coming from 36 GB of source data. We set MEMSIZE back to the default, restarted the server, and are back to square one.
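For reference, this is roughly how I set them (SORTSIZE can be changed at run time, but MEMSIZE only takes effect at invocation, so it has to go into the config file or onto the command line):

options sortsize=MAX;                    /* run-time option */
/* in sasv9.cfg or on the command line:  -memsize MAX        */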

/* options I checked while diagnosing the problem */
proc options group=memory; run;
proc options option=utilloc; run;
proc options option=threads; run;
proc options option=workperms; run;
proc options option=memsize; run;
proc options option=fullstimer; run;

proc sql;
CREATE TABLE sasdwhb.fact AS
SELECT
      /*...*/
      /*...*/
      /*for simplicity I removed the selection of columns*/
      
      /*Joins on dimensions
      ---------------------trials, about 60m rows---------------------*/
 
      FROM SASDWHB.TRIAL_ALL STA

      LEFT JOIN SASDWH_M.M... T11
      ON STA.MD..._ID = T11.M...._ID


      LEFT JOIN SASDWH_M.... T22
      ON T11.LI...._ID = T22.L...._ID


      LEFT JOIN SASDWH_M.K001 K122
      ON STA.AK = K122.ID  
                               

      LEFT JOIN SASDWH_M.... K1
      ON K122.K.... = K1.A....             

      LEFT JOIN SASDWH_M.K002 K2
      ON STA.OM = K2.ID

      LEFT JOIN SASDWH_M.K003 K31
      ON STA.MK = K31.ID

      LEFT JOIN SASDWH_M.K003 K32
      ON STA.MK_ = K32.ID

      LEFT JOIN SASDWH_M.K004 K4
      ON STA.PA = K4.ID

      LEFT JOIN SASDWH_M.K005 K5
      ON STA.APK = K5.ID
                                       
      LEFT JOIN SASDWH_M.K006 K6
      ON STA.ZA = K6.ID

      LEFT JOIN SASDWH_M.K007 K71
      ON STA.BE1 = K71.ID
                                      
      LEFT JOIN SASDWH_M.K007 K72
      ON STA.BE2 = K72.ID
                                       
      LEFT JOIN SASDWH_M.K007 K73
      ON STA.BE3 = K73.ID

      LEFT JOIN SASDWH_M.K008 K8
      ON STA.BA = K8.ID
                               
      LEFT JOIN SASDWH_M.BE.... K881
      ON SUBSTR(K8.K....,1,2)  = K881.BE....               

      LEFT JOIN SASDWH_M.K009 K91
      ON STA.PG = K91.ID
                                       
      LEFT JOIN SASDWH_M.K009 K92
      ON STA.HG = K92.ID

      LEFT JOIN SASDWH_M.BU... K111
      ON SUBSTR(K92.KO....,1,2) = K111.BL        

      LEFT JOIN SASDWH_M.K010 K10
      ON STA.HS = K10.ID
                               
      LEFT JOIN SASDWH_M.STA.. K112
      ON SUBSTR(K10.KO....,1,3) = put(K112.ST....,3.)                       

      LEFT JOIN SASDWH_M.K011 K11
      ON STA.NAH = K11.ID
                                  
      LEFT JOIN SASDWH_M.K012 K12
      ON STA.VA = K12.ID

      LEFT JOIN SASDWH_M.K013 K13
      ON STA.VP = K13.ID
                                       
      LEFT JOIN SASDWH_M.K014 K14
      ON STA.BT = K14.ID


      /*---------------------trial results, about 90m rows---------------------*/
      INNER JOIN SASDWHB.ERG ERG
              ON ERG.MDSA...._ID = STA.MDS...._ID
    
      LEFT JOIN SASDWH_M.K015 K15
             ON ERG.PO = K15.ID
                           
      LEFT JOIN SASDWH_M.K016 K161
             ON ERG.PM = K161.ID
                                      
      LEFT JOIN SASDWH_M.parameter K888
             ON SUBSTR(K161.KO...,1,2) = K888.PA..._OG                    

      LEFT JOIN SASDWH_M.K016 K162
             ON ERG.BZP = K162.ID

      LEFT JOIN SASDWH_M.K017 K171
             ON ERG.MEH = K171.ID

      LEFT JOIN SASDWH_M.K017 K172
             ON ERG.MEH_ = K172.ID

      LEFT JOIN SASDWH_M.K018 K181
             ON ERG.BME = K181.ID

      LEFT JOIN SASDWH_M.K018 K182
             ON ERG.BME_ = K182.ID

      LEFT JOIN SASDWH_M.K019 K191
             ON ERG.ME__K = K191.ID

      LEFT JOIN SASDWH_M.K020 K20
             ON ERG.BW = K20.ID

      LEFT JOIN SASDWH_M.K021 K21
             ON ERG.MS = K21.ID

      LEFT JOIN SASDWH_M.K022 K22
             ON ERG.VB = K22.ID
 
;
QUIT;
Trusted Advisor
Posts: 1,400

Re: WORK disk space: DWH fact/master table joining 35 tables 150m rows

A few ideas to try:

1) Use the option COMPRESS=YES, defined in the configuration file (see the sketch after this list).
    You'll pay in CPU time but, depending on your data, you may save roughly 30% of disk space.

2) Are you loading all the tables at the same time? By changing the order of the ETL (extract-transform-load) jobs
    you may need less WORK space.
    You need to know how many jobs run in parallel at maximum; you may be able to reduce that number.
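For example, compression can be switched on for the whole session or per data set (BINARY often compresses numeric-heavy tables better; FACT_TEMP is just a hypothetical input here):

options compress=yes;                        /* session-wide */

data sasdwhb.fact (compress=binary);         /* or per data set */
    set sasdwhb.fact_temp;
run;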

 

 

Trusted Advisor
Posts: 1,400

Re: WORK disk space: DWH fact/master table joining 35 tables 150m rows

One more point:

Try to split the SQL step into a few smaller steps running in a queue!
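A sketch of what I mean (DESCR is a placeholder column name, since the select clause was elided): join a handful of dimensions per step and delete the intermediates as you go:

proc sql;
    create table work.step1 as
    select sta.*, k2.descr as om_descr       /* a few dimensions at a time */
    from sasdwhb.trial_all sta
    left join sasdwh_m.k002 k2 on sta.om = k2.id;
quit;

proc sql;
    create table work.step2 as
    select s1.*, k4.descr as pa_descr
    from work.step1 s1
    left join sasdwh_m.k004 k4 on s1.pa = k4.id;
quit;

proc delete data=work.step1; run;            /* free WORK space between steps */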

Super User
Posts: 5,260

Re: WORK disk space: DWH fact/master table joining 35 tables 150m rows

"... building a DWH fact table"

"... about 35 dimensions"

I'm a bit puzzled. The normal process is not to build a fact table with data from dimensions. Dimensional data is joined at query time, ideally with some nice filtering and aggregation.

Are you in fact building a large denormalized analytical base table or similar?

Does the joining just fetch the dimensional keys (I can't tell, since you left the select clause out)?

If yes, you can look at the way Data Integration Studio generates code for this type of operation: it uses a data step and loads the dimension tables (business key/surrogate key pairs) into hash tables. The look-up is done in that data step.
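A minimal sketch of that pattern for one dimension (DESCR is a hypothetical clear-text column, since the real column names were left out; repeat the DECLARE/FIND pair per dimension):

data sasdwhb.fact_sketch;
    if 0 then set sasdwh_m.k001(keep=id descr);   /* define PDV host variables */
    if _n_ = 1 then do;
        declare hash k1(dataset: "sasdwh_m.k001");
        k1.defineKey("id");
        k1.defineData("descr");
        k1.defineDone();
    end;
    set sasdwhb.trial_all;                        /* the big table streams through once */
    if k1.find(key: ak) ne 0 then call missing(descr);  /* LEFT JOIN semantics */
    drop id;
run;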

 

Data never sleeps
Super User
Posts: 9,687

Re: WORK disk space: DWH fact/master table joining 35 tables 150m rows

HoHo. You have a big table, and that is not good for SQL.
Try a hash table.

Contributor
Posts: 71

Re: WORK disk space: DWH fact/master table joining 35 tables 150m rows

Hello,
thanks for all your replies.
@Shmuel, I added COMPRESS and did the inner join first,
i.e. I created a table called FACT_TEMP just joining
TRIAL_ALL and ERG, so about 150m rows.
That worked. BUT when I then did the 33 other, small, joins
to FACT_TEMP, I got the same error again: not
enough WORK space. There are 590 GB of free space
on the disk :/

@LinusH, yes, you are right, it should say building a data mart :)
a large, very denormalized analytical base table.
Good idea with DI Studio! I will try that right away.

@Ksharp, thanks for the idea. I have not looked at hash tables at all.
Contributor
Posts: 71

Re: WORK disk space: DWH fact/master table joining 35 tables 150m rows

Will my 14 GB of memory be enough for a hash table? And can I make the hash table available in Web Report Studio through an information map?
Super User
Posts: 9,687

Re: WORK disk space: DWH fact/master table joining 35 tables 150m rows

You don't need a hash table for the big table, only for those 35 small tables. In other words, 14 GB is totally enough.
Contributor
Posts: 71

Re: WORK disk space: DWH fact/master table joining 35 tables 150m rows


OK, my SAS DI Studio lookup job is running. Let's see how it performs.

Contributor
Posts: 71

Re: WORK disk space: DWH fact/master table joining 35 tables 150m rows

Hi,

I added a screen print of my DI Studio job.

The job runs for 1 hour and then fails with the error that there is not enough space in SASDWHB.FACT.DATA

and that SASDWHB.FACT.DATA is damaged. There is still over 500 GB of free space on the server.

I wonder if some Unix file size restrictions are in place.
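One way to check from inside SAS, assuming XCMD is allowed (this just pipes the shell's ulimit report into the log):

filename lim pipe 'ulimit -a';
data _null_;
    infile lim;
    input;
    put _infile_;        /* look at the 'file size' line */
run;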

Bye

 


lookup.png
Trusted Advisor
Posts: 1,400

Re: WORK disk space: DWH fact/master table joining 35 tables 150m rows

The disk space you see after the failure is the net disk space left after the SAS session exits;
in order to do the job, SAS needs additional space for its work files.
Probably there really is not enough space to both do the work and save the results.

Can you assign more temporary disk space to the job?
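For example, WORK and the utility-file location can be pointed at a bigger file system when SAS is started (paths here are hypothetical):

sas -work /bigdisk/saswork -utilloc /bigdisk/sasutil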

Super User
Posts: 5,260

Re: WORK disk space: DWH fact/master table joining 35 tables 150m rows

With tables that size, you should use SPDE.
I can't tell from the information provided what the actual problem is. Are you sure the free space you see isn't measured after your failed fact table has already been cleaned away? The server can have several file systems; be sure that you check the right one...
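A sketch of an SPDE libref that spreads the table over several file systems (all paths hypothetical):

libname spdefact spde '/bigdisk/meta'
    datapath=('/disk1/data' '/disk2/data')
    partsize=10g;        /* split the table into 10 GB partitions */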
Data never sleeps
Trusted Advisor
Posts: 1,400

Re: WORK disk space: DWH fact/master table joining 35 tables 150m rows

As much as I understand SPDE, you pay in DISK to save time.

SPDE needs more disk space in order to maintain the data and indexes split into many sub-files.

I have the feeling that SPDE is not the solution.

Super User
Posts: 5,260

Re: WORK disk space: DWH fact/master table joining 35 tables 150m rows

Have you done any calculation of the size of the resulting table? Denormalizing with that many dimensions is bound to be a real disk eater. There's a reason for star schemas. Is there an actual requirement to have all those dimensions available in a single table...?
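A quick back-of-the-envelope check against whatever intermediate you already have (FACT_TEMP here), using the uncompressed record length:

proc sql;
    select nobs, obslen,
           nobs * obslen / 1024**3 as est_gb format=12.1
    from dictionary.tables
    where libname = 'SASDWHB' and memname = 'FACT_TEMP';
quit;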
Data never sleeps
Contributor
Posts: 71

Re: WORK disk space: DWH fact/master table joining 35 tables 150m rows


@LinusH/@Shmuel, you were right: when I looked at the disk space after the job failed, it was all available again.

Attached is an image from my SAS Web Dashboard showing the disk space. It gets really low

when the job gets really active, shortly before the failure.

@LinusH, I just asked the Oracle DB how big the tables actually are on the Oracle side.

Rumour says 200 GB, so I did NOT imagine SAS would eat away all my 500+ GB!

People say SAS is good with big data sets...right?

 

 


LinusH wrote:
"Denormalizing with that many dimensions is bound to be a real disk eater. There's a reason for star schemas. Is there an actual requirement to have all those dimensions available in a single table...?"

@LinusH
It is indeed a requirement from our scientists to have a mega-denormalized analytical base table

with no more IDs but all clear text. That's what I am doing/trying to do. If I fail, our scientists are

forever doomed to do the joins themselves on sub-subsets of that data.

 

 


Shmuel wrote:
"As much as I understand SPDE, you pay in DISK to save time."

@Shmuel, yes, I tried SPDE before.

It did not turn out well; the errors were also about not enough space.

I have been working with decently scaling Oracle DBs for many years, and 150m rows with 35 joins

never seemed crazy to me until...now. I wonder what businesses that have a billion data rows do.


disk_space.png