09-02-2016 03:38 AM
Dear SAS Experts,
I am struggling to build a DWH fact table. We can't use our Oracle DB; it is too weak to do the joining.
I pulled all the transactional data about trials and trial results into the SAS environment, about 36 GB worth of data.
I also pulled all of the roughly 35 dimensions; each has about 10,000 rows and is only a few MB in size.
The error I get is the following:
ERROR: There is not enough WORK disk space to store the results of an internal sorting phase.
Using Google I figured out that I should set SORTSIZE and MEMSIZE to MAX. That did not help at all:
the same error appeared, and I lost 450 GB worth of space on my UNIX machine in 30 minutes.
So that was a no-go either, and I have no clue what to do next. In the past I built a few DWHs in Oracle, and the
data volume I have does not seem like a crazy lot to me. We have a brand-new server with 4 CPUs and 45 GB of memory, but I can't spend
450 GB on generated data coming from 36 GB of source data. We set MEMSIZE back to default, restarted the server, and are back to square one.
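For reference, the MAX settings were applied roughly as below; MEMSIZE is a start-up-only option (config file or command line), while SORTSIZE can also be changed at run time:

/* at invocation (UNIX), or the equivalent lines in sasv9.cfg:   */
/*   sas -memsize MAX -sortsize MAX myjob.sas                    */
options sortsize=MAX;            /* run-time change of the sort buffer size     */
proc options group=memory; run;  /* verify which values are actually in effect  */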
proc options group=memory; run;
proc options option=utilloc; run;
proc options option=threads; run;
proc options option=workperms; run;
proc options option=memsize; run;
proc options option=fullstimer; run;

proc sql;
CREATE TABLE sasdwhb.fact AS
SELECT
   /*...*/
   /*...*/
   /* for simplicity I removed the selection of columns */
   /* Joins on dimensions */
   /* --------------------- trials, about 60m rows --------------------- */
FROM SASDWHB.TRIAL_ALL STA
LEFT JOIN SASDWH_M.M...      T11  ON STA.MD..._ID = T11.M...._ID
LEFT JOIN SASDWH_M....       T22  ON T11.LI...._ID = T22.L...._ID
LEFT JOIN SASDWH_M.K001      K122 ON STA.AK = K122.ID
LEFT JOIN SASDWH_M....       K1   ON K122.K.... = K1.A....
LEFT JOIN SASDWH_M.K002      K2   ON STA.OM = K2.ID
LEFT JOIN SASDWH_M.K003      K31  ON STA.MK = K31.ID
LEFT JOIN SASDWH_M.K003      K32  ON STA.MK_ = K32.ID
LEFT JOIN SASDWH_M.K004      K4   ON STA.PA = K4.ID
LEFT JOIN SASDWH_M.K005      K5   ON STA.APK = K5.ID
LEFT JOIN SASDWH_M.K006      K6   ON STA.ZA = K6.ID
LEFT JOIN SASDWH_M.K007      K71  ON STA.BE1 = K71.ID
LEFT JOIN SASDWH_M.K007      K72  ON STA.BE2 = K72.ID
LEFT JOIN SASDWH_M.K007      K73  ON STA.BE3 = K73.ID
LEFT JOIN SASDWH_M.K008      K8   ON STA.BA = K8.ID
LEFT JOIN SASDWH_M.BE....    K881 ON SUBSTR(K8.K....,1,2) = K881.BE....
LEFT JOIN SASDWH_M.K009      K91  ON STA.PG = K91.ID
LEFT JOIN SASDWH_M.K009      K92  ON STA.HG = K92.ID
LEFT JOIN SASDWH_M.BU...     K111 ON SUBSTR(K92.KO....,1,2) = K111.BL
LEFT JOIN SASDWH_M.K010      K10  ON STA.HS = K10.ID
LEFT JOIN SASDWH_M.STA..     K112 ON SUBSTR(K10.KO....,1,3) = put(K112.ST....,3.)
LEFT JOIN SASDWH_M.K011      K11  ON STA.NAH = K11.ID
LEFT JOIN SASDWH_M.K012      K12  ON STA.VA = K12.ID
LEFT JOIN SASDWH_M.K013      K13  ON STA.VP = K13.ID
LEFT JOIN SASDWH_M.K014      K14  ON STA.BT = K14.ID
   /* --------------------- trial results, about 90m rows --------------------- */
INNER JOIN SASDWHB.ERG       ERG  ON ERG.MDSA...._ID = STA.MDS...._ID
LEFT JOIN SASDWH_M.K015      K15  ON ERG.PO = K15.ID
LEFT JOIN SASDWH_M.K016      K161 ON ERG.PM = K161.ID
LEFT JOIN SASDWH_M.parameter K888 ON SUBSTR(K161.KO...,1,2) = K888.PA..._OG
LEFT JOIN SASDWH_M.K016      K162 ON ERG.BZP = K162.ID
LEFT JOIN SASDWH_M.K017      K171 ON ERG.MEH = K171.ID
LEFT JOIN SASDWH_M.K017      K172 ON ERG.MEH_ = K172.ID
LEFT JOIN SASDWH_M.K018      K181 ON ERG.BME = K181.ID
LEFT JOIN SASDWH_M.K018      K182 ON ERG.BME_ = K182.ID
LEFT JOIN SASDWH_M.K019      K191 ON ERG.ME__K = K191.ID
LEFT JOIN SASDWH_M.K020      K20  ON ERG.BW = K20.ID
LEFT JOIN SASDWH_M.K021      K21  ON ERG.MS = K21.ID
LEFT JOIN SASDWH_M.K022      K22  ON ERG.VB = K22.ID
;
QUIT;
09-02-2016 04:36 AM
A few ideas to try:
1) Use the option COMPRESS=YES, defined in the configuration file (see the sketch after this list).
You'll pay in CPU time but, depending on your data, you may save roughly 30% of disk space.
2) Are you loading all the tables at the same time? By changing the order in which the ETL (extract-transform-load) jobs run,
you may need less work space.
You need to know how many jobs run in parallel at maximum; you may be able to reduce that number.
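A minimal sketch of turning compression on, either for the whole session or only for the permanent library (the library path is a placeholder):

options compress=yes;                          /* session-wide: new data sets, incl. WORK files, get compressed */
libname sasdwhb '/sasdata/dwhb' compress=yes;  /* or only for the permanent DWH library                         */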
09-02-2016 05:13 AM
"... building a DWH fact table"
"... about 35 dimensions"
I'm a bit puzzled. The normal process is not to build a fact table with data from dimensions. Dimensional data is joined at query time, ideally with some nice filtering and aggregation.
Are you in fact building a large denormalized analytical base table or something similar?
Does the joining just fetch the dimensional keys (can't tell since you left the select clause out)?
If yes, you can look at the way Data Integration Studio generates code for this type of operation: it uses a data step and loads the dimension tables (business key/surrogate key pairs) into hash objects. The look-up is then done inside that data step.
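A minimal sketch of that pattern against one of your dimensions, assuming K001 carries a text column called DESCR to be looked up for the fact column AK (the real column names will differ):

data work.fact_with_k001;
   /* load the dimension into an in-memory hash object once, on the first iteration */
   if _n_ = 1 then do;
      if 0 then set sasdwh_m.k001(keep=id descr);       /* declare ID and DESCR in the PDV */
      declare hash dim(dataset:'sasdwh_m.k001(keep=id descr)');
      dim.defineKey('id');
      dim.defineData('descr');
      dim.defineDone();
   end;
   set sasdwhb.trial_all;                    /* read the big transaction table sequentially    */
   id = ak;                                  /* map the foreign key onto the hash key variable */
   if dim.find() ne 0 then call missing(descr);   /* emulate a LEFT JOIN: keep row, blank text */
   drop id;
run;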
09-05-2016 05:22 AM - edited 09-05-2016 06:04 AM
OK, my SAS DI Studio lookup job is running. Let's see how it performs.
09-06-2016 05:50 AM
I added a screenshot of my DI Studio job.
The job runs for one hour and then fails with the error that there is not enough space in SASDWHB.FACT.DATA
and that SASDWHB.FACT.DATA is damaged. There are still over 500 GB of free space on the server.
I wonder if some UNIX file size restrictions are in place.
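If XCMD is allowed on the server, the shell limits (including the maximum file size) can be dumped into the SAS log like this:

filename lim pipe 'ulimit -a';   /* per-process limits of the shell that started SAS */
data _null_;
   infile lim;
   input;
   put _infile_;
run;
filename lim clear;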
09-06-2016 07:37 AM
The disk space you see after the failure is the net disk space left after the SAS session has exited;
in order to do the job, SAS needs more space for its work files.
Probably there really is not enough space to both do the work and save the results.
Can you assign more temporary disk space to the job?
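For example, WORK and the sort utility files can be pointed at a bigger filesystem when SAS starts; /bigdisk is only a placeholder mount:

/* in the config file (e.g. sasv9.cfg) or on the command line at start-up: */
/*   sas -work /bigdisk/saswork -utilloc /bigdisk/sasutil myjob.sas        */
proc options option=work;    run;   /* confirm where WORK currently lives        */
proc options option=utilloc; run;   /* confirm where the sort utility files go   */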
09-06-2016 07:41 AM
As much as I understand SPDE, you pay in disk space to save time.
SPDE needs more disk space in order to maintain data and indexes split into many sub-files.
I have the feeling that SPDE is not the solution.
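For reference, an SPDE library is defined roughly like this (the paths are placeholders), which shows why it spreads files over several locations:

libname spdedwh spde '/sasdata/spde/meta'        /* metadata component files            */
        datapath=('/disk1/spde' '/disk2/spde')   /* data partitions spread across disks */
        indexpath=('/disk3/spde')                /* index component files               */
        compress=yes;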
09-06-2016 08:22 AM - edited 09-06-2016 08:23 AM
@LinusH/Shmuel, you were right: when I looked at the disk space after the job failed, it was all available again.
Attached is an image from my SAS Web Dashboard showing the disk space. It gets really low
when the job gets really active, shortly before the failure.
@LinusH, I just asked on the Oracle side how big the tables actually are.
Rumour says 200 GB, so I did NOT imagine SAS would eat up all of my 500+ GB!
People say SAS is good with big data sets... right?
"Denormalizing using that many dimensions should be a real disk eater. There's a reason for star schemas. Is there's actual requirement to have all those dimensions available in a single table...? "
It is indeed a requirement from our scientists to have one big denormalized analytics base table
with no IDs any more, only clear text. That's what I am trying to build. If I fail, our scientists are forever
doomed to do the joins themselves on sub-subsets of that data.
"As much as I understand SPDE you pay by DISK to save time."
@Shmuel Yes, I tried SPDE before;
it did not turn out well. The errors were also related to not having enough space.
I have been working with decently scaling Oracle DBs for many years, and 150m rows with 35 joins
never seemed crazy to me until... now. I wonder what businesses do that have a billion data rows.