SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Optimal number of tracked Columns in SCDs

Reply
Learner
Posts: 1

Optimal number of tracked Columns in SCDs

Hi - I currently have a dimension which has ~60,000 rows and 350 variables. The business want to start tracking history and to turn this dimension into a SCD - so I asked them which variables they want to track and they came back with a list of approx 30 variables - my feeling is that this is too many and will have a significant affect on performance but I have nothing to back that up.  Has anyone done any analysis or know if SAS recommend the maximum amount of tracked columns will work efficiently.

I tried to do some searching but have not been able to find anything apart from "The number and length of these columns affects the run-time performance of the job" - so I know it will affect performance but the business want to know what the maximum is so they can work on reducing their list.

 

Any advice would be appreciated.

Thanks

Rachel

Super User
Posts: 5,441

Re: Optimal number of tracked Columns in SCDs

Since the dimension has quite few rows, I don't think that 30 columns is too much, at least from a performance view.
But one can ask why do you have a dimension with 350 cols? What user can grasp this amount of information? For me, it sounds like you have some data modelling tasks comming up.
Data never sleeps
Respected Advisor
Posts: 4,173

Re: Optimal number of tracked Columns in SCDs

It's only 60T rows so I wouldn't be too worried about performance for creating a digest value for thirty type 2 columns.

 

Where I have some concerns: If you're using the OOTB SCD Type 2 loader then the other 59970 columns are type 1 and the code generated will also create a digest value for all these columns. I'm not sure if the SCD Type 2 loader has been implemented to deal nicely with such a vast number of columns. Make sure that you check the log for truncation warnings and the like. Eventually consider to implement this as a custom transformation with your own code.

 

As @LinusH wrote: May be some re-modelling is required.

 

Are your business users aware of the fact that they will have to change all their queries adding a selection on "change_current_ind='Y'" or something similar?

 

Do you know what your business users want this history records for? May-be maintaining a separate history table or a change event table would serve the purpose as well.

Super User
Posts: 5,441

Re: Optimal number of tracked Columns in SCDs

Elaborating a bit more.

Business stakeholders are seldom interested in history records, especially when it comes to labels/names.

Sounds like you don't have a detail data store - which is the natural place to keep a complete history record.

Then you can more easily design and create data marts (dimensions and facts) that business requires at the moment. This will hopefully lead to smaller and more easy to use data patterns (and not so much historization, and fewer columns).

 

Btw, by not track changes as Type 2, it does not necessary mean classifying them as type 1, it just means that they are only updated when any type 2 columns for the row has new value.

Data never sleeps
Ask a Question
Discussion stats
  • 3 replies
  • 257 views
  • 0 likes
  • 3 in conversation