Hello, I'm designing a Data Mart solution on SAS. I have a couple of design question. The main fact table is a list of tickets. By business request each update to one of this ticket overwrites the last one instead of adding a row. This is the first problem, I'm afraid that an update of such a large table is more expensive than simply adding a row (we're talking of 100M distinct tickets).
Another problem is that one field of this table consist of a text description of the ticket status. The variable lenght is defined at 2000 and multiplied for 100M rows I'm worry it will generate a table so big it'll be difficult to handle.
How can I manage the text field? Do I keep it in the main fact table or do I treat it like a separated dimension (even if it has the same cardinality as the fact_table?)
I'm thinking that many of the analytical jobs that use this table downstream don't need the text field so keeping it in a different table could improve join elaboration time but I also understand it's bad design.
What's your advice? Thanks
> This is the first problem, I'm afraid that an update of such a large table is more expensive that simply adding a row (we're talking of 100M distinct tickets).
Yes it is more expensive. It looks like you have no choice though. Even if you append you need to disable the existing entry. How many updates a day on this table?
I am unsure I understand the rest of the text. Paragraphs help legibility too.
Thanks, I rewrote my question, I hope it's more readable now.
I have about 90k tickets coming from operational systems daily, about 10% of which are updates, but it varies a lot (100k new ticket the other day and only 4k yesterday)
Many of the procedures will ignore variables not specifically listed. Exceptions are a few like Proc Print or Proc Freq that process every variable unless restricted with either data set options or list of variables explicitly to use such on a Var statement in Proc Print or Tables statement in Proc Freq.
One of the nice things about SAS are the data set options that let you filter variables or records for any other purpose. So you can ignore undesired fields by Dropping them when not wanted.
Proc anyproc data=mybigdataset (drop=longtextvariable);
is pseudo code for removing the variable if it might be a problem.
How I understand things even if dropping the variables you will still have to read the data initially (if a SAS table). It's just not adding the variables to the PDV.
Will this be a SAS or a database table? If database - which one?
Are the text descriptions "free form" or is there a limited number of possible descriptions which you could store in a reference table?
If you have 10% updates: Is that eventually only driven by a few columns that are often changing? If so then you could consider to store the data in two tables with one table having the frequently changing columns. If stored as SAS tables then both tables could have an identical sort order so they are easily combined (could be a data step view).
As a first step I'll load the updated data from Oracle in a staging area then I'll update the tickets in the main table based on new data so the answer is SAS dataset. The description is free form I'm afraid.
So have you tried to use SPDE? What difference did it make?
I see you chose the SPDE reply as the solution. Would you care to share your findings?
Yes we're still in design phase, some new requirements are rolling in so It'll be a while before implementation, but I'll definitely try this approach.
It's even possible that the main stakeholder changed his mind about wanting to keep the history of past ticket status. In this eventuality SPDE is no longer a viable option? (Plus we'd have to deal with possibly a 300M rows fact table assuming about 3 updates for ticket)
How often do you get such updates/inserts? Is that some overnight process or are you considering intraday processing. If it's the later then you might also face some table locking challenges.
Definitely overnight process. The process that is up currently (developed by another company) runs for half a day, that's why we are starting from scratch, revisiting requisites and try to optimize the solution.
We'll try both engine options and see what works best then. Thanks again
Then giving SPDE a shot like @ChrisNZ proposes is definitely worth it.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.