BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Crysis85
Obsidian | Level 7

Hello, I'm designing a Data Mart solution on SAS. I have a couple of design question. The main fact table is a list of tickets. By business request each update to one of this ticket overwrites the last one instead of adding a row. This is the first problem, I'm afraid that an update of such a large table is more expensive than simply adding a row (we're talking of 100M distinct tickets).

Another problem is that one field of this table consist of a text description of the ticket status. The variable lenght is defined at 2000 and multiplied for 100M rows I'm worry it will generate a table so big it'll be difficult to handle.

How can I manage the text field? Do I keep it in the main fact table or do I treat it like a separated dimension (even if it has the same cardinality as the fact_table?)

I'm thinking that many of the analytical jobs that use this table downstream don't need the text field so keeping it in a different table could improve join elaboration time but I also understand it's bad design.

What's your advice? Thanks

 

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20
Thanks for clarifying.
This looks like a good match for SPDE's strengths.
- The indexing performance is much better. Perfect for these daily updates of less than 1% of the table
- The compression is much better. Perfect for that long field. I typically use compress=binary but try what's best for your data. And use partsize=500g if the table is stored wholly in one location. These options go in the libname statement.
- Whether you need one or two tables depends on the usage as you have pointed out. One is simpler to manage, and avoids a massive index-based join when you need the text. Test how much a single table slows down the updates and the common analytics jobs, and how slow the join is when the text is needed, and decide. Figure-based decisions are best. 🙂

View solution in original post

14 REPLIES 14
ChrisNZ
Tourmaline | Level 20

> This is the first problem, I'm afraid that an update of such a large table is more expensive that simply adding a row (we're talking of 100M distinct tickets).

Yes it is more expensive. It looks like you have no choice though. Even if you append you need to disable the existing entry. How many updates a day on this table?

 

I am unsure I understand the rest of the text. Paragraphs help legibility too.

Crysis85
Obsidian | Level 7

Thanks, I rewrote my question, I hope it's more readable now.

I have about 90k tickets coming from operational systems daily, about 10% of which are updates, but it varies a lot (100k new ticket the other day and only 4k yesterday)

ballardw
Super User

Many of the procedures will ignore variables not specifically listed. Exceptions are a few like Proc Print or Proc Freq that process every variable unless restricted with either data set options or list of variables explicitly to use such on a Var statement in Proc Print or Tables statement in Proc Freq.

 

 

One of the nice things about SAS are the data set options that let you filter variables or records for any other purpose. So you can ignore undesired fields by Dropping them when not wanted.

 

Proc anyproc   data=mybigdataset (drop=longtextvariable); 

 

is pseudo  code for removing the variable if it might be a problem.

Patrick
Opal | Level 21

How I understand things even if dropping the variables you will still have to read the data initially (if a SAS table). It's just not adding the variables to the PDV.

ChrisNZ
Tourmaline | Level 20
Thanks for clarifying.
This looks like a good match for SPDE's strengths.
- The indexing performance is much better. Perfect for these daily updates of less than 1% of the table
- The compression is much better. Perfect for that long field. I typically use compress=binary but try what's best for your data. And use partsize=500g if the table is stored wholly in one location. These options go in the libname statement.
- Whether you need one or two tables depends on the usage as you have pointed out. One is simpler to manage, and avoids a massive index-based join when you need the text. Test how much a single table slows down the updates and the common analytics jobs, and how slow the join is when the text is needed, and decide. Figure-based decisions are best. 🙂
Patrick
Opal | Level 21

Will this be a SAS or a database table? If database - which one?

Are the text descriptions "free form" or is there a limited number of possible descriptions which you could store in a reference table?

 

If you have 10% updates: Is that eventually only driven by a few columns that are often changing? If so then you could consider to store the data in two tables with one table having the frequently changing columns. If stored as SAS tables then both tables could have an identical sort order so they are easily combined (could be a data step view).

Crysis85
Obsidian | Level 7

As a first step I'll  load the updated data from Oracle in a staging area then I'll update the tickets in the main table based on new data so the answer is SAS dataset. The description is free form I'm afraid.

ChrisNZ
Tourmaline | Level 20

So have you tried to use SPDE? What difference did it make?

ChrisNZ
Tourmaline | Level 20

I see you chose the SPDE reply as the solution. Would you care to share your findings?

Crysis85
Obsidian | Level 7

Yes we're still in design phase, some new requirements are rolling in so It'll be a while before implementation, but I'll definitely try this approach.

It's even possible that the main stakeholder changed his mind about wanting to keep the history of past ticket status. In this eventuality SPDE is no longer a viable option? (Plus we'd have to deal with possibly a 300M rows fact table assuming about 3 updates for ticket)

ChrisNZ
Tourmaline | Level 20
SPDE is often a better option.You should benchmark and pick what works best.
Patrick
Opal | Level 21

How often do you get such updates/inserts? Is that some overnight process or are you considering intraday processing. If it's the later then you might also face some table locking challenges.

Crysis85
Obsidian | Level 7

Definitely overnight process. The process that is up currently (developed by another company) runs for half a day, that's why we are starting from scratch, revisiting requisites and try to optimize the solution.

 

We'll try both engine options and see what works best then. Thanks again

Patrick
Opal | Level 21

Then giving SPDE a shot like @ChrisNZ proposes is definitely worth it.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 14 replies
  • 1695 views
  • 0 likes
  • 4 in conversation