🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Crysis85
Obsidian | Level 7

Hello, I'm designing a Data Mart solution on SAS, and I have a couple of design questions. The main fact table is a list of tickets. By business request, each update to one of these tickets overwrites the previous row instead of adding a new one. This is the first problem: I'm afraid that an update of such a large table is more expensive than simply appending a row (we're talking about 100M distinct tickets).

Another problem is that one field of this table consists of a text description of the ticket status. The variable length is defined as 2000, and multiplied across 100M rows I worry it will generate a table so big it'll be difficult to handle.

How should I manage the text field? Do I keep it in the main fact table, or do I treat it like a separate dimension (even though it has the same cardinality as the fact table)?

I'm thinking that many of the analytical jobs that use this table downstream don't need the text field, so keeping it in a separate table could improve join processing time, but I also understand that's questionable design.

What's your advice? Thanks

 

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20
Thanks for clarifying.
This looks like a good match for SPDE's strengths.
- The indexing performance is much better: perfect for these daily updates touching less than 1% of the table.
- The compression is much better: perfect for that long field. I typically use compress=binary, but try what's best for your data. And use partsize=500g if the table is stored wholly in one location. These options go on the libname statement.
- Whether you need one or two tables depends on the usage, as you have pointed out. One table is simpler to manage and avoids a massive index-based join when you need the text. Test how much a single table slows down the updates and the common analytics jobs, how slow the join is when the text is needed, and decide. Figure-based decisions are best. 🙂
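For illustration, a minimal sketch of an SPDE libname carrying the options mentioned above; the paths and library name are hypothetical, and the data/index paths should point at your own storage:

```sas
/* Hypothetical SPDE library: binary compression for the long text
   field, and a large partition size since the table lives wholly
   in one location. Adjust paths to your environment. */
libname dm spde '/data/dm/meta'
   datapath=('/data/dm/data')
   indexpath=('/data/dm/index')
   compress=binary
   partsize=500g;
```

Tables created in this library (e.g. `dm.fact_tickets`) then get SPDE's compression and indexing behavior automatically.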


14 REPLIES
ChrisNZ
Tourmaline | Level 20

> This is the first problem, I'm afraid that an update of such a large table is more expensive than simply adding a row (we're talking of 100M distinct tickets).

Yes, it is more expensive. It looks like you have no choice though: even if you append, you need to disable the existing entry. How many updates a day on this table?

 

I am unsure I understand the rest of the text. Paragraphs help legibility too.

Crysis85
Obsidian | Level 7

Thanks, I rewrote my question; I hope it's more readable now.

I have about 90k tickets coming from operational systems daily, about 10% of which are updates, but it varies a lot (100k new tickets the other day and only 4k yesterday).
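One common DATA step pattern for this kind of daily upsert (overwrite the row if the ticket exists, append if it doesn't) uses MODIFY with KEY=. This is only a sketch under assumptions: the table and variable names are hypothetical, and it requires a simple index named TICKET_ID on the master table.

```sas
/* Hypothetical upsert: WORK.STAGING holds the day's feed,
   DM.FACT_TICKETS is the master with an index on TICKET_ID. */
data dm.fact_tickets;
   set work.staging;                    /* read one transaction row  */
   modify dm.fact_tickets key=ticket_id; /* indexed lookup in master */
   if _iorc_ = 0 then replace;          /* found: overwrite in place */
   else do;
      _error_ = 0;                      /* clear the not-found flag  */
      output;                           /* new ticket: append        */
   end;
run;
```

Because MODIFY rewrites rows in place rather than recreating the table, the daily cost scales with the size of the feed, not with the 100M-row master.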

ballardw
Super User

Many of the procedures will ignore variables not specifically listed. Exceptions are a few like PROC PRINT or PROC FREQ that process every variable unless restricted, either with data set options or with an explicit list of variables, such as on a VAR statement in PROC PRINT or a TABLES statement in PROC FREQ.

One of the nice things about SAS is the data set options that let you filter variables or records for any purpose. So you can ignore undesired fields by dropping them when not wanted.

 

proc anyproc data=mybigdataset (drop=longtextvariable);

is pseudo-code for removing the variable if it might be a problem.
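As a concrete (hypothetical) instance of that pseudo-code, with a table WORK.TICKETS and a long text column DESCRIPTION:

```sas
/* Without a TABLES statement, PROC FREQ tabulates every variable;
   DROP= keeps the 2000-byte DESCRIPTION out of the run entirely. */
proc freq data=work.tickets (drop=description);
run;
```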

Patrick
Opal | Level 21

As I understand it, even when dropping the variables you will still have to read the data initially (if it's a SAS table); the drop just keeps the variables out of the PDV.
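That distinction can be made explicit in a DATA step; the table and variable names below are hypothetical:

```sas
/* DROP= on the SET statement means DESCRIPTION never enters the
   PDV, so downstream logic and the output table are lean. The
   engine may still read the full-width rows from disk, though,
   so this saves memory and output size more than raw I/O. */
data work.tickets_slim;
   set work.tickets (drop=description);
run;
```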

Patrick
Opal | Level 21

Will this be a SAS or a database table? If database - which one?

Are the text descriptions "free form" or is there a limited number of possible descriptions which you could store in a reference table?

 

If you have 10% updates: is that perhaps driven by only a few columns that change often? If so, you could consider storing the data in two tables, with one table holding the frequently changing columns. If stored as SAS tables, both tables could have an identical sort order so they are easily combined (could be a data step view).
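Such a split could be recombined on demand with a data step view; a sketch, assuming hypothetical tables WORK.TICKET_BASE (static columns) and WORK.TICKET_STATUS (frequently changing columns), both sorted by TICKET_ID:

```sas
/* The view costs nothing at definition time; the MERGE only
   executes when a downstream step actually reads TICKETS_V. */
data work.tickets_v / view=work.tickets_v;
   merge work.ticket_base work.ticket_status;
   by ticket_id;
run;
```

Jobs that don't need the volatile columns read WORK.TICKET_BASE directly; jobs that do read the view as if it were one table.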

Crysis85
Obsidian | Level 7

As a first step I'll load the updated data from Oracle into a staging area, then I'll update the tickets in the main table based on the new data, so the answer is: a SAS dataset. The description is free form, I'm afraid.

ChrisNZ
Tourmaline | Level 20

So have you tried to use SPDE? What difference did it make?

ChrisNZ
Tourmaline | Level 20

I see you chose the SPDE reply as the solution. Would you care to share your findings?

Crysis85
Obsidian | Level 7

Yes, we're still in the design phase; some new requirements are rolling in, so it'll be a while before implementation, but I'll definitely try this approach.

It's even possible that the main stakeholder has changed his mind and now wants to keep the history of past ticket statuses. In that case, is SPDE no longer a viable option? (Plus we'd have to deal with a possibly 300M-row fact table, assuming about 3 updates per ticket.)

ChrisNZ
Tourmaline | Level 20
SPDE is often a better option. You should benchmark and pick what works best.
Patrick
Opal | Level 21

How often do you get such updates/inserts? Is that some overnight process, or are you considering intraday processing? If it's the latter, then you might also face some table locking challenges.

Crysis85
Obsidian | Level 7

Definitely an overnight process. The process that is currently in place (developed by another company) runs for half a day; that's why we are starting from scratch, revisiting requirements and trying to optimize the solution.

 

We'll try both engine options and see what works best then. Thanks again

Patrick
Opal | Level 21

Then giving SPDE a shot like @ChrisNZ proposes is definitely worth it.

