PhilipH
Quartz | Level 8

@Patrick

Yes, indeed I forgot the CREATE TABLE. Thanks for that.

I have used hash tables and DI Studio as well to create the two big tables
in the first place (those two big tables are made of 30 joins), but looking
through a "small" subset of those two tables I cannot tell developers to create
hash objects in between just to do the join. It is just bad usability and
they won't accept it.

I would use an alias, but this is just a tiny test of what is about to hit our SAS server
when 20+ developers start working with those two big tables.


@rogerjdeangelis Thank you very much for your example!
Great work! I can imagine it is trouble coming
up with 110m rows of example data. That is why this remains
a theoretical discussion about concepts and server settings.

14m is the data for 2015. That is a total of 14m rows that are then counted (distinct) and
aggregated to maybe 400 rows - I can't tell for sure because I have not yet
produced the result 🙂

Of course. I actually only need 11 variables.

year, office, country, office_name, car, car_name, residue, residuename,
count(distinct STAMM.sta_id), count(distinct erg_id), count(distinct cartypeid)

One STA_ID might have 1 to n ERG_IDs, and one STA_ID might have 1 to n CARTYPE_IDs.

The names of offices, cars, and residues are hard to predict; pick anything.
There are no exact duplicates: ERG_ID is a GUID, and I do a count distinct on STA_ID and CARTYPE_ID.
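
For reference, a minimal PROC SQL sketch of that aggregation, assuming the 30-join result already sits in a single table WORK.BIG (the table name and the output name are assumptions; the columns come from the list above):

proc sql;
   /* 8 grouping variables plus 3 distinct counts = the 11 variables */
   create table want as
   select year, office, country, office_name,
          car, car_name, residue, residuename,
          count(distinct sta_id)    as n_sta,
          count(distinct erg_id)    as n_erg,
          count(distinct cartypeid) as n_cartype
   from work.big
   group by year, office, country, office_name,
            car, car_name, residue, residuename;
quit;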

Patrick
Opal | Level 21

@PhilipH

I obviously can't know what you're dealing with. 20+ developers sounds like a very big SAS project; or are these 20 power users consuming some sort of analytical base table(s) for modelling and reporting purposes?

 

I'd also ask why you have to develop with full data volumes; this often increases development time substantially. You certainly want full volumes as early in the test cycle as possible, right after unit testing (ST), or at least "representative" volumes, in order to detect performance issues as early as possible.

 

If your table is that central, important and huge, then I'd keep a performance-optimized focus, which always requires some extra coding - and honestly I wouldn't care too much about some developers' preferences unless their opinion helps to better meet the requirements.

 

"those two big tables are made of 30 join"

Hopefully not in a single big SQL. The SAS SQL compiler can get "confused" in such situations and performance can become really bad. You normally don't want to join more than 3 to 4 tables at once.
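
A minimal sketch of what I mean, splitting the work into stages via intermediate tables (all table and column names here are hypothetical):

proc sql;
   /* stage 1: enrich the base table with office attributes */
   create table work.stage1 as
   select a.*, b.office_name
   from work.base as a
   left join work.offices as b
      on a.office = b.office;

   /* stage 2: add car attributes to the stage 1 result */
   create table work.stage2 as
   select s.*, c.car_name
   from work.stage1 as s
   left join work.cars as c
      on s.car = c.car;
quit;

/* ...and so on, a few tables per step, rather than one 30-table SQL */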

 

"It is just bad usability and they wont accept it."

I disagree with you and can only hope that at least some of your developers are at a senior SAS coding level and that they are not all database developers who got some SAS upskilling; otherwise you're in for a lot of trouble.

 

BTW: should your usage of the big table mainly be in yearly slices, then consider storing the data in separate yearly SAS files. You can always create a view over these files for access to the whole data at once. As already proposed, using SPDE could also help, especially if the main way of accessing the data will be via SQL code. A sketch of both ideas follows.
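
Both ideas in a minimal sketch, assuming yearly member names and hypothetical paths:

/* one SAS file per year, plus a view exposing them as a single table */
data bigdata.all_years / view=bigdata.all_years;
   set bigdata.big_2013 bigdata.big_2014 bigdata.big_2015;
run;

/* or an SPDE library, which partitions the data files and can
   process partitions and indexes in parallel */
libname spdelib spde '/data/meta' datapath=('/data/d1' '/data/d2')
        partsize=512m;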


Ksharp
Super User
Creating indexes on those tables might make it faster.
And adding some options before the SQL might make it faster too.

options bufno=100 bufsize=128k cpucount=10;
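
And maybe indexes like these, if the join/group keys are sta_id and cartypeid (the table name is just a guess):

proc sql;
   create index sta_id on work.big(sta_id);
   create index cartypeid on work.big(cartypeid);
quit;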


rogerjdeangelis
Barite | Level 11

As a side note, and this may not help.

 

Count distinct is much faster, and can be both multi-threaded and multi-tasked, if you use a group by.

Even if you do not want the distincts grouped, you can drop the grouping on a second count distinct.

You can even use a group by on a numeric primary key using the mod function.

This often reduces the skewness (each task or thread processes roughly the same number of records).
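
A rough sketch of the mod trick (the bucket count and all names are made up; note my correction below - the bucketing key must be the variable you are counting):

proc sql;
   /* each distinct sta_id falls into exactly one bucket, so the
      per-bucket distinct counts can simply be summed */
   create table buckets as
   select mod(sta_id, 8)         as bucket,
          count(distinct sta_id) as n_sta
   from work.big
   group by calculated bucket;

   select sum(n_sta) as n_sta_total
   from buckets;
quit;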

 

This is how Oracle and Teradata process big data. With Perl running SAS in batch you should be able to assign a core to each task. I use an inexpensive Dell T7400 8-core workstation (circa 2008) with 64GB and dual RAID 0 arrays to outperform most servers on small datasets, i.e. small data occurs when no single permanent or temporary table exceeds 1TB.
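
You do not even need Perl for that; SAS itself can launch and wait for batch jobs (a rough sketch; the paths are hypothetical and XCMD must be allowed):

/* fire off one batch SAS job per bucket, each on its own core */
systask command "sas -sysin /jobs/bucket1.sas" taskname=t1;
systask command "sas -sysin /jobs/bucket2.sas" taskname=t2;

/* block until both jobs have finished */
waitfor _all_ t1 t2;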

 

SAS workstation (SAS calls it PC-SAS) allows up to 16 cores and 32GB and includes SPDE (data partitioning and parallel processing of multiple indexes).

 

My definition of Big Data is when any single table exceeds 1TB; Big Computation is when more than 16 cores are needed.

 

WPS, the SAS clone, does not restrict workstation cores or memory, I believe.

 

Use servers for petabytes and large simulations.

 

rogerjdeangelis
Barite | Level 11
Disregard the comment about the second count distinct. My fingers were ahead of my brain. The grouping variable only works if the grouping variable is somehow part of the variable you want a count distinct of.

e.g. 'NY-20906-PLAN-163' and 'VT-05301-PLAN-188'. Here the state is part of the key, so you can group by state, take the count distinct of the keys within each state, and then sum the per-state counts.
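
Something like this (the key parsing and all names are made up):

proc sql;
   /* the state prefix partitions the keys, so per-state distinct
      counts can be summed to get the overall distinct count */
   create table per_state as
   select substr(plan_key, 1, 2)   as state,
          count(distinct plan_key) as n_keys
   from work.plans
   group by calculated state;

   select sum(n_keys) as n_keys_total
   from per_state;
quit;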
