For each answer I get more questions, like what does the model to with the data it gets fed by every 20 minutes?
What kind of API calls are there? Sounds more like a stream, but those typically doesn't have updates.
Updates typically belongs to a relational database platform.
So, it sounds like a complex use case, and I think you need someone at your site to guide you to a functioning design for this.
@Mayt In below sample code table A is your big table with 45 Million rows and table B is your transaction table with 1T to 10T rows used to upsert table A.
There is unfortunately not table action merge.
The available table actions append, update and deleteRows have a parameter whereTable that allows to base processing on values of another table.
But even with using the whereTable parameter I couldn't find a way to define the append only for IDs in table B that don't already exist in table A and I couldn't think of a performant approach to work out the updates and inserts upfront.
For this reason the approach I've chosen where I believe it should reasonably perform given the volumes of table B:
1. Delete all rows in table A with an id that also exists in table B
2. Append (insert) all rows from table B to table A
With databases a delete operation is much slower than an update or an insert. Not sure how that is with CAS - but then table B is not that big. I'm really curious how below code performs.
If you're going to use this code with your real data then please let us know on how many nodes (threads) it executes and how long it takes.
%let sessref=MySess;
%if %sysfunc(sessfound(&sessref)) %then
  %do;
    cas mySess terminate;
  %end;
cas &sessref cassessopts=(caslib="casuser");
libname casuser cas;
data casuser.A;
  input ID Name $ Height;
cards;
1 A 1
3 B 2
5 C 2
7 D 2
9 E 2
;
run;
data casuser.B (duplicate=yes);
  input ID Name $ Height;
cards;
2 A 2
4 B 3
5 C 4
7 D 5
;
proc cas;
  /* delete all rows in table A with an ID that also exists in table B */
  table.deleteRows /
    table={
      caslib='casuser'
      name='A',
      whereTable={
        caslib='casuser',
        name='B',
        vars={{name="ID"}}
        }
      };
  /* append (insert) all rows from table B to table A */
  table.append /
    source={caslib='casuser', name='B'},
    target={caslib='casuser', name='A'}
    ;
  /* drop table B */
  table.dropTable /
    caslib='casuser'
    ,name='B'
    ;
quit;
proc print data=casuser.A;
run;
/*cas &sessref terminate;*/
I'm creating table B with all rows on all nodes to avoid the need for data movement via cas_disk_cash during the delete and append operations. This is for performance reasons and given the volume of table B shouldn't impact memory consumption to much.
data casuser.B (duplicate=yes);
Not yet covered but something you need to address
I don't save the cas in-memory table back to the physical table in the backing store and though this table gets out of sync.
If you "lose" your CAS in-memory table for example as part of a CAS restart then you need somewhere the physical data for it. If this data doesn't already exist then you need also a process to upsert your physical table in the backing store.
Addendum
Like @LinusH I was asking myself why you need to re-train your model every 20 minutes against all these 45M rows. But then that's not my home turf and when Googling I found that there appear to be cases where that's the right thing to do. ...but it's certainly worth for you to consider if there could be other ways to keep your model sufficiently up-to-date without having to re-process all the data every 20 minutes.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
