Hi guys!
I've tried to update and insert new data to castable. But My code takes a lot of time to complete. Do you guys know how to use upsert by any chance? Or how to optimise my code?
Thanks a lot!!
My code right now:
Data public.A;
Input ID Name$ Height;
cards;
1 A 1
3 B 2
5 C 2
7 D 2
9 E 2
;
run;
Data public.B;
Input ID Name$ Height;
cards;
2 A 2
4 B 3
5 C 4
7 D 5
;
run;
/* select all rows from public.a and flag rows that exist in both public.a, public.b */
proc fedsql sessref=MySession;
create table public.C as
select x.*,case when y.ID is not null then 1 else 0 end as flag
from public.a x
left join public.b y
on x.ID = y.ID;
quit;
/* delete rows that exist in both public.a, public.b from public.c*/
proc cas;
table.deleteRows
table={caslib='public',
name='c',
where="flag = 1"};
quit;
/* drop column flag*/
proc cas;
table.alterTable /
name = 'c', caslib ='public',
drop={'flag'};
quit;
/* appending new data to the base */
data public.c (append=yes);
set public.b;
run;
public.a
public.b
public.c
I don't see update in your logic, just insert of new records...?
Not a CAS specific advice: try to reduce number of steps.
Instead of a left join an flag creation, try a where clause with sub-query.
where c.id not in(select distinct id from public.b)
And then and insert/append.
Append + de-duplicate??
APPEND :
DE-DUPLICATE :
You can use the native deduplication.deduplicate CAS action to remove duplicate rows.
Koen
Oh I see your point. But when I tried it, look like fedsql doesn't support in/any/all subquery.
proc fedsql sessref=MySession;
create table public.C as
select *
from public.a x
where x.id not in(select distinct id from public.b);
quit;
It thew an error
ERROR: Unsupported operation in FedSQL query: IN/ANY/ALL subquery.
Indeed,
FedSQL for CAS does not support use of non-correlated subqueries with the IN, ANY, and ALL predicates. https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/casfedsql/p1ucns3lrhzhamn1k6fw6vhzhzam.htm.
You can try to do an alternate query.
Koen
Appreciate that you're sharing code with sample data that we can run to replicate what you describe.
I believe in your case we need to better understand what you've got in your real environment.
1. Does the target CAS table public.C preexist in CAS? And what's the volume (number of rows)
2. Do you really have two source tables public.A and public.B? Or is this "just" about how to update/insert into a CAS target table and though you would in reality only have a single source table?
3. Where do your source table(s) reside? In CAS or in Compute? And what are the volumes (number of rows).
4. Once you've updated the in-memory CAS table do you also need to save back the changes to the physical table? And if so: Where is this physical table stored?
5. Which Viya version are you using?
1. Does the target CAS table public.C preexist in CAS? And what's the volume (number of rows)
- Nope, Actually it's used for store merged data from table A(old data) and table B (new data).
2. Do you really have two source tables public.A and public.B? Or is this "just" about how to update/insert into a CAS target table and though you would in reality only have a single source table?
- I do have two source table. In my real code, I have table A and table D. Then I copy data from table D to Table B to rename columns.
3. Where do your source table(s) reside? In CAS or in Compute? And what are the volumes (number of rows).
- In CAS. table A has 45 million row, table B has 1000-10000 rows.
4. Once you've updated the in-memory CAS table do you also need to save back the changes to the physical table? And if so: Where is this physical table stored?
- In CAS. Actually it's table A.
5. Which Viya version are you using?
- 2022.09 and 2022.1
@Mayt Given your answers and that you're creating table C couldn't you just use a data merge step?
%let sessref=MySess;
%if %sysfunc(sessfound(&sessref)) %then
%do;
cas mySess terminate;
%end;
cas &sessref cassessopts=(caslib="casuser");
libname casuser cas;
data casuser.A;
input ID Name $ Height;
cards;
1 A 1
3 B 2
5 C 2
7 D 2
9 E 2
;
run;
data casuser.B;
input ID Name $ Height;
cards;
2 A 2
4 B 3
5 C 4
7 D 5
;
data casuser.c;
merge casuser.a casuser.b;
by id;
run;
proc print data=casuser.c;
where id in (1,2,5);
run;
/*cas &sessref terminate;*/
With all tables in CAS the data step will execute in CAS
08 data casuser.c; 109 merge casuser.a casuser.b; 110 by id; 111 run; NOTE: Running DATA step in Cloud Analytic Services. NOTE: The DATA step will run in multiple threads. NOTE: There were 5 observations read from the table A in caslib CASUSER(<user>). NOTE: There were 4 observations read from the table B in caslib CASUSER(<user>). NOTE: The table c in caslib CASUSER(<user>) has 7 observations and 3 variables.
Given the volume of your table A:
data casuser.a;
merge casuser.a casuser.b;
by id;
run;
N.B: The ID column used for BY processing may-not be defined as VARCHAR(*). Let us know should this be the case.
Due to enormous data, I'm afraid that data step might take longer time than fedsql.
@Mayt wrote:
Due to enormous data, I'm afraid that data step might take longer time than fedsql.
@Mayt What makes you think that? The data step will execute multithreaded within CAS - as a CAS action. The main challenge is that you create a new in-memory CAS table that's then also no more in-synch with the physical backing table.
It's most of the time better to do data prep before loading the data into CAS and only load the final table server side into CAS. To propose any such approach we would need to understand more about your scenario, where the data comes from, how often you have to update it in CAS and what it gets used for.
The "problem" here is that CAS is mainly built for analytics, not for DB operations.
Which we can see in the syntax limitations of PROC FEDSQL, data step and PROC CASL.
3.) So B would have 45'' records after the operation?
Is this repeated, like daily?
What is your pain point, from a user and/or maintenance perspective?
Again (partly) untested, but if you have avialable RAM, you could possible overcome limitations of UPDATE, recreate the table using a FEDSQL join query
proc fedsql sessref=MySession;
create table public.C as
select
coalesce(x.id,y.id) as id,
coalesce(x.name, y.name) as name,
coalesce(x.height, y.height) as height
from public.a x
full join public.b y
on x.ID = y.ID;
drop table public.b;
quit;
/* renaming back to 'B' */
proc casutil...
altertable...
where the data comes from?
- new data comes from API.
It's like new data comes from API --> update to old data --> feed to the model and save data(from 2) as physical file to sas.
how often you have to update it in CAS?
- every 20 minutes
what it gets used for?
- use as input to train model.
What is your pain point, from a user and/or maintenance perspective?
- due to the circle need to repeat every 20 minutes, sometimes the data can't be updated completely before feed to model. So I need to wait next round to complete it. But I want to finish it within each round. That's why I think if there's a way to update data faster, that might be great.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.