Is there a way to upsert to castable?

Mayt · Posted 02-16-2024 03:49 AM

Hi guys!

I've tried to update and insert new data to castable. But My code takes a lot of time to complete. Do you guys know how to use upsert by any chance? Or how to optimise my code?

Thanks a lot!!

My code right now:

Data public.A;
Input ID Name$ Height;
cards;
1 A 1
3 B 2
5 C 2
7 D 2
9 E 2
;
run;

Data public.B;
Input ID Name$ Height;
cards;
2 A 2
4 B 3
5 C 4
7 D 5
;
run;
/* select all rows from public.a and flag rows that exist in both public.a, public.b */
proc fedsql sessref=MySession;
create table public.C as 
select x.*,case when y.ID is not null then 1 else 0 end as flag
from public.a x 
left join public.b y
on x.ID = y.ID;
quit;
/* delete rows that exist in both public.a, public.b from public.c*/
proc cas;
table.deleteRows                    
table={caslib='public',
name='c',
where="flag = 1"};
quit;
/* drop column flag*/
proc cas;
table.alterTable / 
	name = 'c', caslib ='public', 
	drop={'flag'};
quit;

/* appending new data to the base */
data public.c (append=yes);
set public.b;
run;

public.a

public.b

public.c

LinusH · Posted 02-16-2024 07:25 AM

I don't see update in your logic, just insert of new records...?
Not a CAS specific advice: try to reduce number of steps.
Instead of a left join an flag creation, try a where clause with sub-query.

where c.id not in(select distinct id from public.b)

And then and insert/append.

Data never sleeps

sbxkoenk · Posted 02-16-2024 08:01 AM

Append + de-duplicate??

APPEND :

The SAS® Viya® platform provides multiple methods to append to CAS tables
By Kevin Russell on SAS Users November 6, 2023
https://blogs.sas.com/content/sgf/2023/11/06/sas-viya-append-cas-tables/
How to Emulate PROC APPEND in CAS?
By Steven Sober on SAS Users November 20, 2017
https://blogs.sas.com/content/sgf/2017/11/20/how-to-emulate-proc-append-in-cas/
Concurrent data append and update to a global CAS table
By Uttam Kumar on SAS Users November 15, 2017
https://blogs.sas.com/content/sgf/2017/11/15/concurrent-data-append-and-update-to-a-global-cas-table...

DE-DUPLICATE :
You can use the native deduplication.deduplicate CAS action to remove duplicate rows.

Koen

Mayt · Posted 02-16-2024 09:21 AM

But some rows are not duplicate. Like ID 5,7 in public.a and public.b. Its height column has different values. I want ID 5,7 in public.c have the value from public.b.

Mayt · Posted 02-16-2024 08:47 AM

I think I need to update. You see ID 5,7 the height value of public.a(old
data) and public.b(new data) are different. I want to store the value from
public.b

Mayt · Posted 02-16-2024 08:57 AM

Oh I see your point. But when I tried it, look like fedsql doesn't support in/any/all subquery.

proc fedsql sessref=MySession;
create table public.C as 
select *
from public.a x 
where x.id not in(select distinct id from public.b);
quit;

It thew an error

ERROR: Unsupported operation in FedSQL query: IN/ANY/ALL subquery.

sbxkoenk · Posted 02-16-2024 09:11 AM

Indeed,

FedSQL for CAS does not support use of non-correlated subqueries with the IN, ANY, and ALL predicates. https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/casfedsql/p1ucns3lrhzhamn1k6fw6vhzhzam.htm.

You can try to do an alternate query.

Koen

Mayt · Posted 02-16-2024 09:05 AM

More information : I code in SAS studio of SAS Viya

Patrick · Posted 02-16-2024 11:53 AM

Appreciate that you're sharing code with sample data that we can run to replicate what you describe.

I believe in your case we need to better understand what you've got in your real environment.

1. Does the target CAS table public.C preexist in CAS? And what's the volume (number of rows)

2. Do you really have two source tables public.A and public.B? Or is this "just" about how to update/insert into a CAS target table and though you would in reality only have a single source table?

3. Where do your source table(s) reside? In CAS or in Compute? And what are the volumes (number of rows).

4. Once you've updated the in-memory CAS table do you also need to save back the changes to the physical table? And if so: Where is this physical table stored?

5. Which Viya version are you using?

Mayt · Posted 02-18-2024 10:54 PM

1. Does the target CAS table public.C preexist in CAS? And what's the volume (number of rows)
- Nope, Actually it's used for store merged data from table A(old data) and table B (new data).

2. Do you really have two source tables public.A and public.B? Or is this "just" about how to update/insert into a CAS target table and though you would in reality only have a single source table?
- I do have two source table. In my real code, I have table A and table D. Then I copy data from table D to Table B to rename columns.

3. Where do your source table(s) reside? In CAS or in Compute? And what are the volumes (number of rows).
- In CAS. table A has 45 million row, table B has 1000-10000 rows.
4. Once you've updated the in-memory CAS table do you also need to save back the changes to the physical table? And if so: Where is this physical table stored?
- In CAS. Actually it's table A.
5. Which Viya version are you using?
- 2022.09 and 2022.1

Patrick · Posted 02-19-2024 12:25 AM

@Mayt Given your answers and that you're creating table C couldn't you just use a data merge step?

%let sessref=MySess;
%if %sysfunc(sessfound(&sessref)) %then
  %do;
    cas mySess terminate;
  %end;
cas &sessref cassessopts=(caslib="casuser");
libname casuser cas;

data casuser.A;
  input ID Name $ Height;
cards;
1 A 1
3 B 2
5 C 2
7 D 2
9 E 2
;
run;

data casuser.B;
  input ID Name $ Height;
cards;
2 A 2
4 B 3
5 C 4
7 D 5
;
data casuser.c;
  merge casuser.a casuser.b;
  by id;
run;

proc print data=casuser.c;
  where id in (1,2,5);
run;

/*cas &sessref terminate;*/

With all tables in CAS the data step will execute in CAS

08  data casuser.c;
109    merge casuser.a casuser.b;
110    by id;
111  run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
NOTE: There were 5 observations read from the table A in caslib CASUSER(<user>).
NOTE: There were 4 observations read from the table B in caslib CASUSER(<user>).
NOTE: The table c in caslib CASUSER(<user>) has 7 observations and 3 variables.

Given the volume of your table A:

Not sure how this code will perform. Let us know.
Either drop table A after this process from memory or also define it as output of the merge step so the process replaces table A (to save some memory).
```
data casuser.a;
  merge casuser.a casuser.b;
  by id;
run;
```

N.B: The ID column used for BY processing may-not be defined as VARCHAR(*). Let us know should this be the case.

Mayt · Posted 02-19-2024 02:34 AM

Due to enormous data, I'm afraid that data step might take longer time than fedsql.

Patrick · Posted 02-19-2024 08:19 AM

@Mayt wrote:

Due to enormous data, I'm afraid that data step might take longer time than fedsql.

@Mayt What makes you think that? The data step will execute multithreaded within CAS - as a CAS action. The main challenge is that you create a new in-memory CAS table that's then also no more in-synch with the physical backing table.

It's most of the time better to do data prep before loading the data into CAS and only load the final table server side into CAS. To propose any such approach we would need to understand more about your scenario, where the data comes from, how often you have to update it in CAS and what it gets used for.

LinusH · Posted 02-19-2024 07:49 AM

The "problem" here is that CAS is mainly built for analytics, not for DB operations.

Which we can see in the syntax limitations of PROC FEDSQL, data step and PROC CASL.

3.) So B would have 45'' records after the operation?

Is this repeated, like daily?

What is your pain point, from a user and/or maintenance perspective?

Again (partly) untested, but if you have avialable RAM, you could possible overcome limitations of UPDATE, recreate the table using a FEDSQL join query

proc fedsql sessref=MySession;
create table public.C as 
	select 
		coalesce(x.id,y.id) as id, 
		coalesce(x.name, y.name) as name, 
		coalesce(x.height, y.height) as height
	from public.a x 
	full join public.b y
	on x.ID = y.ID;

drop table public.b;
quit;
/* renaming back to 'B' */
proc casutil...
altertable...

Data never sleeps

Mayt · Posted 02-20-2024 03:39 AM

@Patrick @LinusH

where the data comes from?
- new data comes from API.
It's like new data comes from API --> update to old data --> feed to the model and save data(from 2) as physical file to sas.
how often you have to update it in CAS?
- every 20 minutes
what it gets used for?
- use as input to train model.
What is your pain point, from a user and/or maintenance perspective?
- due to the circle need to repeat every 20 minutes, sometimes the data can't be updated completely before feed to model. So I need to wait next round to complete it. But I want to finish it within each round. That's why I think if there's a way to update data faster, that might be great.

Is there a way to upsert to castable?

Re: Is there a way to upsert to castable?

Re: Is there a way to upsert to castable?

Re: Is there a way to upsert to castable?

Re: Is there a way to upsert to castable?

Re: Is there a way to upsert to castable?

Re: Is there a way to upsert to castable?

Re: Is there a way to upsert to castable?

Re: Is there a way to upsert to castable?

Re: Is there a way to upsert to castable?

Re: Is there a way to upsert to castable?

Re: Is there a way to upsert to castable?

Re: Is there a way to upsert to castable?

Re: Is there a way to upsert to castable?

Re: Is there a way to upsert to castable?

Registration is open

Registration is open

SAS Training: Just a Click Away