BookmarkSubscribeRSS Feed
Mayt
Obsidian | Level 7

Hi guys!

I've tried to update and insert new data to castable. But My code takes a lot of time to complete. Do you guys know how to use upsert by any chance? Or how to optimise my code? 

Thanks a lot!!

My code right now:

Data public.A;
Input ID Name$ Height;
cards;
1 A 1
3 B 2
5 C 2
7 D 2
9 E 2
;
run;

Data public.B;
Input ID Name$ Height;
cards;
2 A 2
4 B 3
5 C 4
7 D 5
;
run;
/* select all rows from public.a and flag rows that exist in both public.a, public.b */
proc fedsql sessref=MySession;
create table public.C as 
select x.*,case when y.ID is not null then 1 else 0 end as flag
from public.a x 
left join public.b y
on x.ID = y.ID;
quit;
/* delete rows that exist in both public.a, public.b from public.c*/
proc cas;
table.deleteRows                    
table={caslib='public',
name='c',
where="flag = 1"};
quit;
/* drop column flag*/
proc cas;
table.alterTable / 
	name = 'c', caslib ='public', 
	drop={'flag'};
quit;

/* appending new data to the base */
data public.c (append=yes);
set public.b;
run;

public.a

Mayt_0-1708072715547.png

public.b 

Mayt_1-1708072768474.png

 

public.c

Mayt_2-1708072970952.png

 

 

 

16 REPLIES 16
LinusH
Tourmaline | Level 20

I don't see update in your logic, just insert of new records...?
Not a CAS specific advice: try to reduce number of steps.
Instead of a left join an flag creation, try a where clause with sub-query.

where c.id not in(select distinct id from public.b)

And then and insert/append.

Data never sleeps
sbxkoenk
SAS Super FREQ

Append + de-duplicate??

 

APPEND :

DE-DUPLICATE :
You can use the native deduplication.deduplicate CAS action to remove duplicate rows. 

 

Koen

Mayt
Obsidian | Level 7
But some rows are not duplicate. Like ID 5,7 in public.a and public.b. Its height column has different values. I want ID 5,7 in public.c have the value from public.b.
Mayt
Obsidian | Level 7
I think I need to update. You see ID 5,7 the height value of public.a(old
data) and public.b(new data) are different. I want to store the value from
public.b
Mayt
Obsidian | Level 7

Oh I see your point. But when I tried it, look like fedsql doesn't support in/any/all subquery.

proc fedsql sessref=MySession;
create table public.C as 
select *
from public.a x 
where x.id not in(select distinct id from public.b);
quit;

It thew an error

ERROR: Unsupported operation in FedSQL query: IN/ANY/ALL subquery.

sbxkoenk
SAS Super FREQ

Indeed,

FedSQL for CAS does not support use of non-correlated subqueries with the IN, ANY, and ALL predicates. https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/casfedsql/p1ucns3lrhzhamn1k6fw6vhzhzam.htm.  

 

You can try to do an alternate query.

Koen

Mayt
Obsidian | Level 7
More information : I code in SAS studio of SAS Viya
Patrick
Opal | Level 21

Appreciate that you're sharing code with sample data that we can run to replicate what you describe. 

I believe in your case we need to better understand what you've got in your real environment. 

1. Does the target CAS table public.C preexist in CAS? And what's the volume (number of rows)

2. Do you really have two source tables public.A and public.B? Or is this "just" about how to update/insert into a CAS target table and though you would in reality only have a single source table?

3. Where do your source table(s) reside? In CAS or in Compute? And what are the volumes (number of rows).

4. Once you've updated the in-memory CAS table do you also need to save back the changes to the physical table? And if so: Where is this physical table stored?

5. Which Viya version are you using?

Mayt
Obsidian | Level 7

1. Does the target CAS table public.C preexist in CAS? And what's the volume (number of rows)
- Nope, Actually it's used for store merged data from table A(old data) and table B (new data).

2. Do you really have two source tables public.A and public.B? Or is this "just" about how to update/insert into a CAS target table and though you would in reality only have a single source table?
- I do have two source table. In my real code, I have table A and table D. Then I copy data from table D to Table B to rename columns.

3. Where do your source table(s) reside? In CAS or in Compute? And what are the volumes (number of rows).
- In CAS. table A has 45 million row, table B has 1000-10000 rows.
4. Once you've updated the in-memory CAS table do you also need to save back the changes to the physical table? And if so: Where is this physical table stored?
- In CAS. Actually it's table A.
5. Which Viya version are you using?
- 2022.09 and 2022.1

Patrick
Opal | Level 21

@Mayt Given your answers and that you're creating table C couldn't you just use a data merge step?

%let sessref=MySess;
%if %sysfunc(sessfound(&sessref)) %then
  %do;
    cas mySess terminate;
  %end;
cas &sessref cassessopts=(caslib="casuser");
libname casuser cas;

data casuser.A;
  input ID Name $ Height;
cards;
1 A 1
3 B 2
5 C 2
7 D 2
9 E 2
;
run;

data casuser.B;
  input ID Name $ Height;
cards;
2 A 2
4 B 3
5 C 4
7 D 5
;
data casuser.c;
  merge casuser.a casuser.b;
  by id;
run;

proc print data=casuser.c;
  where id in (1,2,5);
run;

/*cas &sessref terminate;*/

With all tables in CAS the data step will execute in CAS

08  data casuser.c;
109    merge casuser.a casuser.b;
110    by id;
111  run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
NOTE: There were 5 observations read from the table A in caslib CASUSER(<user>).
NOTE: There were 4 observations read from the table B in caslib CASUSER(<user>).
NOTE: The table c in caslib CASUSER(<user>) has 7 observations and 3 variables.

Given the volume of your table A:

  • Not sure how this code will perform. Let us know.
  • Either drop table A after this process from memory or also define it as output of the merge step so the process replaces table A (to save some memory).
    data casuser.a;
      merge casuser.a casuser.b;
      by id;
    run;

N.B: The ID column used for BY processing may-not be defined as VARCHAR(*). Let us know should this be the case.

 

 

Mayt
Obsidian | Level 7

Due to enormous data, I'm afraid that data step might take longer time than fedsql.

Patrick
Opal | Level 21

@Mayt wrote:

Due to enormous data, I'm afraid that data step might take longer time than fedsql.


@Mayt What makes you think that? The data step will execute multithreaded within CAS - as a CAS action. The main challenge is that you create a new in-memory CAS table that's then also no more in-synch with the physical backing table. 

 

It's most of the time better to do data prep before loading the data into CAS and only load the final table server side into CAS. To propose any such approach we would need to understand more about your scenario, where the data comes from, how often you have to update it in CAS and what it gets used for.

 

LinusH
Tourmaline | Level 20

The "problem" here is that CAS is mainly built for  analytics, not for DB operations.

Which we can see in the syntax limitations of PROC FEDSQL, data step and PROC CASL.

3.) So B would have 45'' records after the operation?

Is this repeated, like daily?

What is your pain point, from a user and/or maintenance perspective?

Again (partly) untested, but if you have avialable RAM, you could possible overcome limitations of UPDATE, recreate the table using a FEDSQL join query

proc fedsql sessref=MySession;
create table public.C as 
	select 
		coalesce(x.id,y.id) as id, 
		coalesce(x.name, y.name) as name, 
		coalesce(x.height, y.height) as height
	from public.a x 
	full join public.b y
	on x.ID = y.ID;

drop table public.b;
quit;
/* renaming back to 'B' */
proc casutil...
altertable...
Data never sleeps
Mayt
Obsidian | Level 7

@Patrick @LinusH 

where the data comes from?
- new data comes from API.
It's like new data comes from API --> update to old data --> feed to the model and save data(from 2) as physical file to sas.
how often you have to update it in CAS?
- every 20 minutes
what it gets used for?
- use as input to train model.
What is your pain point, from a user and/or maintenance perspective?
- due to the circle need to repeat every 20 minutes, sometimes the data can't be updated completely before feed to model. So I need to wait next round to complete it. But I want to finish it within each round. That's why I think if there's a way to update data faster, that might be great.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 16 replies
  • 1331 views
  • 0 likes
  • 4 in conversation