Hello,
We have a table that's updated daily. It's updated by means of SCD2 without 'close out records'. This means that when a row is deleted from the source table, it remains as-is in the target table.
Now we'd like to add a new column named DELETED_IN_SOURCE_FLG. It should be set to '1' if the row no longer exists in the source table.
Example: A source table named SRC_EMAIL, and a target table named EDS_EMAIL, with the primary key EMAIL_ID.
What would be the most efficient way to say "if an EMAIL_ID exists in EDS_EMAIL, but no longer exists in SRC_EMAIL, then set DELETED_IN_SOURCE_FLG = 1"?
The only way I can think of at the moment sounds a bit cumbersome. Select all distinct EMAIL_ID from the source table, and from the target table, then join them and flag the ones that don't have a match, and write that to the target table. Is there a better way?
This should identify the primary_keys involved if I understand.
proc sql; select email_id from EDS_email except select email_id from SRC_email ; run;
I almost never write SAS base, but here's what I've got so far. That gives me the EPOST_ID which exists in the target but no longer in the source. Is there a way to optimize it or improve the standard of the code?
/* ESTABLISH DATASETS*/ data SRC; set &_input; run; data EDS; set EDS.EMAIL; run; /*SELECT DISTINCT VALUES*/ proc sort data = SRC nodupkey; by EMAIL_ID; run; proc sort data = EDS nodupkey; by EMAIL_ID; run; /*DETERMINE EPOST_ID TO BE DELETED*/ proc sql; create table &_output as select EDS.EMAIL_ID from EDS left join SRC on EDS.EMAIL_ID = SRC.EMAIL_ID where SRC.EMAIL_ID is missing; quit;
You don't need to pre-sort data when using SQL.
Below how you would do this normally when using SQL (not only in SAS)
data source;
set sashelp.class;
if _n_=2 or _n_=5 then delete;
run;
data target;
set sashelp.class;
run;
proc sql;
create table deleted as
select *
from target t
where not exists
(select * from source s where s.name=t.name)
;
quit;
You shouldn't need a deleted flag column in a proper SCD2 table. You simply expire the record.
From what you write (the 'close out records' bit) I assume that you're using DI Studio. So here IF you're doing a full load then just start using the 'close out records', if you're doing a delta load then you also must somewhere have/derive the information which columns got deleted since the last SCD2 load. If so then run a post process which uses a SQL update and expire all records which got deleted in source.
Not using a deleted flag column but just expiring records then also allows to have the SCD2 loader do all the work when it comes to re-opening records (=records which had been expired but which now re-appear in the source table)
Thanks for the feedback @Patrick. You're correct that I'm using DI Studio. However we don't want to use 'close out rows', because that would set another flag (CURRENT_FLG) to '0' for those rows. For reporting purposes, those rows are still considered current. One of the analysts wants us to mark the rows which no longer exists in the source table, but it shouldn't interfere with our current reporting of all rows that have CURRENT_FLG = 1.
Are you sure? I was in the believe the current_ind column (or however you name it) is always set to 1 for the most current record whether expired or not.
Hmmm... when I'm choosing "close out records not in source table" then DIS generates the following code bit:
/* Close out any records that are in the xref table but not in the source. */
else if inXref and not inSort then
do;
ETLS_CLSDATE = DATETIME();
output work.etls_close;
end;
There is nothing in it that changes the change current indicator so it should remain 'Y' for expired records. Are you really sure that your DIS job behaves differently? If so then please share the generated code.
The purpose of using "close out records" is to close the row though, i.e. to set it as no longer being current.
Our table which uses "close out records" doesn't contain any closed rows with CURRENT_IND=1.
The DI Studio help file confirms the behaviour:
You're right. I've just tested it. The change current ind gets always set to 0 or N if a record gets expired. That's disappointing because that's this way simply redundant information.
So what you could do is implement a post process which updates the change current indicator to 1 / Y for the latest change end date per business key.
This should identify the primary_keys involved if I understand.
proc sql; select email_id from EDS_email except select email_id from SRC_email ; run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.