We have a process where clients update a file (say, sales details) when needed and notify us about the change. Our job is to load those updated records into a SQL DB table while maintaining history.
For the initial load we simply read the file via a DATA step INFILE statement and did a PROC APPEND to the SQL table. However, this method cannot track history in SQL when the file is updated.
I'm not aware of SCDs either, and we are not using any primary/foreign keys in SQL. I would like to understand how to tackle this situation so that history is preserved. I tried PROC SQL UPDATE and 'if first. and last.' logic, but neither helped. Appreciate your help here.
Assume I've a file like this.
Id | Name | Sales | Transaction | Example Values | Act_Ind | Valid_From | Valid_To |
---|---|---|---|---|---|---|---|
1 | Unit | True | 6 | 58001 | Y | 19000101T000000 | 99991231T235959 |
2 | Key | True | 6 | 121216 | Y | 19000101T000000 | 99991231T235959 |
3 | Value | True | 18 | 820595,2 | Y | 19000101T000000 | 99991231T235959 |
Assume I've received an update to the file like this. If you notice, the Transaction value for Key has changed from 6 to 8.
Id | Name | Sales | Transaction | Example Values | Act_Ind | Valid_From | Valid_To |
---|---|---|---|---|---|---|---|
2 | Key | True | 8 | 121216 | Y | 20200404T000000 | 99991231T235959 |
So now I want the SQL table to look like this.
Id | Name | Sales | Transaction | Example Values | Act_Ind | Valid_From | Valid_To |
---|---|---|---|---|---|---|---|
1 | Unit | True | 6 | 58001 | Y | 19000101T000000 | 99991231T235959 |
2 | Key | True | 6 | 121216 | N | 19000101T000000 | 20200403T000000 |
2 | Key | True | 8 | 121216 | Y | 20200404T000000 | 99991231T235959 |
3 | Value | True | 18 | 820595,2 | Y | 19000101T000000 | 99991231T235959 |
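What is being described here is essentially a Type 2 slowly changing dimension (SCD2): when an incoming record differs for an existing Id, the active row is flagged inactive (Act_Ind = 'N') and end-dated, and the update is appended as the new active row. As a minimal, language-agnostic sketch of that logic in plain Python (column names come from the tables above; the function name and the day-before end-dating are illustrative, chosen to match the desired output):

```python
from datetime import datetime, timedelta

def apply_scd2(history, update):
    """Apply one updated record to a list of history rows (dicts).
    Rows share the key 'Id'; the currently active row has Act_Ind 'Y'."""
    for row in history:
        if row["Id"] == update["Id"] and row["Act_Ind"] == "Y":
            # Close out the old version: flag it inactive and end its
            # validity one day before the update's Valid_From.
            vf = datetime.strptime(update["Valid_From"], "%Y%m%dT%H%M%S")
            row["Act_Ind"] = "N"
            row["Valid_To"] = (vf - timedelta(days=1)).strftime("%Y%m%dT%H%M%S")
    history.append(update)  # the update becomes the new active row
    return history

history = [
    {"Id": "2", "Name": "Key", "Transaction": "6", "Act_Ind": "Y",
     "Valid_From": "19000101T000000", "Valid_To": "99991231T235959"},
]
update = {"Id": "2", "Name": "Key", "Transaction": "8", "Act_Ind": "Y",
          "Valid_From": "20200404T000000", "Valid_To": "99991231T235959"}
apply_scd2(history, update)
# history[0] now has Act_Ind 'N' and Valid_To '20200403T000000'
```

In table terms this is one UPDATE of the currently active row plus one INSERT of the new version.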
When column attributes that affect storage (length and/or type) change, you have to rewrite the whole table. In SAS and in a DBMS. And a column will have the same length throughout the whole table, period.
I suggest you set up a change management process where changes in structure need to be approved by organization and planning, the department boss, and the IT boss. Just to make people think twice (or at least once, in many cases) before causing such a change.
What is the question?
Are you asking how to use the new record to calculate from the existing record:
2 Key char 6 121216 Y 19000101T000000 99991231T235959
to the new records:
2 Key char 6 121216 N 19000101T000000 20200403T000000
2 Key char 8 121216 Y 20200404T000000 99991231T235959
?
PS: Why doesn't the end timestamp get set to 20200403T235959?
Or how to update the table once you know what changes to make?
Looks to me like a simple way would be one of :
- delete the existing record and insert the two new records.
- change the status and end timestamp for the one existing record and insert one new record.
@David_Billa - This is a little off-topic but how are you dealing with a numeric column that could have a maximum of 18 digits when SAS numeric columns can only hold around 15 digits accurately?
Your use case example doesn't make sense to me. It is impossible to change an attribute of a column (its length) for a single row. As @Kurt_Bremser has already stated, it has to change for all rows. So I see no need to create two rows for Id 2 when no values in the row change at all between the two. It would be different if the value changed from 121216 to, say, 12121689, but if that happened you would have to modify the table schema and reload all of the data to accommodate the wider column.
OK, I seem to have misinterpreted your dataset structure.
Please have a look at this:
data have;
infile datalines dlm='09'x dsd truncover;
input ID $ Name $ Sales $ Transaction :$2. Example_Values :$20. Act_Ind $ Valid_From :b8601dt16. Valid_To :b8601dt16.;
format Valid_From Valid_To b8601dt16.;
datalines;
1 Unit True 6 58001 Y 19000101T000000 99991231T235959
2 Key True 6 121216 Y 19000101T000000 99991231T235959
3 Value True 18 820595,2 Y 19000101T000000 99991231T235959
;
data upd;
infile datalines dlm='09'x dsd truncover;
input ID $ Name $ Sales $ Transaction :$2. Example_Values :$20. Act_Ind $ Valid_From :b8601dt16. Valid_To :b8601dt16.;
format Valid_From Valid_To b8601dt16.;
datalines;
2 Key True 8 121216 Y 20200404T000000 99991231T235959
;
/* flag update observations that differ from the base table */
proc compare
base=have
compare=upd
out=comp
outdif
outnoequal
noprint
;
by id;
run;
/* keep only the genuinely changed observations from the update dataset */
data realupd;
merge
upd (in=a)
comp (in=b keep=id)
;
by id;
if b;
run;
/* interleave base and changed observations; flag the added ones */
data new;
set
have
realupd (in=n)
;
by id;
if n
then new = 1;
else new = 0;
run;
/* look ahead one observation via a self-merge with firstobs=2:
   if the next observation is a new version of the same id,
   close out the current one a day before the new valid_from */
data want;
merge
new
new (firstobs=2 keep=id new valid_from rename=(id=_id valid_from=_vfrom new=_new))
;
if id = _id and _new
then valid_to = intnx('dtday',_vfrom,-1);
drop new _:;
run;
@Kurt_Bremser Can you provide some comments to explain what those steps are doing? Can you explain why it works?
Good idea 😉
First, I run a proc compare to identify observations that are really new. Then I use the resulting dataset to extract those really new observations from the update dataset.
Then I do an interleave (a set of multiple datasets with a by), and keep a new variable that tells me if I have a new observation added.
In the final step, I do a look-ahead for the valid_from column and the new column. If I detect that the next observation is a new one for the current id, I overwrite the valid_to timestamp with a value 1 day before the valid_from of the added observation.
This step also makes sure that a new id is added without further action.
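As a cross-check of that first step, here is a rough Python analogue (the helper name is hypothetical; unlike PROC COMPARE with BY id, this sketch also passes through keys that are entirely missing from the base):

```python
def really_new(base, upd, key="Id"):
    """Keep update rows that are absent from base or that differ from the
    matching base row in any non-key column (a rough analogue of
    PROC COMPARE with the OUTNOEQUAL option)."""
    by_key = {r[key]: r for r in base}
    changed = []
    for r in upd:
        old = by_key.get(r[key])
        if old is None or any(r[c] != old.get(c) for c in r if c != key):
            changed.append(r)
    return changed

base = [{"Id": "1", "Transaction": "6"},
        {"Id": "2", "Transaction": "6"},
        {"Id": "3", "Transaction": "18"}]
upd = [{"Id": "2", "Transaction": "8"},   # changed: kept
       {"Id": "3", "Transaction": "18"},  # identical: dropped
       {"Id": "4", "Transaction": "1"}]   # brand-new key: kept
print([r["Id"] for r in really_new(base, upd)])  # prints ['2', '4']
```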
@Kurt_Bremser Thanks for your code; it works when existing data is updated, but it does not seem to work when new records are inserted. Also, I don't want to restrict the BY clause to id only, so I tried _all_ as shown below, but I get no records in the output after executing PROC COMPARE.
I feel some slight changes to the code are needed to handle all situations (new records inserted, records deleted, or existing records updated), but I'm unable to figure them out.
data have;
set lib.sql_tbl;
run;
proc sort data=have;
by _all_;
run;
data upd;
set have1;
run;
proc sort data=upd;
by _all_;
run;
proc compare
base=have
compare=upd
out=comp
outdif
outnoequal
noprint
;
by _all_;
run;
data realupd;
merge
upd (in=a)
comp (in=b drop=_type_ _obs_)
;
by _all_;
if b;
run;
Log of proc compare:
26   proc compare
27   base=have
28   compare=upd
29   out=comp
30   outdif
31   outnoequal
32   noprint
33   ;
34   by _all_;
35   run;
NOTE: Except for the 15 BY variables, the data sets WORK.HAVE and WORK.UPD have no variables in common.
      There are no matching variables to compare. Comparisons of data values not performed.
NOTE: There were 99 observations read from the data set WORK.HAVE.
NOTE: There were 99 observations read from the data set WORK.UPD.
NOTE: The data set WORK.COMP has 0 observations and 17 variables.
This is going nowhere.
Please supply example data in data steps with datalines (THIS IS AN ABSOLUTE REQUIREMENT!), and what you want to get out of it.
Right now I have no idea any longer what your data looks like.
Hint: using _all_ in a by statement makes no sense in most cases.
This is exactly what the UPDATE statement is for.
This works:
data MASTER;
input (Id Name Sales Transaction Example Act_Ind Valid_From Valid_To) (:$20.);
cards;
1 Unit True 6 58001 Y 19900101T000000 99991231T235959
2 Key True 6 121216 Y 19910101T000000 99991231T235959
3 Value True 18 820595,2 Y 19920101T000000 99991231T235959
run;
data TRANS;
input (Id Name Sales Transaction Example Act_Ind Valid_From Valid_To) (:$20.);
cards;
2 Key True 8 121216 Y 20200404T000000 99991231T235959
run;
data MASTER;
modify MASTER
TRANS(rename=(
NAME=N SALES=S TRANSACTION=T EXAMPLE=E ACT_IND=A VALID_FROM=VF VALID_TO=VT
));
by ID;
select (_IORC_);
when (%SYSRC(_SOK)) do; %* Transaction matches master;
Valid_To=VF;Act_Ind='N';
replace;
NAME=N; SALES=S; TRANSACTION=T; EXAMPLE=E; ACT_IND=A; VALID_FROM=VF; VALID_TO=VT;
output;
end;
when (%SYSRC(_DSENMR)) do; %* New Transaction ;
output;
_error_=0;
end;
when (%SYSRC(_DSEMTR)) do; %* New Transaction, not first key ;
put 'ERR' 'OR: Duplicate Values on transaction dataset';
stop;
end;
otherwise do;
put 'ERR' 'OR: Unknown IO ';
stop;
end;
end;
run;
Id | Name | Sales | Transaction | Example | Act_Ind | Valid_From | Valid_To |
---|---|---|---|---|---|---|---|
1 | Unit | True | 6 | 58001 | Y | 19900101T000000 | 99991231T235959 |
2 | Key | True | 6 | 121216 | N | 19910101T000000 | 20200404T000000 |
3 | Value | True | 18 | 820595,2 | Y | 19920101T000000 | 99991231T235959 |
2 | Key | True | 8 | 121216 | Y | 20200404T000000 | 99991231T235959 |
Note that you did not provide metadata in the form of a data step for the existing tables, so I just used what was easier.
If the table is so large that sorting it is not an option, a different method that uses an index can be coded to access observations directly.
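Outside SAS, the MODIFY/_IORC_ dispatch above can be sketched with a key index for direct access (plain Python; the helper name and dict layout are illustrative). Note it reproduces the code's choice of Valid_To = VF rather than a day-before value:

```python
def modify_with_transactions(master, trans, key="Id"):
    """Sketch of the MODIFY/_IORC_ dispatch: on a key match (_SOK) the
    master row is closed out and the transaction appended as the new
    version; on no match (_DSENMR) the transaction is appended as a
    brand-new record; a duplicate transaction key is treated as an error."""
    index = {row[key]: row for row in master}   # direct access, like an index
    seen = set()
    for t in trans:
        if t[key] in seen:
            raise ValueError("Duplicate values on transaction dataset")
        seen.add(t[key])
        old = index.get(t[key])
        if old is not None:                     # _SOK: close out the old row
            old["Act_Ind"] = "N"
            old["Valid_To"] = t["Valid_From"]   # as coded above: no day-before
        master.append(dict(t))                  # new version / new record
    return master

master = [{"Id": "2", "Transaction": "6", "Act_Ind": "Y",
           "Valid_From": "19000101T000000", "Valid_To": "99991231T235959"}]
trans = [{"Id": "2", "Transaction": "8", "Act_Ind": "Y",
          "Valid_From": "20200404T000000", "Valid_To": "99991231T235959"}]
modify_with_transactions(master, trans)
```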
A good resource: https://www.lexjansen.com/pnwsug/2006/PN01CurtisMackModify.pdf