Desktop productivity for business analysts and programmers

Working on updated data only.

Reply
Frequent Contributor
Posts: 77

Working on updated data only.

Hi everyone,

Actually, I am trying to solve one problem,
I have one dataset which will update daily (automatically).. (Dataset "B").
and my ETL works on that dataset. Now my new requirement is to work on only delta record.


..
requirements:

1) In that dataset one column(i.e. DATEofComplaint) will updated (on which I want to work) and I want only those observations which were updated till yesterday from lasttime ETL Runs.
even if ETL was not run for few days .

2) even if ETL runs two or more times in a day, then also data will get until yesterday ..
and then

3) I am appending output data to permanant Dataset (i.e. "F") so in that duplicate observations are not allowed and I can't use NODUP or NODUPKEY for that permanent dataset because dataset is very large.


.....................................................................................................................

 


I am exporting system date to know on which Day ETL runs last time.

(Suppose this ETL last time RUNS on 28AUG2017 then "lastETLDate" will 27AUG2017 )


STEP 1:


Data sasdata.SystemDate;
format LastETLDate date9.;
LastETLDate=today()-1;
run;

 

..............
Step 2:
then importing that date
(27AUG2017)
..............
Step 3:
Suppose today's date is 01SEP2017

data kk;
A=today()-1;
run;

A=31AUG2017
..............
Step 4:

data QueryFoRB;
set sasdata.B;
where DateofComplaint between LastETLDate and A;
run;

..

Now, what should I do to overcome the problem of appending duplicate records when ETL runs two or more time in day....
because in last step of ETL I made code that will append data to a permanant dataset which comes from Step 4.


for reference, I have attached a rough diagram of SAS EG.

 

program or any suggestion will be appreciated.

Thank you very much.

 

 

 

 

Super Contributor
Posts: 271

Re: Working on updated data only.

Posted in reply to india2016

You need to store date and time in lastETLDate column.  Each would update run lastETLDatetime and  lastETLDatetime should be same for all new records.

 

Next time whenever you are want to run the ETL then find maximum lastETLDatetime.

 

Records updated after maximum lastETLDatetime should be considered.

 

 

Frequent Contributor
Posts: 77

Re: Working on updated data only.

It means that I should create one column in final output with name LastETLRuns. and put a condition on that? which will give me only updated observations.

 

Am I right?

 

 

Super Contributor
Posts: 271

Re: Working on updated data only.

Posted in reply to india2016

Yes.

 

If you intend to run only once in a day then you can live with date value and if you intend to run more than one time in a day then you should go for datetime value to differentiate exections carried out in one single day. 

Super User
Posts: 5,827

Re: Working on updated data only.

Posted in reply to india2016
It's considered best practice to have a timestamp column in each table reflecting the last update operation. Not only when you get a requirement like this.
Data never sleeps
Super User
Posts: 5,827

Re: Working on updated data only.

Posted in reply to india2016
If you only have daily inserts, consider to have a daily pile table separate from the total table. Accessing this would be considerably cheaper than querying the whole history.
Data never sleeps
Ask a Question
Discussion stats
  • 5 replies
  • 228 views
  • 2 likes
  • 3 in conversation