Scott_Mitchell
Quartz | Level 8

Hi Guys,

First I should preface this post with the fact that I am using 9.1.3, so all the relevant constraints of using such an old version apply.

A little bit of history about this task:

Every day, I receive a file containing a snapshot of what all the accounts look like at the end of the day.  Each account can have multiple products, which could be considered sub-accounts.  The extract date in the example below is created by me and is actually just a suffix attached to each of the TXT files that I receive.  There is no date or indicator within the dataset that states that a transaction took place today (so I can't simply subset the data based on date and append it to the history file that I am creating).

What I need to do is create a History file containing any movements that took place compared to the most recent records in the previous day's History file.

I have tried using both the UPDATE statement in a DATA step and a hash object, but for obvious reasons this doesn't work (the obvious reason being that if an account's balance changes from 1000 on Day 1 to 2000 on Day 2 and then back to 1000 on Day 3, we end up updating the record from Day 1).

Could someone please help me formulate a solution for this issue?

Thank you for your help.

DATA DAY1;
INFILE DATALINES;
INPUT ACCNO BALANCE PRODUCT $ EXTRACTDATE : DATE9.;
FORMAT EXTRACTDATE DATE9.;
DATALINES;
1 1000 P1 01OCT2014
1 1000 P2 01OCT2014
2 2000 P1 01OCT2014
2 1000 P2 01OCT2014
3 9000 P1 01OCT2014
4 7000 P1 01OCT2014
5 4000 P1 01OCT2014
6 5000 P1 01OCT2014
7 10000 P1 01OCT2014
;
RUN;

DATA DAY2;
INFILE DATALINES;
INPUT ACCNO BALANCE PRODUCT $ EXTRACTDATE : DATE9.;
FORMAT EXTRACTDATE DATE9.;
DATALINES;
1 10000 P1 02OCT2014
1 3000 P2 02OCT2014
2 4000 P1 02OCT2014
2 5000 P2 02OCT2014
3 6000 P1 02OCT2014
4 7000 P1 02OCT2014
5 8000 P1 02OCT2014
6 9000 P1 02OCT2014
7 10000 P1 02OCT2014
8 15000 P1 02OCT2014
;
RUN;

DATA DAY3;
INFILE DATALINES;
INPUT ACCNO BALANCE PRODUCT $ EXTRACTDATE : DATE9.;
FORMAT EXTRACTDATE DATE9.;
DATALINES;
1 1000 P1 03OCT2014
1 1000 P2 03OCT2014
2 2000 P1 03OCT2014
2 1000 P2 03OCT2014
3 9000 P1 03OCT2014
4 7000 P1 03OCT2014
5 4000 P1 03OCT2014
6 5000 P1 03OCT2014
7 10000 P1 03OCT2014
;
RUN;

DATA DESIRED;
INFILE DATALINES dlm=",";
INPUT ACCNO BALANCE PRODUCT $ EXTRACTDATE : DATE9.;
FORMAT EXTRACTDATE DATE9.;
DATALINES;
1,1000,P1,01OCT2014
1,1000,P2,01OCT2014
1,3000,P2,02OCT2014
1,10000,P1,02OCT2014
2,1000,P2,01OCT2014
2,2000,P1,01OCT2014
2,4000,P1,02OCT2014
2,5000,P2,02OCT2014
3,6000,P1,02OCT2014
3,9000,P1,01OCT2014
4,7000,P1,01OCT2014
5,4000,P1,01OCT2014
5,8000,P1,02OCT2014
6,5000,P1,01OCT2014
6,9000,P1,02OCT2014
7,10000,P1,01OCT2014
8,15000,P1,02OCT2014
1,1000,P1,03OCT2014
1,1000,P2,03OCT2014
2,2000,P1,03OCT2014
2,1000,P2,03OCT2014
3,9000,P1,03OCT2014
4,7000,P1,03OCT2014
5,4000,P1,03OCT2014
6,5000,P1,03OCT2014
7,10000,P1,03OCT2014
;
RUN;

PROC SORT DATA = DESIRED OUT = DESIRED;
BY ACCNO BALANCE PRODUCT;
RUN;
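
For reference, this is the shape of the UPDATE attempt that fails (a minimal sketch against the DAY1/DAY3 samples above; because ACCNO and PRODUCT form the BY key, the Day 3 balances overwrite the Day 1 rows instead of adding new history rows):

PROC SORT DATA = DAY1 OUT = HISTORY;
BY ACCNO PRODUCT;
RUN;

PROC SORT DATA = DAY3 OUT = DAY3_SRT;
BY ACCNO PRODUCT;
RUN;

/* UPDATE keeps a single row per BY group, so the Day 1 history is lost */
DATA HISTORY;
UPDATE HISTORY DAY3_SRT;
BY ACCNO PRODUCT;
RUN;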

5 REPLIES
LinusH
Tourmaline | Level 20

It's in situations like this that you want an ETL tool, such as SAS DI Studio.  It has this capability out of the box.

Data never sleeps
Scott_Mitchell
Quartz | Level 8

Hi Linus,

Completely agree, but given that we are still on 9.1.3, I don't see that on the horizon any time soon.

I have now developed a solution for this problem.

Once it is complete and tested, I will share my results for completeness.

RW9
Diamond | Level 26

Hi,

Well, I don't have time to write the full thing here, but my idea would be to create 4 tables from the initial data:

ACCOUNTS - ACCNO, EXTRACTDATE

ACCOUNTS_J - ACTION, ACTION_DATE, ACCNO, EXTRACTDATE

PRODUCTS - ACCNO, PRODUCT, BALANCE

PRODUCTS_J - ACTION, ACTION_DATE, ACCNO, PRODUCT, BALANCE

Initially, set these up from your first day's data.  Then each subsequent day's data would follow these steps:

Account -> It's new:

     insert "New record" row into ACCOUNTS_J

     insert "New record" row into PRODUCTS_J

     insert data into ACCOUNTS

     insert data into PRODUCTS

Account -> It's not new:

     insert "Updated extract date" row into ACCOUNTS_J

     Product -> It's new:

          insert "New record" row into PRODUCTS_J

          insert data into PRODUCTS

          update record in ACCOUNTS with the new DATE

     Product -> It's not new:

          insert "Updated balance" row into PRODUCTS_J

          update record in PRODUCTS with the new BALANCE

          update record in ACCOUNTS with the new DATE

Something like the above.  That way you have two tables with the latest information and journal tables with the history, which you can easily pull out for a log.
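
In PROC SQL it might look something like this (a minimal sketch against the DAY1/DAY2 sample data above; the ACTION wording, the LENGTH= choice, and the final refresh step are just placeholders):

PROC SQL;
/* Seed the current-state and journal tables from the first extract */
CREATE TABLE ACCOUNTS AS
SELECT DISTINCT ACCNO, EXTRACTDATE FROM DAY1;

CREATE TABLE PRODUCTS AS
SELECT ACCNO, PRODUCT, BALANCE FROM DAY1;

CREATE TABLE ACCOUNTS_J AS
SELECT 'New record' AS ACTION LENGTH=24,
       TODAY() AS ACTION_DATE FORMAT=DATE9.,
       ACCNO, EXTRACTDATE
FROM ACCOUNTS;

CREATE TABLE PRODUCTS_J AS
SELECT 'New record' AS ACTION LENGTH=24,
       TODAY() AS ACTION_DATE FORMAT=DATE9.,
       ACCNO, PRODUCT, BALANCE
FROM PRODUCTS;

/* Next day: journal accounts that are brand new */
INSERT INTO ACCOUNTS_J
SELECT DISTINCT 'New record', TODAY(), ACCNO, EXTRACTDATE
FROM DAY2
WHERE ACCNO NOT IN (SELECT ACCNO FROM ACCOUNTS);

/* Journal accounts seen again with a new extract date */
INSERT INTO ACCOUNTS_J
SELECT DISTINCT 'Updated extract date', TODAY(), ACCNO, EXTRACTDATE
FROM DAY2
WHERE ACCNO IN (SELECT ACCNO FROM ACCOUNTS);

/* Journal products not seen before (new account or new product) */
INSERT INTO PRODUCTS_J
SELECT 'New record', TODAY(), D.ACCNO, D.PRODUCT, D.BALANCE
FROM DAY2 D
WHERE NOT EXISTS (SELECT 1 FROM PRODUCTS P
                  WHERE P.ACCNO = D.ACCNO AND P.PRODUCT = D.PRODUCT);

/* Journal products whose balance changed since the last extract */
INSERT INTO PRODUCTS_J
SELECT 'Updated balance', TODAY(), D.ACCNO, D.PRODUCT, D.BALANCE
FROM DAY2 D, PRODUCTS P
WHERE D.ACCNO = P.ACCNO AND D.PRODUCT = P.PRODUCT
  AND D.BALANCE NE P.BALANCE;

/* Refresh the current-state tables to match the latest extract */
CREATE TABLE ACCOUNTS AS SELECT DISTINCT ACCNO, EXTRACTDATE FROM DAY2;
CREATE TABLE PRODUCTS AS SELECT ACCNO, PRODUCT, BALANCE FROM DAY2;
QUIT;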

jakarman
Barite | Level 11

Scott, you are mentioning detecting all movements in a history file.  That triggered Linus, as it is the concept of slowly changing dimensions, which is used for building data marts/DWHs.

See Slowly changing dimension - Wikipedia, the free encyclopedia (Kimball, Inmon).  The DI tool supports those concepts.  With the Data Vault approach from Linstedt, you are expected to record the date each record comes in (see Data Vault Modeling - Wikipedia, the free encyclopedia).  The source it came from, which you already know, could be another field.

So far so good.

Reading the incoming datasets into one big SAS dataset that carries the Data Vault date/source alongside all the other information should not be a big logical problem.

It could even be one DATA step reading them all, unless there are sizing/performance issues.
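
For example, something like this could read every daily TXT file in one pass and derive the extract date from the file-name suffix (a sketch only; the path and the ACCOUNTS_01OCT2014.TXT naming pattern are assumptions):

DATA ALLDAYS;
LENGTH FNAME $ 260;
/* Wildcard INFILE reads every matching daily extract in one DATA step */
INFILE "C:\EXTRACTS\ACCOUNTS_*.TXT" FILENAME=FNAME TRUNCOVER;
INPUT ACCNO BALANCE PRODUCT $;
/* Derive the extract date from the file-name suffix */
EXTRACTDATE = INPUT(SCAN(SCAN(FNAME,-1,'\'),2,'_.'),DATE9.);
FORMAT EXTRACTDATE DATE9.;
DROP FNAME;
RUN;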

Extending the original dataset could be done in a lot of ways:

- Several datasets/tables, one per account, when the number of accounts is limited.

  A joined view of all of them is an additional easy step.

- Using indexes and constraints on a master dataset that gets updated (this may take a lot of coding/selection to get the right records).

- With a limited number of columns to check, one approach could be to add a column for each new date, maybe in a rolling structure.

  Suppose, to be very detailed, you are only required to keep 90 days across 5 columns.

  I prefer the row approach for its flexibility.

It all depends on further details that you know (and we, at this moment, do not).

I do not get this one:

"the obvious reason being that if the accounts changes from 1000 on Day1 to 2000 on Day2 then back to 1000 on Day3, we end up updating the record from Day1".

What do you mean by that?

---->-- ja karman --<-----
Scott_Mitchell
Quartz | Level 8

Good Afternoon All,

I said I would post my solution for completeness, and here it is.

DATA DAY1;
INPUT ACC_NO FIRST_NAME $ LAST_NAME $ DATE1 : DATE9.;
FORMAT DATE1 DATE9.;
DATALINES;
1 A B 01JAN2014
2 A B 01JAN2014
3 A B 01JAN2014
4 A B 01JAN2014
5 A B 01JAN2014
6 A B 01JAN2014
7 A B 01JAN2014
8 A B 01JAN2014
;
RUN;


DATA DAY2;
INPUT ACC_NO FIRST_NAME $ LAST_NAME $ DATE1 : DATE9.;
FORMAT DATE1 DATE9.;
DATALINES;
1 A B 02JAN2014
2 A B 02JAN2014
3 A B 02JAN2014
4 A B 02JAN2014
5 A B 02JAN2014
6 B C 02JAN2014
7 A B 02JAN2014
8 A B 02JAN2014
;
RUN;

DATA _NULL_;
/* Grab variable attributes from DAY1 without reading any observations */
IF 0 THEN SET DAY1;
IF _N_ = 1 THEN DO;
  /* H1 holds the previous day's records, keyed on everything except the
     date, so CHECK() tells us whether a record is new or changed */
  DECLARE HASH H1 (DATASET:"DAY1",ORDERED:"A");
  H1.DEFINEKEY("ACC_NO","FIRST_NAME","LAST_NAME");
  H1.DEFINEDATA("ACC_NO","FIRST_NAME","LAST_NAME");
  H1.DEFINEDONE();

  /* H2 accumulates the history, keyed on all variables including the date */
  DECLARE HASH H2 (DATASET:"DAY1",ORDERED:"A");
  H2.DEFINEKEY("ACC_NO","FIRST_NAME","LAST_NAME","DATE1");
  H2.DEFINEDATA("ACC_NO","FIRST_NAME","LAST_NAME","DATE1");
  H2.DEFINEDONE();
END;

SET DAY2 END = EOF;

/* Not found in the previous day's file means new or changed */
RC = H1.CHECK();

IF RC ~= 0 THEN RC1 = H2.ADD();

/* Write the accumulated history once the last record has been read */
IF EOF THEN H2.OUTPUT(DATASET:"HISTORY");
RUN;

Initially I wasn't thinking laterally enough.  I wanted to compare each day's file against the HISTORY dataset; however, after posting I decided it would make more sense in my situation to compare the DAY1 file to the DAY2 file, then the DAY2 file against DAY3.  I used a hash table because my data is reasonably small once I subset it, and the update adds only 30 observations maximum from day to day.  Of course, this could be replaced with a DATA step UPDATE.
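
To roll this forward day after day, the step could be wrapped in a macro along these lines (a sketch only; the macro name and the HISTORY seeding step are my own additions):

/* Seed the history with the first day's file */
DATA HISTORY;
SET DAY1;
RUN;

%MACRO ROLL_HISTORY(PREV=, CURR=);
DATA _NULL_;
IF 0 THEN SET &PREV;
IF _N_ = 1 THEN DO;
  /* Previous day's records, keyed on everything except the date */
  DECLARE HASH H1 (DATASET:"&PREV",ORDERED:"A");
  H1.DEFINEKEY("ACC_NO","FIRST_NAME","LAST_NAME");
  H1.DEFINEDATA("ACC_NO","FIRST_NAME","LAST_NAME");
  H1.DEFINEDONE();

  /* The history accumulated so far */
  DECLARE HASH H2 (DATASET:"HISTORY",ORDERED:"A");
  H2.DEFINEKEY("ACC_NO","FIRST_NAME","LAST_NAME","DATE1");
  H2.DEFINEDATA("ACC_NO","FIRST_NAME","LAST_NAME","DATE1");
  H2.DEFINEDONE();
END;

SET &CURR END = EOF;

/* Add only records that are new or changed versus the previous day */
RC = H1.CHECK();
IF RC ~= 0 THEN RC1 = H2.ADD();

IF EOF THEN H2.OUTPUT(DATASET:"HISTORY");
RUN;
%MEND ROLL_HISTORY;

%ROLL_HISTORY(PREV=DAY1, CURR=DAY2);
/* then PREV=DAY2, CURR=DAY3 on the next day, and so on */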

Thank you all for your help.
