BookmarkSubscribeRSS Feed
chris16
Calcite | Level 5

Hi,

 

Im trying to create an incremental load process from a source table which contains over 40m rows of data.

At the moment i am utilizing the created date field  to extract and load the latest rows to my final SAS mart but the problem is its not getting the correct rows in, I'll explain.

 

so the   source table fields (Answers_Data table)

ID   Qst_Id    Yes/No    Ver_no  User_id   Created_dt

1     112          1              1             345       01Jan19:00:00:00

1     113          2              1             345       01Jan19:00:00:00

1     113          1              1             456       01Jan19:00:00:00

1     114          1              1             345       01Jan19:00:00:00

1     115          2              1             456       01Jan19:00:00:00

1     112          1              2             345       02Jan19:00:00:00

1     113          2              2             345       02Jan19:00:00:00

1     113          1              2             678       02Jan19:00:00:00

1     114          1              2             345       02Jan19:00:00:00

1     115          2              2             345       02Jan19:00:00:00

 

My goal is to load the latest answers to the final data mart. by using the ver_no field

 

so i first create a subquery to select the latest version answers then to load the data a data step update transformation, with the by variables selected as ID, Qst_Id, User_id

running this query  the first time loads the first version answers data (i.e. ver_no 1)  and then when i run it the following day to load the latest answers (ie. ver_no 2)  it keeps the old record from ver no 1 (highlighted in bold below)

 

the final table ends up looking like this.

ID   Qst_Id    Yes/No    Ver_no  User_id   Created_dt

1     112          1              2             345       02Jan19:00:00:00

1     113          2              2             345       02Jan19:00:00:00

1     113          1              1             456       01Jan19:00:00:00

1     113          1              2             678       02Jan19:00:00:00

1     114          1              2             345       02Jan19:00:00:00

1     115          2              2             345       02Jan19:00:00:00

 

 

what is happening is that we have two user_id;s answering the same question which is fine,  but in the second version of answers a different user_id (678)  is answering the same question and we want to overwrite the old answer by user_id(456) with the new answer by user_id (678)  so that the final table looks like this

 

ID   Qst_Id    Yes/No    Ver_no  User_id   Created_dt

1     112          1              2             345       02Jan19:00:00:00

1     113          2              2             345       02Jan19:00:00:00

1     113          1              2             678       02Jan19:00:00:00

1     114          1              2             345       02Jan19:00:00:00

1     115          2              2             345       02Jan19:00:00:00

 

 

i understand its because it cant find the matching user_id value when using the data step update with the by variables i selected  but how can i load this so that it overwrites the existing values with the new ones.

At the moment the process is doing a full data load which is time consuming and resource intensive hence why im trying to load it incrementally.

 

Any help would be appreciated.

Thanks

 

 

6 REPLIES 6
novinosrin
Tourmaline | Level 20

Sample

Old is verno=1

increment verno=2

 

if my understanding is correct, it's no biggie

 



data old increment;
input (ID   Qst_Id    Yes_No    Ver_no  User_id ) ($)  Created_dt :datetime20.;
format Created_dt datetime20.;
if Ver_no='1' then output old;else output increment;
cards;
1     112          1              1             345       01Jan19:00:00:00
1     113          2              1             345       01Jan19:00:00:00
1     113          1              1             456       01Jan19:00:00:00
1     114          1              1             345       01Jan19:00:00:00
1     115          2              1             456       01Jan19:00:00:00
1     112          1              2             345       02Jan19:00:00:00
1     113          2              2             345       02Jan19:00:00:00
1     113          1              2             678       02Jan19:00:00:00
1     114          1              2             345       02Jan19:00:00:00
1     115          2              2             345       02Jan19:00:00:00
;

data want;
if _n_=1 then do;
   dcl hash H (dataset:'increment',multidata:'y') ;
   h.definekey  ("User_id",'id','Qst_Id') ;
   h.definedata (all:'y') ;
   h.definedone () ;
   dcl hiter hi('h');
end;
set old end=lr;
rc=h.find();
if rc=0 then do;
output;
rc1=h.remove();
end;
if lr ;
do while(hi.next()=0);
output;
end;
drop rc:;
run;

 And then a sort would get the id's in order

 



proc sort data=want;
by id qst_id;
run;

 

 

chris16
Calcite | Level 5
Hi, Thanks for your solution but how would i approach this with multiple version answers. in the data i have seen answers up to ver_no 5 .
Reeza
Super User

what is happening is that we have two user_id;s answering the same question which is fine,  but in the second version of answers a different user_id (678)  is answering the same question and we want to overwrite the old answer by user_id(456) with the new answer by user_id (678)  so that the final table looks like this

 

How do you know this? What fields tell you this record overwrites the user record? Same date/question only? Location of record?

chris16
Calcite | Level 5
Hi, the customer is only interested in the most recent answers so in this case all the ver_no 2 answers. The user_id should overwrite the value in the user_id field same with the qst_id.
When i load this the second time to update the table with the new answers (ver_no 2) i need to overwrite all ver_no 1 answers but i am using a data step update transform with by variables ID, Qst_Id, User_id therefore its trying to match these. So it uploads the new answers (ver_no2) fine but still keeps one of the old ver no 1 answer since it kind find a matching user_id value in the second load
ID Qst_Id Yes/No Ver_no User_id
1 113 1 1 456

hope this make sense.
thanks
mkeintz
PROC Star

I'm not clear on what you want.  It appears you want, for each QST_ID, all records that fall on the latest date.  Is that correct?  (Or maybe you want, for each QST_ID, all records that have the latest version).

 

And is it also true that your data are sorted by date?

 

And does each date possess at least one instance of each QST_ID?

 

Finally, in what order to you want the resulting data presented?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
chris16
Calcite | Level 5
Hi , in a nut shell i just want to load the most recent answers in my final table but rather than doing a full load of everything i want it done incrementally. im using the datetime parameters to determine the most recent data.
To answer your questions, i want for each qst_id the latest answer version attached.
When a user changes the answer to a question it saves the new answer creating a new version (i.e. ver no 2) however even if the other answers haven't changed the system still records them as a new version(ver_no 2).
So overall it records the questions and answers again under the new ver_no. Also the date time is recorded along with the new ver_no for all questions when the user clicks save. as shown in the data.
This data is from an oracle database and is sorted by ver_no
Hope this helps.
Thanks

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 6 replies
  • 1664 views
  • 0 likes
  • 4 in conversation