Hello,
I have 5 tables that contain around 5 millions rows.
My program read and extract all rows (on tuesday, on wednerday, thursday, friday and saturday)
I would like to optimize by doing a delta loading instead of full loading. I would like to extract only new rows since the last extraction and then append this new records in a target table
What is the "best" method ?
1°) should I compare source table and target tables to identify wew records ?
2°) should I save (each day) the extraction date in a track table. and then reading this date to extract news record of the source table ?
3°) should I read the max value of column "update_date" of the source table.
thanks in advance for your help / advice
kind regards
Nass
So you only need
proc sql noprint;
select max(load_date) into :max_load
from target;
quit;
data target_new;
set
target
source (where=(load_date gt &max_load))
;
run;
Why do you need to check the date at all? Is there a chance you might load the same date twice?
To read the latest date, and provided new data is always APPENDed, just read the last record.
set TABLE nobs=NOBS point=NOBS;
Thanks Chris for your help.
yes, by checking the date, if I have to re launch the program (in the day) , I Always extract only the new data. the new data should be inserted only once in the target even if the program is launch twice or more.
Also, I am sorru but i don't understand this "set TABLE nobs=NOBS point=NOBS;"
I think I understood, you advice me to get only last record by using this (for example)
data last;
/* get number of records (N) */
if 0 then
set sashelp.class nobs=nobs end=eof;
/* use POINT= to get nth record */
set sashelp.class point=nobs;
output;
stop;
run;
You haven't really described your loading process completely. Are you doing only row additions or are you also doing row updates and deletions? Also do you want your load process to be repeatable, that is remove all rows from a particular load then load them again?
Yes exactly. My code does the same in one line.
Maxim 2 says: Read the Log.
The log is the #1 resource for detecting problems in SAS code. The log, the whole log, and nothing but the log.
So please copy/paste the whole log of the step that resulted in a problem into a window opened with this button:
1 The SAS System Monday, October 11, 2021 11:12:00 AM 1 ;*';*";*/;quit;run; 2 OPTIONS PAGENO=MIN; 3 %LET _CLIENTTASKLABEL='DTM_SIMU_VDELTA'; 4 %LET _CLIENTPROCESSFLOWNAME='Process Flow'; 5 %LET _CLIENTPROJECTPATH=''; 6 %LET _CLIENTPROJECTPATHHOST=''; 7 %LET _CLIENTPROJECTNAME=''; 8 %LET _SASPROGRAMFILE='/home/ldap/rollens/sbsas/ROLLENS/CRC/DTM_SIMU_VDELTA.sas'; 9 %LET _SASPROGRAMFILEHOST='isasakj4.compagny.fr'; 10 11 ODS _ALL_ CLOSE; 12 OPTIONS DEV=PNG; 13 GOPTIONS XPIXELS=0 YPIXELS=0; 14 FILENAME EGSR TEMP; 15 ODS tagsets.sasreport13(ID=EGSR) FILE=EGSR 16 STYLE=HTMLBlue 17 STYLESHEET=(URL="file:///D:/appli/SAS/SASHome/SASEnterpriseGuide/7.1/Styles/HTMLBlue.css") 18 NOGTITLE 19 NOGFOOTNOTE 20 GPATH=&sasworklocation 21 ENCODING=UTF8 22 options(rolap="on") 23 ; NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR 24 25 GOPTIONS ACCESSIBLE; 26 set TABLE nobs=NOBS point=NOBS; ___ 180 ERROR 180-322: Statement is not valid or it is used out of proper order. 27 28 29 GOPTIONS NOACCESSIBLE; 30 %LET _CLIENTTASKLABEL=; 31 %LET _CLIENTPROCESSFLOWNAME=; 32 %LET _CLIENTPROJECTPATH=; 33 %LET _CLIENTPROJECTPATHHOST=; 34 %LET _CLIENTPROJECTNAME=; 35 %LET _SASPROGRAMFILE=; 36 %LET _SASPROGRAMFILEHOST=; 37 38 ;*';*";*/;quit;run; 39 ODS _ALL_ CLOSE; 40 41 42 QUIT; RUN; 43
Hello Kurt
I put the whole log
thanks
SET is a DATA step statement and must be part of a data step.
yes, but when I try this data step bellow. I encounter an error "out of memory" and the result is very big "t_last table" table.
data t_last ;
set sashelp.class nobs=NOBS point=NOBS;
run;
I know that this code bellow works well==>
data last ;
/* get number of records (N) */
if 0 then
set sashelp.class nobs=p_nobs end=eof;
/* use POINT= to get nth record */
set sashelp.class point=p_nobs;
output;
stop;
run;
Chris said "Yes exactly. My code does the same in one line." sorry, but I Don't know how.
Thanks
When you use SET in a data step without the POINT= option, the incoming data will be read sequentially, and as soon as the SET statement (which is also a READ under the hood) tries to read past the end, the data step will automatically terminate gracefully.
But with the POINT= option, there is no sequential read, and the SET can happily read the observation again and again and again (and write out an observation for each iteration) until the resulting dataset cracks storage limits.
This why you always need a means to control how many observations are processed, either by using SET in a do loop with a termination condition, or using the STOP statement once you've achieved your goal.
Since a STOP also prevents the implicit write, you need to add OUTPUT:
data t_last ;
set sashelp.class nobs=NOBS point=NOBS;
output;
stop;
run;
thanks Kurt for the explanations. I have no more errors.
but the last obs I get returns the variables "date_update" "and" date_load "with the values ". "
And I don't understanbd this result because by getting min and max of this 2 variables with proc sql, result is
"02JUN2014 / 09OCT2021 / 02JUN2014 / 08OCT2021"
so why the step with "point=..." returns "." ? and does not return a "09OCT2021" and "08OCT2021"
Since we still do not know what we are dealing with, you must now provide examples for the existing dataset, the "update" dataset, and how the final dataset from that should look like.
For the two source datasets, use data steps with datalines (do not skip this), so we can easily recreate them for developing/testing in or environment.
you are right. i give you a sample of data
data source ; length project_name $ 9; input num_client project_name $ project_amount date_load date_update ; informat date_load date_update ddmmyy10.; format date_load date_update ddmmyy10.; datalines; 280001 home 15000 01/04/2019 20/04/2019 280002 car 3000 01/05/2019 . 280003 personnal 2250 30/06/2019 . 280004 home 25000 31/07/2020 01/08/2020 280005 car 12000 21/01/2021 . 280006 personnal 1500 20/02/2021 . 280007 home 75500 12/04/2021 12/05/2021 280008 car 42000 27/05/2021 . 280009 personnal 2000 05/09/2021 10/10/2021 280010 home 2000 10/10/2021 . ; run;
in the target table, amount of the client 28009 is 1000 (and not 2000 yet). and the client 20010 does not exist yet
I would like to identify client 28009 the be updated ans client 20010 to be inserted.
I Don't know the best way to do that.
there is no dimensional (star schema) with fact and dimension tables with keys)
this table is "alone"
thanks
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.