DATA Step, Macro, Functions and more

re-loading

Reply
Super Contributor
Posts: 673

re-loading

I will be getting data from 50 vendors every month.and I wont get them all at once.I load the data to master table.
It is difficult to keep track of the data that has been loaded.
Say for instance I loaded Caremark data day before yesterday and if I'm loading other data from different vendors today.Is there a way where even if I accidentally run the caremark script and yet not load (append) to master table if it is already present.

Key variables are "provider" and "file_date".
Super Contributor
Posts: 474

Re: re-loading

Yes off course.

- Match by key (implies reading both table, unless you use an index on the master table).

- Append the data, then eliminate duplicated rows by key.

- Keep track of the loaded keys in a smaller table, and check the table before appending data.

Latest being the less resource demanding, you need to make sure that every process who updates the master table, will insert the key in the tracking table.

Cheers from Portugal.

Daniel Santos @ www.cgd.pt
Respected Advisor
Posts: 4,173

Re: re-loading

Posted in reply to DanielSantos
Just to add another option to Daniel's list:
- use a control table where you keep track of which source file has already been loaded
N/A
Posts: 0

Re: re-loading

You want to check the documentation on the update statement, which was designed for this situation.

In the case you consider, if you run your script more than once, you will update the master data set to the same values it has already. Net result: no change.

The basic syntax is;

data master; /*or a new data set name*/
update master
transaction; by keyvars;

The master data set should be unique in the keyvars. The transaction data set can have multiple transactions.

See the documentation for the gory details.

Jonathan
Valued Guide
Posts: 2,177

Re: re-loading

append does provide the fastest solution, but seems to offer no protection against the scenario you describe (repeating the append), except for an almost hidden feature.
Put a unique index on your "master" dataset, and host it on the base SAS9 SPDE engine library. Then you can use the option UNIQSAVE=, to append with respect for the index.
See the example in the documentation at http://support.sas.com/documentation/cdl/en/engspde/61887/HTML/default/a002612268.htm

PeterC it is not a proc append option as much as a Data Set Option, to be applied to the BASE= data set. Message was edited by: Peter.C
Super Contributor
Super Contributor
Posts: 3,174

Re: re-loading

Remember to keep one or more backup copies of your master data in case a restore / recovery process is required -- best to test the process as well at some point. A backup might be the entire SAS data library (using PROC COPY) or maybe using PROC DATASETS and AGE to keep history "cycles" of data / control files.

Scott Barry
SBBWorks, Inc.
Super Contributor
Posts: 474

Re: re-loading

The SPDE engine per se, is quite a powerful feature, but I wouldn't recomend it for the average user, as it is an advanced feature very system dependent.

Beside the options stated above, I will add another that you consider or not.

Say you manage to store every received table with a unique name, you could then consolidate all the data on a single view.

For example, lets say you have one table per vendor, if and only if, each table name could be in some way associated with the corresponding vendor (for example vendor name or vendor ID), the you just have to keep all the tables in a single libname and create a view over them.

To access the data, you just have to open the view which will consolidate all the tables into a single one. And updating the data for a specified vendor is actually very easy, you just have to overwrite the vendor table with the one you received.

Cheers from Portugal.

Daniel Santos @ www.cgd.pt
Valued Guide
Posts: 2,177

Re: re-loading

Posted in reply to DanielSantos
I over-engineered with the SPDE solution.
Having a unique index is enough to reject rows already in the index.
The merit of that SPDE data set option (probably unneccessary here) is that rejected rows can be collected for reprocessing .. modify/update/sql/merge.
Since the required behaviour is just rejection, probably collecting the rejected "appends" is not needed.
This sasLog snippet demos the non-SPDE run[pre]407 **********************************************************************;
409 libname t (work) ;
NOTE: Libref T was successfully assigned as follows:
Levels: 1
Engine(1): V9
Physical Name(1): C:\Users\PETERC~1\AppData\Local\Temp\SAS Temporary Files\_TD8120
410 data t.name1 ;
411 input name $ value ;
412 list;cards ;

RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+-
413 Jack 123
414 Jill 3210
415 Hill 0
NOTE: The data set T.NAME1 has 3 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.09 seconds
cpu time 0.00 seconds

416 ;
417 data t.name2( index=( name/unique ) ) ;
418 input name $ value ;
419 list;cards ;

RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+-
420 Jill 4321
421 Hill 0
NOTE: The data set T.NAME2 has 2 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.13 seconds
cpu time 0.00 seconds

422 ;
423
424
425 proc append data= name1 base= t.name2 ;*( uniquesave=yes) ;
426 run ;

NOTE: Appending WORK.NAME1 to T.NAME2.
WARNING: Duplicate values not allowed on index name for file NAME2, 2 observations rejected.
NOTE: There were 3 observations read from the data set WORK.NAME1.
NOTE: 1 observations added.
NOTE: The data set T.NAME2 has 3 observations and 2 variables.
NOTE: PROCEDURE APPEND used (Total process time):
real time 0.10 seconds
cpu time 0.03 seconds[/pre]
note the "NOTE: 1 observations added."
Ask a Question
Discussion stats
  • 7 replies
  • 170 views
  • 0 likes
  • 6 in conversation