DG1984
Fluorite | Level 6

I currently use the code below to insert data from WORK into a permanent table.  Is there a more efficient way of doing this?

The process is:

Import CSV

Check whether each transaction ID is already in the permanent SAS table; if not, insert it

Data Work.temp;
Set Lib.Permanent_Table  (firstobs=30000000);
Keep Transaction_ID;
Run;



Proc sql;
INSERT INTO Lib.Permanent_Table 
SELECT * FROM WORK.Table01
WHERE 'Transaction_Id'n NOT IN
(SELECT 'Transaction_Id'n FROM work.temp)
;
drop table work.temp
;
QUIT;

 

The Permanent table has 41,995,622 observations and I am inserting another 150k every day with 90 columns.

 

Thanks

 

 

 

8 REPLIES
Astounding
PROC Star

A couple of ideas to try ...

 

First, create the set of IDs more quickly:

 

data work.temp;

set lib.Permanent_Table (firstobs=30000000 keep=Transaction_ID);

run;

 

Your original data step read in all 90 variables, then dropped 89 of them.

 

Second (I'm not really familiar with the speed of SQL's INSERT, so this would be an experiment): create the subset of Table01 separately from adding it to the permanent data set:

 

proc sql;

create table subset as 

select * from work.Table01

where Transaction_ID not in

(select Transaction_ID from work.temp);

quit;

 

Then appending might be much faster:

 

proc append data=subset base=lib.Permanent_Table;

run;

 

The variable definitions must match 100% for this to work, however.  And appending the new observations to the end of the existing data set must be an acceptable result.
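If the two structures can ever drift apart, PROC APPEND's FORCE option will let the append proceed anyway. A minimal sketch (variables in DATA but not in BASE are dropped; variables in BASE but not in DATA are set to missing, so use it deliberately):

```sas
/* Sketch: append even when variable definitions differ.
   FORCE drops extra variables from SUBSET and fills missing
   ones with missing values in the base table. */
proc append base=lib.Permanent_Table data=subset force;
run;
```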

RW9
Diamond | Level 26

To think of it the other way round:

proc sql;
  create table data_to_add as 
  select * 
  from   csv_data
  where transaction_id not in (select distinct transaction_id from permanent_table);
quit;

proc append base=permanent_table data=data_to_add;
run;

Avoid this: 

'Transaction_Id'n

If you are using SAS, use SAS naming conventions for your variables.  The only time '  'n name literals do anything other than make your code messier is when you have to deal with Excel or other poor data sources.

mkeintz
PROC Star

Ninety columns?  150K new observations daily?  I'd suggest maintaining a second dataset of just the current transaction_id's (call it lib.id_list) .  Then you could

 

  1. Read the new transactions and see if they are already in lib.id_list, using a hash object from lib.id_list.
  2. If not then
    1. put the transaction in a data set (view) to be appended   (data_to_append)
    2. record the new transaction_id to be appended to the id_list (new_ids)
  3. append 2.1 and 2.2 to their respective master data set files.

 

data new_ids (keep=transaction_id)
     data_to_append / view=data_to_append;

  set new_transactions;
  if _n_=1 then do;
    /* Load the existing master ids into a hash object, once */
    declare hash h (dataset:'lib.id_list');
    h.definekey('transaction_id');
    h.definedone();
  end;
  /* Keep only transactions whose id is NOT already in the master list */
  if h.find() ^= 0;
run;

proc append base=lib.permanent_table data=data_to_append;
run;
proc append base=lib.id_list data=new_ids;
run;

 

Edited additional note.  If your original  41,995,622 observations all have unique transaction_id's, then I would modify the program above to establish as many "buckets" as possible in hash object h.  So change the DECLARE statement to

   declare hash h (dataset:'lib.id_list',hashexp:16);

which tells SAS to make 2**16=64K buckets, each with about 641 transaction_id's.  The h.find() method will surely be faster than with the default hashexp:8 (2**8 buckets each with about 164,000 transaction_id's).  hashexp:16 is the maximum supported value.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
DG1984
Fluorite | Level 6
Will definitely give this a try and compare with a few others I have been given!

Thanks all for the help!
PGStats
Opal | Level 21

First, make sure you have an index defined on Transaction_Id in your permanent table.
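If no such index exists yet, a one-time step along these lines would define it (a sketch, untested against your library):

```sas
/* Sketch: define a simple index on Transaction_Id in the
   permanent table. This is a one-time operation; SAS then
   maintains the index on every insert. */
proc datasets library=lib nolist;
  modify Permanent_Table;
  index create Transaction_Id;
quit;
```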

 

Then try:

 

Proc sql;

create table Table02 as
select * from Table01
where Transaction_Id not in (select Transaction_Id from lib.Permanent_Table);

INSERT INTO Lib.Permanent_Table 
SELECT * FROM WORK.Table02;

drop table work.table02;
QUIT;

(untested)

 

PG
mkeintz
PROC Star

@PGStats

 

Your code will certainly be superior to the original OP code.  But an index on 41M unique transaction id's (I assume they are unique) will yield a binary search tree with depth about 25.  I would be interested in comparing its performance to the hash solution I suggested.  This has long been one of the established advantages of hash search vs binary search, and I presume it would apply in this case.

LinusH
Tourmaline | Level 20
With a table of this size, where every insert also triggers an index update, I would definitely move it to an SPDE library.
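A minimal sketch of what that move could look like (all paths are placeholders for your own storage locations):

```sas
/* Sketch: an SPDE library spreads data and index components
   across paths and handles large tables and index maintenance
   more efficiently than a BASE engine library. */
libname spdelib spde '/sasdata/meta'
    datapath=('/sasdata/d1' '/sasdata/d2')
    indexpath=('/sasdata/idx');

/* One-time copy of the permanent table into the SPDE library */
proc copy in=lib out=spdelib;
  select Permanent_Table;
run;
```

After the copy, the daily job would point LIB (or a new libref) at the SPDE location and append there instead.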
Data never sleeps

