Solved: Re: [PROBLEM]Merging with Indexed Dataset .

ankit___gupta · Posted 01-23-2017 04:08 AM

I have two tables MASTER (30 million rows), TRANSACTION (300 rows).Before applying index, I was using the following code.

data MASTER;

merge MASTER TRANSACTION;

by ID;

run;

The code resulted in data from both the data set merged. (Thus containing in total 30 million rows).

After Applying index to the MASTER on (ID and LOCATION) using

proc sql;

create index inde on MASTER(ID, LOCATION);

quit;

Now here lies the problem on how to merge.

[PROBLEM] Now Merging them using index.

[I got this code.]

data MASTER ;

set TRANSACTION;

set MASTER key = inde;

if _IORC_ = 0 THEN MASTER;
ELSE MASTER;

run;

But it is not resulting in the same merge as mentioned earlier.

Also if there is possibility for a solution on the above using simple index ?

mkeintz · Posted 01-23-2017 05:45 AM

You want to replicate the results of your MERGE program with a modify program, thereby savings lots of input/output.

If there are variables in the transaction dataset not in the master dataset, they will be added to the vars in the master dataset. This cannot be replicated using modify.

But assuming there are no new vars in the transaction data set, then you could do something like this to replicate the merge:

data master;
  set transaction;
  modify master key=inde;
  set transaction (drop=recordcreation);
  if _iorc_=0 then replace;
  else output;
  call missing(of _all_);
run;

Notes:

The reason for the second "set transaction" statement is because a successful "modify" statement would overwrite values from the first "set transaction" - the opposite of your merge program. Actually the first "set transaction" could be revised to
set transaction (keep=id location)
to make clear the purpose of each set statement .
If there is a new record (i.e. new id/location) in dataset transaction and if a variable (say var SUPPLIER) is in master but not transaction, then supplier would inherit its value from the most recent successful modify statement, when it should have a missing value. That's why there is a "call missing" statement.
I should add that the replication is accurate if there is only one record per index value in each dataset.
The "drop=recordcreation" parameter was added to the 2nd "set transaction" to accomodate a revision to the requirement - namely to keep the value from master whenever possible, and take the value from transaction only when there is a new record. Of course, this also breaks the property of exactly replicating the results of the merge program.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Kurt_Bremser · Posted 01-23-2017 04:30 AM

What is your intended result? 300 million observations or 300 observations?

Where did you get that code

data MASTER ;
set TRANSACTION;
set MASTER key = inde;
if _IORC_ = 0 THEN MASTER;
ELSE MASTER;
run;

from?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ankit___gupta · Posted 01-23-2017 06:44 AM

My Intended result is 300 million rows. The rows which are present in both should be overwritten in MASTER table. This should be accomplished using index.

DanielSantos · Posted 01-23-2017 04:47 AM

Hi.

Not sure of what you are trying to do, but normally it should be something like this:

proc sql;
create index ID on TRANSACTION(ID); * simple index name should be the column name;
quit;

data MASTER ;
set MASTER ;
set TRANSACTION key = ID;
if _IORC_ = 0; * output when match;
run;

Index should be created for the smaller table, which I assume here is TRANSACTION.

Then you can use _IORC_ to do whatever you want (output or not), being 0 a match 1 a mismatch.

Of course if you do the merge via SQL, SAS will handle the index automatically for you (if possible).

Hope it helps.

Daniel Santos @ www.cgd.pt

mkeintz · Posted 01-23-2017 05:45 AM

You want to replicate the results of your MERGE program with a modify program, thereby savings lots of input/output.

If there are variables in the transaction dataset not in the master dataset, they will be added to the vars in the master dataset. This cannot be replicated using modify.

But assuming there are no new vars in the transaction data set, then you could do something like this to replicate the merge:

data master;
  set transaction;
  modify master key=inde;
  set transaction (drop=recordcreation);
  if _iorc_=0 then replace;
  else output;
  call missing(of _all_);
run;

Notes:

The reason for the second "set transaction" statement is because a successful "modify" statement would overwrite values from the first "set transaction" - the opposite of your merge program. Actually the first "set transaction" could be revised to
set transaction (keep=id location)
to make clear the purpose of each set statement .
If there is a new record (i.e. new id/location) in dataset transaction and if a variable (say var SUPPLIER) is in master but not transaction, then supplier would inherit its value from the most recent successful modify statement, when it should have a missing value. That's why there is a "call missing" statement.
I should add that the replication is accurate if there is only one record per index value in each dataset.
The "drop=recordcreation" parameter was added to the 2nd "set transaction" to accomodate a revision to the requirement - namely to keep the value from master whenever possible, and take the value from transaction only when there is a new record. Of course, this also breaks the property of exactly replicating the results of the merge program.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ankit___gupta · Posted 01-24-2017 02:47 AM

Hi mkeintz,

your reply was totally accurate. I have just one other question to the query.

Currently it is replacing all the column in MASTER from TRANSACTION.

But i have one particular column (column name : RecordCreation) in Master and TRANSACTION that needs to be retained in MASTER and replace the other column.Also if the data was present only in transaction then put the entire to MASTER.

Please help with the above.

Thanks in Advance.

mkeintz · Posted 01-24-2017 09:04 AM

Then, in the second SET TRANSACTION, modify it to

set transaction (drop=recordcreation)

See my editted program in the prior topiic message.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Classroom Training Available!