BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ankit___gupta
Quartz | Level 8

I have two tables MASTER (30 million rows), TRANSACTION (300 rows).Before applying index, I was using the following code.

                data MASTER;

                     merge MASTER TRANSACTION;

                     by ID;

                run;

The code resulted in data from both the data set merged. (Thus containing in total 30 million rows).

 

 

 After Applying index to the MASTER on (ID and LOCATION) using

                      proc sql;

                                create index inde on MASTER(ID, LOCATION);

                      quit;

 

Now here lies the problem on how to merge. 

[PROBLEM] Now Merging them using index.

 

[I got this code.]

 

                               

                          data MASTER ;

                                    set TRANSACTION;

                                    set MASTER key = inde;

                                    if _IORC_ = 0 THEN MASTER;
                                    ELSE MASTER;

                         run;

 

But it is not resulting in the same merge as mentioned earlier.

 

Also if there is possibility for a solution on the above using simple index ?

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

You want to replicate the results of your MERGE program with a modify program, thereby savings lots of input/output.

 

If there are variables in the transaction dataset not in the master dataset, they will be added to the vars in the master dataset. This cannot be replicated using modify.

 

But assuming there are no new vars in the transaction data set, then you could do something like this to replicate the merge:

 

data master;
  set transaction;
  modify master key=inde;
  set transaction (drop=recordcreation);
  if _iorc_=0 then replace;
  else output;
  call missing(of _all_);
run;

 

Notes:

  1. The reason for the second "set transaction" statement is because a successful "modify" statement would overwrite values from the first "set transaction" - the opposite of your merge program. Actually the first "set transaction" could be revised to
       set transaction (keep=id location)
    to make clear the purpose of each set statement .

  2. If there is a new record (i.e. new id/location) in dataset transaction and if a variable (say var SUPPLIER) is in master but not transaction, then supplier would inherit its value from the most recent successful modify statement, when it should have a missing value.  That's why there is a "call missing" statement.

  3. I should  add that the replication is accurate if there is only one record per index value in each dataset.

  4. The "drop=recordcreation" parameter was added to the 2nd "set transaction" to accomodate a revision to the requirement - namely to keep the value from master whenever possible, and take the value from transaction only when there is a new record.  Of course, this also breaks the property of exactly replicating the  results of the merge program.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

6 REPLIES 6
Kurt_Bremser
Super User

What is your intended result? 300 million observations or 300 observations?

 

Where did you get that code

data MASTER ;
set TRANSACTION;
set MASTER key = inde;
if _IORC_ = 0 THEN MASTER;
ELSE MASTER;
run;

from?

ankit___gupta
Quartz | Level 8

My Intended result is 300 million rows. The rows which are present in both should be overwritten in MASTER table. This should be accomplished using index.

DanielSantos
Barite | Level 11

Hi.

 

Not sure of what you are trying to do, but normally it should be something like this:

 

proc sql;
create index ID on TRANSACTION(ID); * simple index name should be the column name;
quit;

data MASTER ;
set MASTER ;
set TRANSACTION key = ID;
if _IORC_ = 0; * output when match;
run;

Index should be created for the smaller table, which I assume here is TRANSACTION.

 

Then you can use _IORC_ to do whatever you want (output or not), being 0 a match 1 a mismatch.

 

Of course if you do the merge via SQL, SAS will handle the index automatically for you (if possible).

 

Hope it helps.

 

Daniel Santos @ www.cgd.pt

mkeintz
PROC Star

You want to replicate the results of your MERGE program with a modify program, thereby savings lots of input/output.

 

If there are variables in the transaction dataset not in the master dataset, they will be added to the vars in the master dataset. This cannot be replicated using modify.

 

But assuming there are no new vars in the transaction data set, then you could do something like this to replicate the merge:

 

data master;
  set transaction;
  modify master key=inde;
  set transaction (drop=recordcreation);
  if _iorc_=0 then replace;
  else output;
  call missing(of _all_);
run;

 

Notes:

  1. The reason for the second "set transaction" statement is because a successful "modify" statement would overwrite values from the first "set transaction" - the opposite of your merge program. Actually the first "set transaction" could be revised to
       set transaction (keep=id location)
    to make clear the purpose of each set statement .

  2. If there is a new record (i.e. new id/location) in dataset transaction and if a variable (say var SUPPLIER) is in master but not transaction, then supplier would inherit its value from the most recent successful modify statement, when it should have a missing value.  That's why there is a "call missing" statement.

  3. I should  add that the replication is accurate if there is only one record per index value in each dataset.

  4. The "drop=recordcreation" parameter was added to the 2nd "set transaction" to accomodate a revision to the requirement - namely to keep the value from master whenever possible, and take the value from transaction only when there is a new record.  Of course, this also breaks the property of exactly replicating the  results of the merge program.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ankit___gupta
Quartz | Level 8

Hi mkeintz

 

 

 

 

mkeintz
PROC Star

Then, in the second SET TRANSACTION, modify it to

    set transaction (drop=recordcreation)

 See my editted program in the prior topiic message.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 1365 views
  • 1 like
  • 4 in conversation