Solved: Re: Merge

robertrao · Posted 08-08-2013 11:41 AM

Hi,

I posted this question yesterday and I am reframing my question again

The criteria is that there are two datasets one with placed times and the other for removal times.

The removal time has to be greater than the placed time and

The placed-removal times should be closest to each other

ALSO When we I merge I am getting repeats of BY variables

ID PLACED

101   06JUN2013:04:50:00
101   06JUN2013:04:57:00
101   11JUN2013:17:45:00
101   11JUN2013:19:38:00
102   18JUN2013:06:30:00

ID       REMOVED
101    11JUN2013:13:35:00
101    15JUN2013:14:30:00
102    19JUN2013:22:40:00
102    27JUN2013:16:35:00
102    27JUN2013:17:59:00

WANT

ID    PLACED                 REMOVED
101 11JUN2013:19:38:00     15JUN2013:14:30:00
102 18JUN2013:06:30:00     19JUN2013:22:40:00

Thanks

Vince28_Statcan · Posted 08-08-2013 04:02 PM

Indeed you can't ignore it, I had missed on that comment. SAS merge statement does not do cartesian product merging when you have many-to-many merges. If you search for MERGE in SAS help, you can find the following quote:

Note: The MERGE statement does not produce a

Cartesian product on a many-to-many match-merge. Instead it performs a

one-to-one merge while there are observations in the BY group in at least one

data set. When all observations in the BY group have been read from one data set

and there are still more observations in another data set, SAS performs a

one-to-many merge until all observations have been read for the BY group.

There are ways to do many-to-many merges through datasteps like using multidata enabled hash objects but the simple approach, again, is to fall back to proc sql that does a wonderful job.

proc sql;

create table merged as

select t1.id, t1.placed, t2.removed

from table1 as t1

full join

table2 as t2

on t1.id=t2.id;

quit;

SQL also allows to do merge on intermediately calculated variables or on variables with different names (not that adding a rename statement as a dataset option is very complicated) but it is somewhat more flexible than data step on quite a few things.

Plus, with SAS9.4 coming, you will be able to use SQL syntax in proc DS2 (datastep2 new syntax) to retrieve tables for the Set statement so it's definitely worth taking the time to learn.

Vincent

*edit SASKiwi's SQL syntax is equivalent to the above

View solution in original post

Vince28_Statcan · Posted 08-08-2013 03:11 PM

An easy but not processing-efficient solution would be to use sql summary functions to only retain the record, from your merged dataset, such that removed-placed is minimal.

proc sql;

create table want as

select *

from merged

group by id

having min(removed-placed) and removed>placed;

quit;

Assuming your placed and removed columns are effectively datetime values and not character strings. If they're character strings, you can change

having min(removed-placed);

to

having min(input(removed, datetime18.)-input(placed, datetime18.)) and input(removed, datetime18.) GT input(placed, datetime18.);

or well, something similar (sometimes informats disapprove width and require datetime. informat or something similar).

robertrao · Posted 08-08-2013 03:23 PM

Hi,

Thanks for the reply.

How can we acheive this with simple datasteps ????

thanks in advance

Vince28_Statcan · Posted 08-08-2013 03:31 PM

hrm I would be inclined to say take a stab at proc sql; it simplifies a lot of multiple step (spaghetti code) data syntax and improves readability. However, here's a 3 step approach that could achieve the same with data step syntax

/* step1 calculate the difference between removed and placed and put it in a new variable */

data merged;

set merged;

diff=removed-placed;

if dif>0 then output;

end;

/* Step 2, sort by id & diff so that the minimum diff (only positive ones leftover from step1) can be targetted with first.diff */

proc sort data=merged;

by id diff;

run;

/* Step 3, output only first.diff with by group processing into a new dataset */

data want;

set merged;

by id;

if first.diff then output;

run;

This should do more or less the same as the single step sql syntax proposed above.

Vincent

robertrao · Posted 08-08-2013 03:40 PM

Hi

Thanks again.

But when we merge prior to the diff calculation we get message note: Merge has more than 2 datasets with repeats of BY variables??

Can we ignore it??????

Thanks

SASKiwi · Posted 08-08-2013 03:52 PM

No you shouldn't ignore the message. The MERGE statement can only handle one-to-one and one-to-many joins, not many-to-many. Your merged data will drop the duplicate IDs from your second dataset (marked as follows in bold):

ID REMOVED

101 11JUN2013:13:35:00

101 15JUN2013:14:30:00

102 19JUN2013:22:40:00

102 27JUN2013:16:35:00

102 27JUN2013:17:59:00

Join your two datasets first using SQL to avoid this:

proc sql

create merged as

select *

from placed,

removed

where placed.id = removed.id

;

quit;

Vince28_Statcan · Posted 08-08-2013 04:02 PM

Indeed you can't ignore it, I had missed on that comment. SAS merge statement does not do cartesian product merging when you have many-to-many merges. If you search for MERGE in SAS help, you can find the following quote:

Note: The MERGE statement does not produce a

Cartesian product on a many-to-many match-merge. Instead it performs a

one-to-one merge while there are observations in the BY group in at least one

data set. When all observations in the BY group have been read from one data set

and there are still more observations in another data set, SAS performs a

one-to-many merge until all observations have been read for the BY group.

There are ways to do many-to-many merges through datasteps like using multidata enabled hash objects but the simple approach, again, is to fall back to proc sql that does a wonderful job.

proc sql;

create table merged as

select t1.id, t1.placed, t2.removed

from table1 as t1

full join

table2 as t2

on t1.id=t2.id;

quit;

SQL also allows to do merge on intermediately calculated variables or on variables with different names (not that adding a rename statement as a dataset option is very complicated) but it is somewhat more flexible than data step on quite a few things.

Plus, with SAS9.4 coming, you will be able to use SQL syntax in proc DS2 (datastep2 new syntax) to retrieve tables for the Set statement so it's definitely worth taking the time to learn.

Vincent

*edit SASKiwi's SQL syntax is equivalent to the above

Registration is open

SAS Training: Just a Click Away