Re: Matching Observations / Cartesian Product w/o Duplicates

mbaugh · Posted 04-16-2015 09:24 PM

Hi,

I have a data set something to the effect of:

ID Year Industry

1 2005 1

1 2006 1

2 2005 1

2 2006 1

3 2005 2

3 2006 2

4 2005 2

4 2006 2

5 2005 2

5 2006 2

I need to create a data set with all possible firm pairs within industries and years, something to the effect of:

ID1 ID2 Year Industry

1 2 2005 1

3 4 2005 2

3 5 2005 2

4 5 2005 2

I've tried copying the original table and merging the copy back to the original using a Cross Join (Cartesian Product) in Proc SQL. The problem with that is it effectively double counts pairs such that it creates data lines for both:

ID1 ID2 Year Industry

1 2 2005 1

2 1 2005 1

These are duplicate entries for what I need.

Does anyone know how I can pair within a group without using a full Cross Join / Cartesian Product?

Thanks!

Ksharp · Posted 04-17-2015 10:20 AM

If I understood what you mean.

data have;
input ID     Year     Industry ;
cards;
1     2005     1    
1     2006     1
2     2005     1
2     2006     1
3     2005     2
3     2006     2
4     2005     2
4     2006     2
5     2005     2    
5     2006     2
;
run;
proc sort data=have;by Year     Industry ;run;
data want;
 set have;
 by Year Industry ; 
 array x{99999} _temporary_;
 if first.Industry then n=0;
 n+1;x{n}=id;
 if last.Industry then do;
   do i=1 to n-1;
    do j=i+1 to n;
      id1=x{i};id2=x{j};output;
     end;
   end;
 end;
drop n i j id;
run;

Xia Keshan

Reeza · Posted 04-17-2015 10:21 AM

I think your request is the definition of what a cross join is, but you can add a condition to limit the results:

WHERE ID1 LT ID2

OR

WHERE ID1 LE ID2

Haikuo · Posted 04-17-2015 11:21 AM

Following 's suggestion:

data have;

input ID Year Industry;

cards;

1 2005 1

1 2006 1

2 2005 1

2 2006 1

3 2005 2

3 2006 2

4 2005 2

4 2006 2

5 2005 2

5 2006 2

;

run;

proc sql;

select distinct a.id as id1, b.id as id2, a.year, a.industry

from have a, have b

where a.year=b.year and a.industry=b.industry and a.id < b.id

order by a.year;

quit;

mbaugh · Posted 04-17-2015 12:03 PM

Thank you!

Why is the last where statement a.id < b.id and not a.id<>b.id?

ballardw · Posted 04-17-2015 12:07 PM

Run it and see. Hint: 1 <> 2 and 2<>1.

JMCass · Posted 03-04-2020 04:27 PM

So simple. Thanks!

Matching Observations / Cartesian Product w/o Duplicates