Joining 2 tables and removing duplicates

Reply
Contributor
Posts: 37

Joining 2 tables and removing duplicates

[ Edited ]

Hi everyone. I have the following 2 datasets

 

1st dataset (sample)

GVKEYYearMeansale
1001198310
1001198310
1001198310
1001198310
1001198415
1001198415
1001198415
1001198415
1001198415
………………………

 

Sample 2nd dataset (with lag_year = year (in 1st dataset) -1 )

GVKEYLag_Year
10011982
10011982
10011982
10011982
10011983
10011983
10011983
10011983
10011983
………………

 

I want to join the two table with the output like this

GVKEYLag_YearMeansale
10011982.
10011982.
10011982.
10011982.
1001198310
1001198310
1001198310
1001198310
1001198310
………………………

 

It seems very straightforward. But when I use proc sql left join 2nd dataset and 1st dataset on 1.gvkey=2.gvkey and 1.lag_year=2.year, it duplicates the matched observations.

 

For example, it produces 25 observations of matched GVKEY 1001 with year 1983 while I want only 5 matches like in the original data.

 

Is there any way to not include the duplicates with PROC SQL (or any other method?)

PROC Star
Posts: 442

Re: Joining 2 tables and removing duplicates

[ Edited ]
Posted in reply to trungcva112

you can do thiscouple of ways. this is untested code. i would prefer the second way

proc sql;

select distinct 1.GVKEY, lag_year, meansale

from table1 left join table2

on  1.gvkey=2.gvkey

and 1.lag_year=2.year;

 

or 

proc sql

select a.*,  meansale from 

(select distinct * from table2)a

left  join

(select distinct  GVKEY, year, meansale from table 2)b

on  a.gvkey=b.gvkey

and a.lag_year=b.year;

 

Super User
Posts: 21,546

Re: Joining 2 tables and removing duplicates

Posted in reply to trungcva112

I suspect you need another criteria on your join. 

 

It looks like at present you have multiples by your join key variables, 

GVKEY and YEAR, in each data set. If this is correct, then for every N1 records in Table1 and N2 records in Table2 for a given GVKEY and YEAR you'll get N1*N2 records. 

 

 

Trusted Advisor
Posts: 1,683

Re: Joining 2 tables and removing duplicates

[ Edited ]

Does table 2 contain only the two variables - GVKEY and Lag_Year ?

@trungcva112 wrote "2nd dataset (with lag_year = year (in 1st dataset) -1 )" - 

does it mean that the 2nd table was derived from table1 ?

 

If positive, you can create the wanted table directly from table1:

proc sort data=tabl1 out=temp nodupkey ; by gvkey year. run;
data want; set temp;
by gvkey year.
retain year1; drop year1;
if first.gvkey then year1=year; lag_year = year - 1;
if lag_year = year1 then meansale = .; run;
PROC Star
Posts: 831

Re: Joining 2 tables and removing duplicates

Posted in reply to trungcva112

Is there any way to not include the duplicates with PROC SQL (or any other method?)


Why not a simple merge after sorting. If cartesian is not considered an overhead before left join on equality operator to filter your needs, I am sure sort should be acceptable. Else, load your have1 with multidata:yes in a hash object choosing keys gvkey and year. look up from have2 in the set statement with hash.find() method else call missing(meansale). Good night!

 

data have1;
input GVKEY $ Year Meansale ;
datalines;
1001 1983 10
1001 1983 10
1001 1983 10
1001 1983 10
1001 1984 15
1001 1984 15
1001 1984 15
1001 1984 15
1001 1984 15
;


data have2;
input GVKEY $ Lag_Year;
datalines;
1001 1982
1001 1982
1001 1982
1001 1982
1001 1983
1001 1983
1001 1983
1001 1983
1001 1983
;

data want;
merge have2(in=a) have1(rename=(Year=Lag_Year));
by gvkey lag_year;
if a;
run;

Ask a Question
Discussion stats
  • 4 replies
  • 133 views
  • 0 likes
  • 5 in conversation