Re: Duplicate values after join

Dogo23 · Posted 01-03-2018 06:07 PM

Hi all,

I'm assuming this is a simple answer but here it goes:

I have two tables for an email campaign: one with engagement levels, the other with revenue. I have the data rolled up to a regional grouping, and the unique ID's for each campaign are in both tables, so that is what I am using to join the two tables together. I basically want by campaign, by group - the engagements and revenue combined.

My problem is that I am getting three matching rows returned now for each value that should be unique.

Note that there are some campaigns without revenue, so I am using a left join to show the campaigns in the engagement table, and then also revenue where there is a match.

proc sql;
create table match as
select
distinct a.campaign_id,
a.region,
a.program,
a.processed_dttm,
a.sent,
a.open,
a.click,
b.bookings,
b.revenue
from table.engagement a
left join
table.revenue b
on a.campaign_id_id = b.campaign_id
where a.campaign_id is not null
group by 1;
quit;

Thanks for you help ahead of time!

Reeza · Posted 01-03-2018 06:14 PM

That means one of your tables has duplicates. Figure out which one first.

Dogo23 · Posted 01-03-2018 06:19 PM

Just checked both tables and there are no duplicates. There's a unique row for each row with the rolled up data.

Dogo23 · Posted 01-03-2018 06:23 PM

Sorry, I meant to say unique row for each group of rolled up data in the table.

Reeza · Posted 01-03-2018 06:32 PM

Check your log, do you have an error in your code?

on a.campaign_id_id = b.campaign_id

Otherwise, If that join is correct with the variable names then you shouldn't get duplicates.

Please post the output from the following:

proc sql;
select count(campaign_id_id) as N, count(distinct campaign_id_id) as N_Distinct
from table.engagement;
select count(campaign_id) as N, count(distinct campaign_id) as N_Distinct
from table.revenue;
quit;

Dogo23 · Posted 01-03-2018 06:42 PM

Dogo23 · Posted 01-03-2018 06:46 PM

Sorry, the first one I posted had distinct for both columns, here's an updated version:

ballardw · Posted 01-03-2018 07:11 PM

When your N is 2458 and the N_distinct is 887 then you either have one of the campaign_id values duplicated about 1570 times or many of the campaign_id values have one or more duplicates, averaging about 2.8 observations per id.

Perhaps your Where clause is referencing the wrong Id variable?

Reeza · Posted 01-03-2018 08:36 PM

@schlotty23 wrote:

Sorry, the first one I posted had distinct for both columns, here's an updated version:

That means you have duplicates....in both data sets.

What makes you think it's unique? Are you sure you're using the correct files, joining on the correct variables?

on a.campaign_id_id = b.campaign_id
where a.campaign_id is not null

Do you have two variables or should the second one be campaign_id_id?

Either way, you have duplicates, so you'll get multiples.

Dogo23 · Posted 01-04-2018 02:51 PM

Figured it out. I had to join both on the campaign id AND the region in the join in order to obtain the desired results.

Thanks all!

Registration is open

SAS Training: Just a Click Away