Re: Question on left join

yichentian226 · Posted 09-25-2020 04:39 PM

Hi, below is my query and on the screenshot it's how the two datasets look like and what I want to get.

proc sql;
create table withbounceclick as
select e.*,
b.URL, b.LinkName, b.clickflag, b.ClickDate
from email.withbounce_new1 e
left join email.EnterpriseDataViewClick b on (b.EmailAddress = e.EmailAddress and b.JobId=e.JobId);
QUIT;

However, this is giving me more rolls comparing to dataset: email.withbounce_new1... this is confusing to me because I was expecting to see same number of rolls in dataset: withbounceclick and those in withbounce_new1. Thanks in advance!

Reeza · Posted 09-25-2020 04:56 PM

You have duplicates of Email Address and/or Job ID in your data set. If you have 2 emails in one and 3 of the same email in the other data set, it will join with all, giving you 3 x 2 = 6 rows in the output. Generally, the solution is to modify your join condition.

yichentian226 · Posted 09-25-2020 05:00 PM

Yes I do have duplicate emailaddress and jobid, I thought on (b.EmailAddress = e.EmailAddress and b.JobId=e.JobId) could prevent but apparently it didn't. Can you please share the right condition I should write?Thanks very much

ballardw · Posted 09-25-2020 05:06 PM

@yichentian226 wrote:

Yes I do have duplicate emailaddress and jobid, I thought on (b.EmailAddress = e.EmailAddress and b.JobId=e.JobId) could prevent but apparently it didn't. Can you please share the right condition I should write?Thanks very much

So getting multiple records means that you have duplicates of both variables in combination.

Not efficient but you can try

Select Distinct e.*

which should yield non-dupllcated rows. If you expect to only have one email and jobid combination in the output then you may need to provide more rules of how to determine which is supposed to be the single output record.

yichentian226 · Posted 09-25-2020 05:26 PM

Thanks! So my dataset is kinda like that, let's say dataset A has jobid, emailaddress.clickflag and clickdate, datasetB has jobid, emailaddressa and sentdate. Dataset B has more records because it has all the emails we sent. And of course many duplicate jobid and emailaddress. I am not sure if I can use distinct in this case?

Reeza · Posted 09-25-2020 05:15 PM

@yichentian226 wrote:

Yes I do have duplicate emailaddress and jobid, I thought on (b.EmailAddress = e.EmailAddress and b.JobId=e.JobId) could prevent but apparently it didn't. Can you please share the right condition I should write?Thanks very much

That requires knowledge of the data, and we don't have that. That's part of your requirements.

If you have other variables in your data set the distinct won't help either because you may not have exact duplicates. This is a logical business decision, not a programming question but one YOU need to answer. Is there another field that uniquely identifies each row that is common? Does it need to be based on some date range - very common in medical and insurance data? If the other fields don't matter you can do a distinct but if they do then you need to make your business decisions. And not all variables may be treated the same - some you may need to summarize with a sum, but others may need to use a mean or median.

yichentian226 · Posted 09-25-2020 05:33 PM

This is how the data looks like: dataset A has jobid, emailaddress.clickflag and clickdate, datasetB has jobid, emailaddress and sentdate. Dataset B has more records because it has all the emails we sent. So here it definitely makes sense the duplicate you mentioned, but I'm thinking if we can make use of the date to avoid duplicate?

Reeza · Posted 09-25-2020 06:05 PM

Data set A - I'm assuming this is the click data for your email campaigns.

Do you care if someone clicked on something more than once? Does it really affect things?
I would consider simplifying that file to first open/click date so you have only one record per JobID and emailAddress. I would simplify it to first date opened, last date opened and number of opens/reads.

I'm assuming dataset B has a unique records entirely - if it doesn't that's a different problem - why do you allow people to have multiple email addresses connected to multiple accounts?

Then I'd join my simplified file.

yichentian226 · Posted 09-26-2020 12:15 AM

Thanks those are great ideas, however I do need to bring some of those values in.. according to the requirements. I guess my biggest question here is one sent email might have two opens depending on when users opened it, how many times they opened.

yichentian226 · Posted 09-26-2020 12:28 AM

I am thinking maybe I should do a left outer join?

Kurt_Bremser · Posted 09-26-2020 03:25 AM

Don't think about tools yet. You have to clarify other things first.

what do the multiple observations in B show? Are those different emails sent to the same mail address, or the same mail sent to one person under several mail addresses, or a combination of both?
what about dataset A? Are those responses to a single mail per person, or multiple mails? If multiple, can you identify which response belongs to which mail?

Once these questions can be answered sufficiently, we can think of ways to summarize the datasets down to create a one-to-one join, or we find that such cannot be done correctly because of missing information in the datasets.

Ideally, you post (made-up) examples for your datasets in data steps with datalines that contain all the possible combinations you find in your real data, and show us what you want to get out of these cases.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

yichentian226 · Posted 09-26-2020 11:06 PM

Hi,

Thanks for your reply. I have listed the information in excel sheet to make it more clear. So sent table has all the sent email information. Open table tracked people opening the emails we sent. Complain table has the records of users complaining about our sent emails. This is my code here, it worked perfectly fine when I am only joining two tables, however giving extra wrong records when I am joining more than 2 tables...

**proc sql;
  create table result_test as
  select bounce.bouncedate,
  open.openflag,
  open.opendate,
  complain.complainflag,
  complain.complaindate,
    coalesce (sent.JobId, open.JobId, bounce.JobId, complain.JobID) as JobId
  , coalesce (sent.EmailAddress, open.EmailAddress, bounce.EmailAddress, complain.EmailAddress) as EmailAddress
  , case 
      when bounce.JobID is null or bounce.EmailAddress is null then '0'
      else bounce.bounceflag
    end as bounceflag
    ,case when open.JobID is null or open.EmailAddress is null then '0'
    else open.openflag end as openflag,
    case when complain.JobID is null or complain.EmailAddress is null then'0'
    else complain.complainflag
    end as complainflag
  from
    sent
  full join 
    bounce
  on
    sent.JobId = bounce.JobId AND
    sent.EmailAddress = bounce.EmailAddress
    full join 
    open on
    sent.JobId = open.JobId AND
    sent.EmailAddress = open.EmailAddress
    full join 
    complain on 
    sent.JobID= complain.JobID AND 
    sent.EmailAddress= complain.EmailAddress
  ;**

(this is the wrong extra record example. As you can see there shouldn't be two same complaindates..

Kurt_Bremser · Posted 09-27-2020 03:07 AM

I can't use pictures as data for testing. Please supply your example data in usable form, in data steps with datalines.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Reeza · Posted 09-26-2020 01:55 PM

It all depends on the question you're trying to answer and you knowledge of the business process and data.

We don't have that information.

@yichentian226 wrote:

Thanks those are great ideas, however I do need to bring some of those values in.. according to the requirements. I guess my biggest question here is one sent email might have two opens depending on when users opened it, how many times they opened.

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!