BookmarkSubscribeRSS Feed
yichentian226
Obsidian | Level 7

Capture.PNG

 

Hi, below is my query and on the screenshot it's how the two datasets look like and what I want to get. 

 

proc sql;
create table withbounceclick as
select e.*,
b.URL, b.LinkName, b.clickflag, b.ClickDate
from email.withbounce_new1 e
left join email.EnterpriseDataViewClick b on (b.EmailAddress = e.EmailAddress and b.JobId=e.JobId);
QUIT;

 

However, this is giving me more rolls comparing to dataset: email.withbounce_new1... this is confusing to me because I was expecting to see same number of rolls in dataset: withbounceclick and those in withbounce_new1. Thanks in advance!

13 REPLIES 13
Reeza
Super User
You have duplicates of Email Address and/or Job ID in your data set. If you have 2 emails in one and 3 of the same email in the other data set, it will join with all, giving you 3 x 2 = 6 rows in the output. Generally, the solution is to modify your join condition.
yichentian226
Obsidian | Level 7

Yes I do have duplicate emailaddress and jobid, I thought on (b.EmailAddress = e.EmailAddress and b.JobId=e.JobId) could prevent but apparently it didn't. Can you please share the right condition I should write?Thanks very much

ballardw
Super User

@yichentian226 wrote:

Yes I do have duplicate emailaddress and jobid, I thought on (b.EmailAddress = e.EmailAddress and b.JobId=e.JobId) could prevent but apparently it didn't. Can you please share the right condition I should write?Thanks very much


So getting multiple records means that you have duplicates of both variables in combination.

 

Not efficient but you can try

Select Distinct e.*

 

which should yield non-dupllcated rows. If you expect to only have one email and jobid combination in the output then you may need to provide more rules of how to determine which is supposed to be the single output record.

 

yichentian226
Obsidian | Level 7

Thanks! So my dataset is kinda like that, let's say dataset A has jobid, emailaddress.clickflag and clickdate, datasetB has jobid, emailaddressa and sentdate. Dataset B has more records because it has all the emails we sent. And of course many duplicate jobid and emailaddress. I am not sure if I can use distinct in this case?  

Reeza
Super User

@yichentian226 wrote:

Yes I do have duplicate emailaddress and jobid, I thought on (b.EmailAddress = e.EmailAddress and b.JobId=e.JobId) could prevent but apparently it didn't. Can you please share the right condition I should write?Thanks very much


That requires knowledge of the data, and we don't have that. That's part of your requirements. 

If you have other variables in your data set the distinct won't help either because you may not have exact duplicates. This is a logical business decision, not a programming question but one YOU need to answer. Is there another field that uniquely identifies each row that is common? Does it need to be based on some date range - very common in medical and insurance data? If the other fields don't matter you can do a distinct but if they do then you need to make your business decisions. And not all variables may be treated the same - some you may need to summarize with a sum, but others may need to use a mean or median. 

yichentian226
Obsidian | Level 7

This is how the data looks like: dataset A has jobid, emailaddress.clickflag and clickdate, datasetB has jobid, emailaddress and sentdate. Dataset B has more records because it has all the emails we sent. So here it definitely makes sense the duplicate you mentioned, but I'm thinking if we can make use of the date to avoid duplicate?

Reeza
Super User
Data set A - I'm assuming this is the click data for your email campaigns.

Do you care if someone clicked on something more than once? Does it really affect things?
I would consider simplifying that file to first open/click date so you have only one record per JobID and emailAddress. I would simplify it to first date opened, last date opened and number of opens/reads.

I'm assuming dataset B has a unique records entirely - if it doesn't that's a different problem - why do you allow people to have multiple email addresses connected to multiple accounts?

Then I'd join my simplified file.
yichentian226
Obsidian | Level 7

Thanks those are great ideas, however I do need to bring some of those values in.. according to the requirements. I guess my biggest question here is one sent email might have two opens depending on when users opened it, how many times they opened. Capture.PNG

yichentian226
Obsidian | Level 7


I am thinking maybe I should do a left outer join?
Kurt_Bremser
Super User

Don't think about tools yet. You have to clarify other things first.

  • what do the multiple observations in B show? Are those different emails sent to the same mail address, or the same mail sent to one person under several mail addresses, or a combination of both?
  • what about dataset A? Are those responses to a single mail per person, or multiple mails? If multiple, can you identify which response belongs to which mail?

Once these questions can be answered sufficiently, we can think of ways to summarize the datasets down to create a one-to-one join, or we find that such cannot be done correctly because of missing information in the datasets.

 

Ideally, you post (made-up) examples for your datasets in data steps with datalines that contain all the possible combinations you find in your real data, and show us what you want to get out of these cases.

yichentian226
Obsidian | Level 7

Hi,

Thanks for your reply. I have listed the information in excel sheet to make it more clear. So sent table has all the sent email information. Open table tracked people opening the emails we sent. Complain table has the records of users complaining about our sent emails. This is my code here, it worked perfectly fine when I am only joining two tables, however giving extra wrong records when I am joining more than 2 tables... 

**proc sql;
  create table result_test as
  select bounce.bouncedate,
  open.openflag,
  open.opendate,
  complain.complainflag,
  complain.complaindate,
    coalesce (sent.JobId, open.JobId, bounce.JobId, complain.JobID) as JobId
  , coalesce (sent.EmailAddress, open.EmailAddress, bounce.EmailAddress, complain.EmailAddress) as EmailAddress
  , case 
      when bounce.JobID is null or bounce.EmailAddress is null then '0'
      else bounce.bounceflag
    end as bounceflag
    ,case when open.JobID is null or open.EmailAddress is null then '0'
    else open.openflag end as openflag,
    case when complain.JobID is null or complain.EmailAddress is null then'0'
    else complain.complainflag
    end as complainflag
  from
    sent
  full join 
    bounce
  on
    sent.JobId = bounce.JobId AND
    sent.EmailAddress = bounce.EmailAddress
    full join 
    open on
    sent.JobId = open.JobId AND
    sent.EmailAddress = open.EmailAddress
    full join 
    complain on 
    sent.JobID= complain.JobID AND 
    sent.EmailAddress= complain.EmailAddress
  ;**

 Capture.PNG

2.PNG(this is the wrong extra record example. As you can see there shouldn't be two same complaindates.. 

Reeza
Super User

It all depends on the question you're trying to answer and you knowledge of the business process and data. 

 

We don't have that information. 

 


@yichentian226 wrote:

Thanks those are great ideas, however I do need to bring some of those values in.. according to the requirements. I guess my biggest question here is one sent email might have two opens depending on when users opened it, how many times they opened. Capture.PNG


 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 1119 views
  • 1 like
  • 4 in conversation