Hi, below is my query and on the screenshot it's how the two datasets look like and what I want to get.
proc sql;
create table withbounceclick as
select e.*,
b.URL, b.LinkName, b.clickflag, b.ClickDate
from email.withbounce_new1 e
left join email.EnterpriseDataViewClick b on (b.EmailAddress = e.EmailAddress and b.JobId=e.JobId);
QUIT;
However, this is giving me more rolls comparing to dataset: email.withbounce_new1... this is confusing to me because I was expecting to see same number of rolls in dataset: withbounceclick and those in withbounce_new1. Thanks in advance!
Yes I do have duplicate emailaddress and jobid, I thought on (b.EmailAddress = e.EmailAddress and b.JobId=e.JobId) could prevent but apparently it didn't. Can you please share the right condition I should write?Thanks very much
@yichentian226 wrote:
Yes I do have duplicate emailaddress and jobid, I thought on (b.EmailAddress = e.EmailAddress and b.JobId=e.JobId) could prevent but apparently it didn't. Can you please share the right condition I should write?Thanks very much
So getting multiple records means that you have duplicates of both variables in combination.
Not efficient but you can try
Select Distinct e.*
which should yield non-dupllcated rows. If you expect to only have one email and jobid combination in the output then you may need to provide more rules of how to determine which is supposed to be the single output record.
Thanks! So my dataset is kinda like that, let's say dataset A has jobid, emailaddress.clickflag and clickdate, datasetB has jobid, emailaddressa and sentdate. Dataset B has more records because it has all the emails we sent. And of course many duplicate jobid and emailaddress. I am not sure if I can use distinct in this case?
@yichentian226 wrote:
Yes I do have duplicate emailaddress and jobid, I thought on (b.EmailAddress = e.EmailAddress and b.JobId=e.JobId) could prevent but apparently it didn't. Can you please share the right condition I should write?Thanks very much
That requires knowledge of the data, and we don't have that. That's part of your requirements.
If you have other variables in your data set the distinct won't help either because you may not have exact duplicates. This is a logical business decision, not a programming question but one YOU need to answer. Is there another field that uniquely identifies each row that is common? Does it need to be based on some date range - very common in medical and insurance data? If the other fields don't matter you can do a distinct but if they do then you need to make your business decisions. And not all variables may be treated the same - some you may need to summarize with a sum, but others may need to use a mean or median.
This is how the data looks like: dataset A has jobid, emailaddress.clickflag and clickdate, datasetB has jobid, emailaddress and sentdate. Dataset B has more records because it has all the emails we sent. So here it definitely makes sense the duplicate you mentioned, but I'm thinking if we can make use of the date to avoid duplicate?
Thanks those are great ideas, however I do need to bring some of those values in.. according to the requirements. I guess my biggest question here is one sent email might have two opens depending on when users opened it, how many times they opened.
Don't think about tools yet. You have to clarify other things first.
Once these questions can be answered sufficiently, we can think of ways to summarize the datasets down to create a one-to-one join, or we find that such cannot be done correctly because of missing information in the datasets.
Ideally, you post (made-up) examples for your datasets in data steps with datalines that contain all the possible combinations you find in your real data, and show us what you want to get out of these cases.
Hi,
Thanks for your reply. I have listed the information in excel sheet to make it more clear. So sent table has all the sent email information. Open table tracked people opening the emails we sent. Complain table has the records of users complaining about our sent emails. This is my code here, it worked perfectly fine when I am only joining two tables, however giving extra wrong records when I am joining more than 2 tables...
**proc sql;
create table result_test as
select bounce.bouncedate,
open.openflag,
open.opendate,
complain.complainflag,
complain.complaindate,
coalesce (sent.JobId, open.JobId, bounce.JobId, complain.JobID) as JobId
, coalesce (sent.EmailAddress, open.EmailAddress, bounce.EmailAddress, complain.EmailAddress) as EmailAddress
, case
when bounce.JobID is null or bounce.EmailAddress is null then '0'
else bounce.bounceflag
end as bounceflag
,case when open.JobID is null or open.EmailAddress is null then '0'
else open.openflag end as openflag,
case when complain.JobID is null or complain.EmailAddress is null then'0'
else complain.complainflag
end as complainflag
from
sent
full join
bounce
on
sent.JobId = bounce.JobId AND
sent.EmailAddress = bounce.EmailAddress
full join
open on
sent.JobId = open.JobId AND
sent.EmailAddress = open.EmailAddress
full join
complain on
sent.JobID= complain.JobID AND
sent.EmailAddress= complain.EmailAddress
;**
(this is the wrong extra record example. As you can see there shouldn't be two same complaindates..
I can't use pictures as data for testing. Please supply your example data in usable form, in data steps with datalines.
It all depends on the question you're trying to answer and you knowledge of the business process and data.
We don't have that information.
@yichentian226 wrote:
Thanks those are great ideas, however I do need to bring some of those values in.. according to the requirements. I guess my biggest question here is one sent email might have two opens depending on when users opened it, how many times they opened.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.