BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Shayan2012
Quartz | Level 8

Hi All,

 

So, I have a kind of weired question that made me question my knowledge on SQL joins.

 

I have a dataset lets say HAVE, which I perform a left join on with another dataset, DATA like this:

 

proc sql;
create table WANT as
select h.*, d.x 

from have as h left join data as d

on h.id = d.id

quit;

 

I would have supposed that all of HAVE observations will be there in my WANT dataset. However, after the merge, I noticed that a few are missing. (the freq was 37 out of 20000 were missing.)

 

I redid the above code with a full join; still did not found them.

 

Then I just chose those disappearing ovservations, and tried to left join with the DATA, and I got zero.

 

Is this like left join not working perfectly? What can be causig this problem?

 

I really appreciate your helps since I am not that experienced in sql;

 

thanks;

1 ACCEPTED SOLUTION

Accepted Solutions
Shayan2012
Quartz | Level 8

UPDATE: Ok, I think I got it. Apparently the problem was coming from the fact that I was working with the options obs = 20000. at the beginning of my code. Apparently, the result will not be in a sorted way, so when truncating to 20000 options I will not see the other observations which incudes the specific observation I am looking for.

 

View solution in original post

9 REPLIES 9
Reeza
Super User

You don't have a WHERE on your query? 

 

Shayan2012
Quartz | Level 8
not really. Just the basic above code.
ballardw
Super User

Run proc freq on each data set for the variable ID. Look for different counts.

 

If you get different counts for levles of ID then you have a place to look for other data issues.

SASKiwi
PROC Star

Is your ID variable numeric by any chance? One possible explanation for missing rows is precision problems with numeric keys, which can only handle a maximum of about 16 digits accurately.

Shayan2012
Quartz | Level 8
thanks a lot,
It is numeric actually, but I not more than 12 digits. I should look into that anyways.
Steelers_In_DC
Barite | Level 11

I would use strip() to start on those missing records.  See why they aren't coming up.  You can try upcase() as well, something along those lines is the issue.  I assume it's the way the data was entered.

FreelanceReinh
Jade | Level 19

Hi @Shayan2012,

 

This sounds like a really weird situation and I'm curious to know the solution.

 

So, you are really working with SAS datasets? Or are there any views involved?

 

As @SASKiwi has mentioned, numeric precision problems can lead to non-matching IDs which actually should be equal. However, this would lead to unexpected missing values of variable X (from dataset DATA), but not to a loss of observations from the left table in a left join. (At least I wouldn't know how.)

 

Sometimes things are different from what they seem.

Example 1: It's possible to obtain the following log:

593  data _null_;
594  if 0 then set have nobs=n;
595  put n;
596  stop;
597  run;

20000
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


598  proc sql;
599  create table want as
600  select h.*, d.x
601  from have as h left join data as d
602  on h.id = d.id;
NOTE: Table WORK.WANT created, with 19963 rows and 2 columns.

603  quit;

Explanation: 37 observations had been marked as deleted (by a PROC SQL DELETE statement).

 

Example 2: With (unexpected) duplicate IDs in the second dataset, the result of a left join can convey the impression that the first dataset has more observations than it actually contains. So, it may seem surprising when the number of observations drops, once the duplicates have been eliminated in the second table. (However, this could not explain the situation of zero observations in the result, which you have described.)

 

Example 3: You are telling us that your numeric keys have "not more than 12 digits." Well, provided that we are not talking about integers, but about numbers with decimal places, how would you know how many digits they have? Formatted output using default format or any standard numeric format is deceiving. Also, no less than 99.84% of the numbers with up to 4 decimal places have no exact internal representation (on Windows systems). Virtually any given formatted number can have several different internal representations in SAS. And this includes (numbers which look like) integers.

 

Suggestion: Please share PROC CONTENTS output of that dataset with 37 observations and the log showing how a left join with this dataset as the left table resulted in "0 rows."

Steelers_In_DC
Barite | Level 11

I see the suggestion about proc freq, that's good.  If it's numbers try round().  I still think it's a problem with the way the data was entered.  ID numbers should be character as you are never going to add them.

 

Mark

Shayan2012
Quartz | Level 8

UPDATE: Ok, I think I got it. Apparently the problem was coming from the fact that I was working with the options obs = 20000. at the beginning of my code. Apparently, the result will not be in a sorted way, so when truncating to 20000 options I will not see the other observations which incudes the specific observation I am looking for.

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1794 views
  • 5 likes
  • 6 in conversation