Have been putting out fires that resulted from using a Proc SQL join instead of a SAS data step merge. I'm wondering if I have found a bug in SAS PROC SQL, as this merge did not behave as I would have expected, and did not behave as it would have if I was writing SQL code in SSMS.
Example offending code:
proc sql; create table fup_timing as select a.*, t.Procedure_Date as t_Proc_Date label = "Followup Timing: Date Procedure from Proc Form", t.Schedule as t_ScheduleCat label = "Followup Timing: Standard vs. Specialized", from followup as a left join followup_timing as t on a.FormID_FUP= t.FormID_FUP; quit;
There are some rows with missing FORMID_FUP in the followup data by design. Because there are also missing FORMID_FUPs in followup_timing, we are getting an unexpected merge where N rows from followup_timing are being pulled over and merged on missing id with followup. This is causing duplication of the rows in followup that are missing the ID, as well as erroneous data being merged in from followup_timing for those rows.
By doing a simple SAS merge, I get the behavior I expect where I have a left side merge that only pulls over valid values for FORMID_FUP that are in both followup and followup_timing
data followup_test; set followup; run; data fuptim; set followup_timing; run; proc sort data=followup_test; by FORMID_FUP; run; proc sort data=fuptim; by FORMID_FUP; run; data rslt; merge followup_test (in=a) fuptim (in=b); by FORMID_FUP; if a; run;
I work with some talented web developers who write more SQL than I do, and even their suggestion of turning the PROC SQL join into an inner join unexpectedly merges on the missing values and results in duplication of rows.
Is this a bug? Or is this some fundamental misunderstanding on our part of how SAS handles missing KEY_ID values?
Without a representative portion of the actual data used, it is unlikely that anyone can figure this out. Please provide a representative portion of the actual data as working SAS data step code (examples and instructions), and NOT in any other form. Excel files and copy and paste from Excel are NOT acceptable. If the data is confidential, then you can manually change the data as needed.
Your left join is generating a cartesian cross of instances of FORMID_FUP=. in both datasets if you have multiple instances of missing value in the LEFT dataset. MERGE ... BY, on the other hand, does not do this.
And since you suggest, in the case of NON-missing FORMID_FUP, that using MERGE ...; BY FORMID_FUP; yields what you want (and what you get) from the PROC SQL ... LEFT JOIN, then it must be that the LEFT dataset has exactly one record per non-missing FORMID_FUP value.
My question is why do you want cases with missing FORMID_FUP? Why not exclude those cases from the join? ... As in (see the "where=" dataset name parameters below):
proc sql;
		create table fup_timing as
		select a.*,
				t.Procedure_Date as t_Proc_Date label = "Followup Timing:  Date Procedure from Proc Form",
				t.Schedule as t_ScheduleCat label = "Followup Timing: Standard vs. Specialized",
		from followup (where=(formid_fup^=.))             as a
		left join followup_timing (where=(formid_fup^=.)) as t
		on a.FormID_FUP= t.FormID_FUP;
	quit; 
There are baseline rows concatenated with the followup data before this merge happens. The solution could be to just merge the timing in before the concatenation.
My solution was to just add a datastep before this and delete and rows with a missing FORMID_FUP.
The issue I have is that I have never seen, even in SQL, that a null value is merged to a null value during a join. It made me wonder is PROC SQL unique enough that this is expected, or is this unexpected behavior in how SQL is implemented in SAS?
In SAS, a missing value is still a value, and so it is also used as a "key" in a MERGE or a JOIN.
Going to have to disagree with you there, Kurt. If you code this to use: data want; merge followup (in=a) followup_timing (in=b); by FORMID_FUP; if a; run;
SAS does NOT merge based on the missing values of FORMID_FUP and the resulting want dataset does not include duplicated rows as a product of the missing ID.
@elmoreoocyte wrote:
Going to have to disagree with you there, Kurt. If you code this to use: data want; merge followup (in=a) followup_timing (in=b); by FORMID_FUP; if a; run;
SAS does NOT merge based on the missing values of FORMID_FUP and the resulting want dataset does not include duplicated rows as a product of the missing ID.
Can you explain what you mean by that? Because SAS absolutely does consider two missing values as equal and so will merge them. When one dataset has more observations for a given by level then the variables unique to the other dataset retain the values from the last observation read for the by level.
Try it yourself with an example.
missing az;
data single ;
  input x @@;
  a+1;
cards;
. A Z 1 100
;
data multiple;
  input x @@;
  b+1;
cards;
. . . Z 1 1 10 100
;
data both;
  merge single(in=in1) multiple(in=in2);
  by x;
  single=in1;
  multiple=in2;
run;
I tried it myself with the data I'm working with. Doing a left merge with SAS I had the 25k results I expected, no duplicated rows. Using proc sql join, SAS merges on the missing ID and duplicates rows.
@elmoreoocyte wrote:
I tried it myself with the data I'm working with. Doing a left merge with SAS I had the 25k results I expected, no duplicated rows. Using proc sql join, SAS merges on the missing ID and duplicates rows.
So that is what everyone is telling you to expect when doing an N to M merge.
With an SQL join each of the N observations from the left dataset is merged with each of the M observations from the right dataset so you get N x M observation. But In a data step merge the observations are instead matched in the order they appear so you will get the max(N ,M) observations.
If your output dataset has the same number of observations as the "left" dataset then it had the larger number of observations for each BY group level than the "right" dataset.
If the missing values are confusing you try an example with non-missing values to see the difference between how SQL and MERGE handle by groups with multiple observations.
I think something is being lost here. I'm not confused about what you're talking about, I've been programming in SAS for over a decade.
The thing that got me was that the SQL join in SAS does not behave the way it does in SQL. You said that a merge would not ignore the missing values in SAS, and I replied that it indeed DOES ignore those missing values, while a JOIN does not ignore them as I would expect from experience programming SQL in SQL environments.
The answer to the question came when someone said SAS doesn't use the same TRI level logic of other systems. So it's not a bug like I feared, it's just how SAS handles SQL joins.
As I said before it is a difference in what you consider a match.
For some reason most other systems that use SQL as their query language (mainly developed long after SAS was developed) have decided to extent the concept of missing data useful for statistical analysis into the realm of comparisons. SAS did not do that. In SAS comparisons are either true or false.
Inequality tests will also cause confusion when converting SQL code. SAS uses only two level logic for inequality tests. In SAS missing numeric values are smaller than any actual number. And there is an order among the special missing values.
And for character variables since SAS stores character strings in ASCII (now extended to UTF-8) the all blank value that is treated as missing for most things is also smaller than most strings. That is because the ASCII code for a space is '20'x and the smaller codes from '00'x to '1F'x are reserved fir control characters like tabs and linefeeds that do not normally appear in text.
@elmoreoocyte wrote:
I tried it myself with the data I'm working with. Doing a left merge with SAS I had the 25k results I expected, no duplicated rows. Using proc sql join, SAS merges on the missing ID and duplicates rows.
Which is what happens in a m:n situation. The DATA step MERGE will result in max(m,n) observations, the SQL JOIN in m*n observations. In both cases, missing values are used as valid keys.
Not only is a missing value still a value that can be used as a key, as @Kurt_Bremser says, but there are many other possible missing values. There are the 26 values of .A,.B, .... .Z, and also ._ (dot underscore). These can be used for special purposes, if you want. I don't think users would want SAS to automatically assume that such values should be ignored by default, as if they "have no value" in database tasks, even if they are "ignored" in statistical analysis.
@elmoreoocyte wrote:
...
The issue I have is that I have never seen, even in SQL, that a null value is merged to a null value during a join. It made me wonder is PROC SQL unique enough that this is expected, or is this unexpected behavior in how SQL is implemented in SAS?
That is your mistake there.
SAS does NOT implement the TRI level logic that is used in a lot of database systems. Comparison operators always return either TRUE or FALSE, never NULL.
In SAS two blank character values are equal. And numeric missing values are also equal. Although special missing values only match the same special missing value.
>My solution was to just add a datastep before this and delete and rows with a missing FORMID_FUP.
Why not remove these values during the join? Add a where clause on the left table.
> The issue I have is that I have never seen, even in SQL, that a null value is merged to a null value during a join.
A missing value is not a null value, as others have said. There are no null values in SAS.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
