BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
sasgorilla
Pyrite | Level 9

I am using PROC SQL to try to bring some additional variables from each data1 (var1-5) observation into a new dataset when there is a match with a data2 observation across var1-3.

 

(For my real world data, data2 is the entire sample of subjects and data1 is a small percentage of the sample who had some additional measures. There is no common ID to join by, so I am trying to join where 6 demographic variables are the same in both datasets.)

 

Goals: 

1. Join on a one-for-one basis (i.e. one data1 observation is joined with one data2 observation, and var4-5 from data1 are brought into new dataset).

2. Retain all data2 observations where there was no match, with missing variables for var4-5.

3. Know when/which data1 observations did not have a match, and were thus not included in the new data set.

 

Problems: 

1. My current code is bringing data1 var4-5 to ALL data2 observations where var1-3 match. Is a left join the wrong operation altogether, or are my x=y conditions able to be modified to restrict to one match? 

2. This works with my current code. 

3. Haven't figured out a clean way to identify this, so not included in my code. 

 

Sample code below (SAS 9.4): 

data data1;
input var1 var2 $ var3 var4 var5;
cards;
1 A 1 1 1
1 B 2 2 2
1 C 3 3 3
1 D 4 4 4
;
run;

 

data data2;
input var1 var2 $ var3;
cards;
1 A 1
1 A 0
1 B 2
1 B 2
1 C 3
1 C 4
;
run;

 

proc sql noprint;
create table new as
select * from
data2 x left join data1 y
on x.var1 = y.var1 and x.var2 = y.var2 and x.var3 = y.var3;
quit;

 

New Dataset:

var1 var2 var3 var4 var5

1 A 0 . . 

1 A 1 1 1

1 B 2 2 2

1 B 2 2 2  *This is what I am trying to eliminate. If no other matches in data1 this line should be "1 B 2 . ."*

1 C 3 3 3 

1 C 4 . . 

 

*I would like to be able to see that observation 4 from data1 was not included in the new dataset (1 D 4 4 4).*

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

SAS provides tools that will tell you which data sets contribute to the current observation. The IN= data set option creates a temporary variable of the name you choose that will have values of 1, the observation comes from this data set, or 0. To require observations that come from data2, whether there is a match in data 1 use the example below.

 

data merged;
merge data2 (in=indata2) data1 (in=indata1);
by var1 var2 var3 var4 var5 var6;
if not (first.var6) then call missing (var7, var8, var9, var10);
if indata2;
run;

This is a sub-setting if. An IF without a then selects the observations where the statement is true for inclusion in the result data set.

If you require observations to come from both data sets use : if indata2 and indata1;

If you want the ones from data2 that do not have a match: if indata2 and not (indata1);

 

 

View solution in original post

4 REPLIES 4
ballardw
Super User

"Restrict to first" is not really an SQL concept. The underlying idea for SQL is set operations and order isn't considered.

 

For your example a data step approach works:

data data1;
input var1 var2 $ var3 var4 var5;
cards;
1 A 1 1 1
1 B 2 2 2
1 C 3 3 3
1 D 4 4 4
;
run;

data data2;
input var1 var2 $ var3;
cards;
1 A 1
1 A 0
1 B 2
1 B 2
1 C 3
1 C 4
;
run;

proc sort data=data2;
  by var1 var2 var3;
run;

data want;
   merge data2 data1;
   by var1 var2 var3;
   if not (first.var3) then call missing(var4, var5);
run;

The proc sort is needed to use the BY Var1 var2 var3 in the code for want. I suspect the option that might allow this to work without the sort using the NOTSORTED option may have issues with your real data.

The data step, NOT proc sql, creates automatic variables for each variable on the BY statement to indicate whether the value is the first of the by group. You access those using First.variablename or Last.variablename. The values are numeric 1/0 and are used by SAS as True/False.

The merge will combine the values so we can tell SAS to reset the values to missing for those not on the actual match if the match is not the first one.

Call Missing is a special function whose sole purpose is to set a list of variables to missing. As such it doesn't care whether the variables of any particular type and you can mix types if needed.

 

A secondary though similar approach would be to add an order value to the Data2 set and testing the value of the order variable after the combination. This may be workable with SQL but would require multiple CASE statements, on for each variable to reset to missing the other values.

sasgorilla
Pyrite | Level 9

@ballardw Thank you for the clarity on PROC SQL, and the data step suggestion!

 

In applying the data step code to my real data, it appears to have worked when the variables of interest matched between data1 and data2, bringing in the additional variables from data1 and assigning missing values on unmatched observations from data2.

 

However, the new data set also includes the observations from data1 where there was no match in data2, thus increasing the size of my new data set to be greater than data2 (100% sample).  

 

I'm trying to figure out why this is. 

 

The code I used after sorting each dataset by var1-6 to be matched by: 

 

data merged;
merge data2 data1;
by var1 var2 var3 var4 var5 var6;
if not (first.var6) then call missing (var7, var8, var9, var10);
run;

ballardw
Super User

SAS provides tools that will tell you which data sets contribute to the current observation. The IN= data set option creates a temporary variable of the name you choose that will have values of 1, the observation comes from this data set, or 0. To require observations that come from data2, whether there is a match in data 1 use the example below.

 

data merged;
merge data2 (in=indata2) data1 (in=indata1);
by var1 var2 var3 var4 var5 var6;
if not (first.var6) then call missing (var7, var8, var9, var10);
if indata2;
run;

This is a sub-setting if. An IF without a then selects the observations where the statement is true for inclusion in the result data set.

If you require observations to come from both data sets use : if indata2 and indata1;

If you want the ones from data2 that do not have a match: if indata2 and not (indata1);

 

 

sasgorilla
Pyrite | Level 9

@ballardw Thank you for your help with this. I appreciate it!

sas-innovate-white.png

Missed SAS Innovate in Orlando?

Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.

 

Register now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 1457 views
  • 2 likes
  • 2 in conversation