Solved: How does data step merge handle columns with the same name?

mcmaxwell · Posted 10-29-2020 04:30 PM

I have two datasets (SM1 and SM3) that I inner merged in SAS Studio using the following code:

/** Import the CSV file. **/ 
FILENAME CSV "/folders/myfolders/sasuser.v94/SM1 Coded Data.csv" TERMSTR=CRLF; 
PROC IMPORT DATAFILE=CSV 
		    OUT=SM1 
		    DBMS=CSV 
		    REPLACE; 
RUN; 
 
/**Keep only the variables we want for this test**/  
DATA SM1;   
	set SM1;   
	keep SUID LVL ESL URM SE1_A TV1_A IGO1_A EGO1_A CLB1_A;   
run;  
 
/** sort data **/  
proc sort data=SM1;  
by SUID;  
run;  

/** Import the CSV file. **/ 
FILENAME CSV "/folders/myfolders/sasuser.v94/SM3 Coded Data.csv" TERMSTR=CRLF; 
PROC IMPORT DATAFILE=CSV 
		    OUT=SM3 
		    DBMS=CSV 
		    REPLACE; 
RUN; 
 
/**Keep only the variables we want for this test**/  
DATA SM3;   
	set SM3;   
	keep SUID LVL ESL URM SE3_A TV3_A IGO3_A EGO3_A CLB3_A;   
run;  
 
/** sort data **/  
proc sort data=SM3;  
by SUID;  
run;  
 
/** Inner Merge Data **/  
data all;  
	merge SM1(in = left) SM3(in = right);  
	by SUID;  
	if left and right;  
	run;

As you can see...

SM1 contains the variables SUID, LVL, ESL, URM, SE1_A, TV1_A, IGO1_A, EGO1_A, and CLB1_A.

SM3 contains the variables SUID, LVL, ESL, URM, SE3_A, TV3_A, IGO3_A, EGO3_A, and CLB3_A.

In the last section of code I am asking SAS to merge the two data sets and only keep observations where the same SUID is found in both SM1 and SM2.

Here is my question: How does SAS handle other columns that have the same name (LVL, ESL, and URM)? For example, does it save the data from SM1 or SM2? What if for a certain observation, there is data in the LVL column for SM2 but not SM1 (and vice versa)? Can you tell SAS specifically how you would like it to handle these different scenarios?

Here is a picture of what the output looks like:

Any information would be greatly appreciated! Thank you in advance!

Astounding · Posted 10-29-2020 04:50 PM

Which values do you get when both data sets contain information for the same variable? The answer is actually quite complex. However, it becomes simpler if both incoming data sets have at most one observation per SUID. In that case, since you are taking matches only and not mismatches, you get the value from the SM3 data set (including a missing value of there is no data in SM3 for that variable).

For all variables, to say, "Use the value from SM3. But if SM3 has a missing value, use the value from SM1." you can ask for that by replacing the MERGE statement:

update SM1(in = left) SM3(in = right);

For all variables, to say, "Use the value from SM1. But if SM1 has a missing value, use the value from SM3." you can ask for that again by replacing the MERGE statement:

update SM3(in = left) SM1(in = right);

Again, for any of these to be reliable, you must be sure that a SUID never appears more than once in the same data set.

View solution in original post

Astounding · Posted 10-29-2020 04:50 PM

Which values do you get when both data sets contain information for the same variable? The answer is actually quite complex. However, it becomes simpler if both incoming data sets have at most one observation per SUID. In that case, since you are taking matches only and not mismatches, you get the value from the SM3 data set (including a missing value of there is no data in SM3 for that variable).

For all variables, to say, "Use the value from SM3. But if SM3 has a missing value, use the value from SM1." you can ask for that by replacing the MERGE statement:

update SM1(in = left) SM3(in = right);

For all variables, to say, "Use the value from SM1. But if SM1 has a missing value, use the value from SM3." you can ask for that again by replacing the MERGE statement:

update SM3(in = left) SM1(in = right);

Again, for any of these to be reliable, you must be sure that a SUID never appears more than once in the same data set.

Tom · Posted 10-29-2020 05:04 PM

The last one read from "wins". In one to one merge this is the last dataset listed in the MERGE statement that has the variable. In Many to Many (of which One to Many is a simplified case) it is a little more complicated if the a dataset listed earlier in the MERGE statement is contributing more observations to the BY group then the values read from it after the other dataset(s) run out of observations is what will win.

Note also that attributes of the variable are determined by how they are first seen. So the type and length of the variable is determined by the first dataset with the variable. FORMAT, INFORMAT and LABEL are determined by the value in the first dataset that actually has a value for that attribute.

How does data step merge handle columns with the same name?

Re: How does data step merge handle columns with the same name?

Re: How does data step merge handle columns with the same name?

Re: How does data step merge handle columns with the same name?