Hello, I am having trouble with merging two dataset of unequal sample sizes but need to retain all variables of both datasets.
dataset a: N=1,500 and number of variables = 450
dataset b: N=150,000 and number of variables = 8
I want to do the following:
Code:
proc sort data=a;
by VAR1; run;
proc sort data=b;
by VAR1; run;
data final_data;
merge a(in=a) b;
if a;
by VAR1;
run;
Even though I was able to get the right N (1,500), my final dataset has VAR2-VAR8 variables missing from dataset b.
Is there a way to avoid this error? I need that information in my final dataset.
VAR1 | VAR2 | VAR3 | VAR4 | VAR5 | VAR6 | VAR7 | VAR8 |
101 |
|
|
|
|
|
|
|
104 |
|
|
|
|
|
|
|
115 |
|
|
|
|
|
| |
124 |
|
|
|
|
|
|
|
135 |
|
|
|
|
|
|
|
137 |
|
|
|
|
|
|
|
141 |
|
|
|
|
|
|
|
169 |
|
|
|
|
|
|
|
Is this referring to formatting, etc.? VAR1 in both dataset A and B are the same (Numerical, length=8, format z5.).
@lindst801 wrote:
Is this referring to formatting, etc.? VAR1 in both dataset A and B are the same (Numerical, length=8, format z5.).
No. It is the values of Var1. If there are no values that actually match then the added variables appear but since no matches occur they are missing.
Example:
data a; input var1 varb varc; datalines; 1 1 2 3 2 2 4 6 ; data b; input var1 var2 var3; datalines; 22 1 2 3 33 4 6 8 ; run; data example1; merge a (in=a) b; if a; by var1; run;
This example is small enough that you can tell by eye that no Var1 values are the same in both sets.
Remove the "if a;" to see a different result.
That is not how the dataset I am working with is set up. This is a case where it is the actual following (pulling from your example)
data a; input var1 varb varc; datalines; 1 1 2 3 2 2 4 6 ; data b; input var1 var2 var3; datalines; 2 1 2 3 3 4 6 8 ; run;
The final result from the program that they values did match (VAR1), the issue is that the reminder of the dataset b data is missing (VAR2-VAR8 is missing).
Once I had removed 'if a', not only do I not get matches (I do not get a final desired n of 1,500 but instead 150,000) but VAR2-VAR8 is set to missing as well.
Rather than hypotheticals, if you can show your full code and log that would be easier. Are there any warnings or notes in the log?
For the values to be empty it means one of two things is happening:
It's usually a variant on 1, where the data won't match because one is "A" versus "a", where the case is different.
I know you say you have numeric but given what you've stated these are the likely possibilities.
That being said, you could have not told us something that is affecting the results - therefore the request for the actual code and log to help diagnose the issue.
@lindst801 wrote:
That is not how the dataset I am working with is set up. This is a case where it is the actual following (pulling from your example)
data a; input var1 varb varc; datalines; 1 1 2 3 2 2 4 6 ; data b; input var1 var2 var3; datalines; 2 1 2 3 3 4 6 8 ; run;The final result from the program that they values did match (VAR1), the issue is that the reminder of the dataset b data is missing (VAR2-VAR8 is missing).
Once I had removed 'if a', not only do I not get matches (I do not get a final desired n of 1,500 but instead 150,000) but VAR2-VAR8 is set to missing as well.
@lindst801 wrote:
That is not how the dataset I am working with is set up. This is a case where it is the actual following (pulling from your example)
data a; input var1 varb varc; datalines; 1 1 2 3 2 2 4 6 ; data b; input var1 var2 var3; datalines; 2 1 2 3 3 4 6 8 ; run;The final result from the program that they values did match (VAR1), the issue is that the reminder of the dataset b data is missing (VAR2-VAR8 is missing).
Once I had removed 'if a', not only do I not get matches (I do not get a final desired n of 1,500 but instead 150,000) but VAR2-VAR8 is set to missing as well.
You have shown exactly 0 records of your actual data.
I will bet a short stack of $$$ that your data actual values do not match.
Or that var2-var8 are missing on the ones that do match. If the same variables appear in both sets the values in B would overwrite the ones from A. See this example:
data a; input var1 var2 var3; datalines; 1 1 2 3 2 2 4 6 ; data b; input var1 var2 var3; datalines; 1 . . . 2 . . . ; run; data example2; merge a (in=a) b; if a; by var1; run;
Here I have forced B to have missing values. Then on the matching records in A those missing values REPLACE the values in A.
Note that the order of the data sets on the Merge statement controls this behavior. Consider:
data example3; merge b a (in=a); if a; by var1; run;
So items to consider: Which variables are in which sets, order of merge, actual values of the BY variables.
Which is why Proc contents information was requested.
This is key observation.
Are you doing a 1-to-many or many-to-many merge?
Does either of the dataset have multiple observations per value of VAR1? (If you did the counts I suggested before does the number of observations match the number of distinct values?)
When doing a 1 to many match (essentially a lookup) the variables you are adding from the "1" dataset must not exist in the "many" dataset. Otherwise only the values on the first instance of this by variable group is updated. When the second, third etc observation for that group is read from the "many" dataset then the values read for those variables will overwrite whatever was read from the "1" dataset.
If you are using the Z5. format to display the numbers and they are not actually integers then they will LOOK the same, whether or not they are the same.
You could try rounding the value to integers and see if that changes anything. Use the ROUND() function with 1 for the second argument.
I have to use the z5. format to have padding for dataset A (Example, dataset A originally had '4056' when it should be '04056').
Since the information in VAR1 are already integers (ZIP codes), I am not sure rounding to an integer value would help.
Unfortunately, there has been no documentation in my log that there was an error. I only was able to find out that it was an error by conducting a proc print check.
Here is further information from PROC CONTENTS I can provide:
Data set name | Work.a | Observations | 1500 |
Member type | Data | Variables | 450 |
Engine | V9 | Indexes | 0 |
Created | ------ | Observation length | ---- |
Last Modified | ------ | Deleted observations | 0 |
Protection |
| Compressed | NO |
Data set type |
| Sorted | YES |
Label |
|
|
|
Data Representation | ---------- |
|
|
Encoding | -------- |
|
|
Sort information | |
Sortedby | VAR1 |
Validated | YES |
Character Set | ANSI |
Dataset b:
Data set name | Work.b | Observations | 150000 |
Member type | Data | Variables | 8 |
Engine | V9 | Indexes | 0 |
Created | ------ | Observation length | ---- |
Last Modified | ------ | Deleted observations | 0 |
Protection |
| Compressed | NO |
Data set type |
| Sorted | YES |
Label |
|
|
|
Data Representation | ---------- |
|
|
Encoding | -------- |
|
|
Sort information | |
Sortedby | VAR1 |
Validated | YES |
Character Set | ANSI |
If the values are ZIP codes then you should be storing them as strings not numbers. There is no meaning to the mean of a zip code. If someone has mistakenly stored them as a number you can use the Z format with the PUT() function to generate a new variable with the strings that include leading zeros.
If you did let SAS autoconvert numbers into strings, for example with code like:
length new_var $20;
new_var=old_var;
Then SAS will right align the digit string it generates to store into the character variable.
477 data test; 478 x=1234; 479 length y $20; 480 y=x; 481 len=length(y); 482 put x= y=:$quote. len=; 483 run; NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column). 480:5 x=1234 y=" 1234" len=20 NOTE: The data set WORK.TEST has 1 observations and 3 variables.
Check how many matches there are between your datasets to get a better sense of what your issue is.
proc sql;
select count(*) as nobs
, count(var1) as n_nonmissing
, count(distinct var1) as n_values
from A
;
select count(*) as nobs
, count(var1) as n_nonmissing
, count(distinct var1) as n_values
from B
;
select count(*) as nobs
, count(var1) as n_nonmissing
, count(distinct var1) as n_values
from B
where VAR1 in (select var1 from A)
;
select count(*) as nobs
, count(var1) as n_nonmissing
, count(distinct var1) as n_values
from A
where VAR1 in (select var1 from B)
;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.