- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello, I am having trouble with merging two dataset of unequal sample sizes but need to retain all variables of both datasets.
dataset a: N=1,500 and number of variables = 450
dataset b: N=150,000 and number of variables = 8
I want to do the following:
- Do a one to one merge by a variable (VAR1)
- I want all the information in dataset b (VAR2-VAR8) to merge with dataset a (Final N=1,500)
Code:
proc sort data=a;
by VAR1; run;
proc sort data=b;
by VAR1; run;
data final_data;
merge a(in=a) b;
if a;
by VAR1;
run;
Even though I was able to get the right N (1,500), my final dataset has VAR2-VAR8 variables missing from dataset b.
Is there a way to avoid this error? I need that information in my final dataset.
VAR1 | VAR2 | VAR3 | VAR4 | VAR5 | VAR6 | VAR7 | VAR8 |
101 |
|
|
|
|
|
|
|
104 |
|
|
|
|
|
|
|
115 |
|
|
|
|
|
| |
124 |
|
|
|
|
|
|
|
135 |
|
|
|
|
|
|
|
137 |
|
|
|
|
|
|
|
141 |
|
|
|
|
|
|
|
169 |
|
|
|
|
|
|
|
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Is this referring to formatting, etc.? VAR1 in both dataset A and B are the same (Numerical, length=8, format z5.).
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@lindst801 wrote:
Is this referring to formatting, etc.? VAR1 in both dataset A and B are the same (Numerical, length=8, format z5.).
No. It is the values of Var1. If there are no values that actually match then the added variables appear but since no matches occur they are missing.
Example:
data a; input var1 varb varc; datalines; 1 1 2 3 2 2 4 6 ; data b; input var1 var2 var3; datalines; 22 1 2 3 33 4 6 8 ; run; data example1; merge a (in=a) b; if a; by var1; run;
This example is small enough that you can tell by eye that no Var1 values are the same in both sets.
Remove the "if a;" to see a different result.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
That is not how the dataset I am working with is set up. This is a case where it is the actual following (pulling from your example)
data a; input var1 varb varc; datalines; 1 1 2 3 2 2 4 6 ; data b; input var1 var2 var3; datalines; 2 1 2 3 3 4 6 8 ; run;
The final result from the program that they values did match (VAR1), the issue is that the reminder of the dataset b data is missing (VAR2-VAR8 is missing).
Once I had removed 'if a', not only do I not get matches (I do not get a final desired n of 1,500 but instead 150,000) but VAR2-VAR8 is set to missing as well.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Rather than hypotheticals, if you can show your full code and log that would be easier. Are there any warnings or notes in the log?
For the values to be empty it means one of two things is happening:
- There is no match found in the other table so they get assigned empty
- The match is found on VAR1 but all the variables in the data set are empty for some reason (data collection issues?)
It's usually a variant on 1, where the data won't match because one is "A" versus "a", where the case is different.
I know you say you have numeric but given what you've stated these are the likely possibilities.
That being said, you could have not told us something that is affecting the results - therefore the request for the actual code and log to help diagnose the issue.
@lindst801 wrote:
That is not how the dataset I am working with is set up. This is a case where it is the actual following (pulling from your example)
data a; input var1 varb varc; datalines; 1 1 2 3 2 2 4 6 ; data b; input var1 var2 var3; datalines; 2 1 2 3 3 4 6 8 ; run;The final result from the program that they values did match (VAR1), the issue is that the reminder of the dataset b data is missing (VAR2-VAR8 is missing).
Once I had removed 'if a', not only do I not get matches (I do not get a final desired n of 1,500 but instead 150,000) but VAR2-VAR8 is set to missing as well.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@lindst801 wrote:
That is not how the dataset I am working with is set up. This is a case where it is the actual following (pulling from your example)
data a; input var1 varb varc; datalines; 1 1 2 3 2 2 4 6 ; data b; input var1 var2 var3; datalines; 2 1 2 3 3 4 6 8 ; run;The final result from the program that they values did match (VAR1), the issue is that the reminder of the dataset b data is missing (VAR2-VAR8 is missing).
Once I had removed 'if a', not only do I not get matches (I do not get a final desired n of 1,500 but instead 150,000) but VAR2-VAR8 is set to missing as well.
You have shown exactly 0 records of your actual data.
I will bet a short stack of $$$ that your data actual values do not match.
Or that var2-var8 are missing on the ones that do match. If the same variables appear in both sets the values in B would overwrite the ones from A. See this example:
data a; input var1 var2 var3; datalines; 1 1 2 3 2 2 4 6 ; data b; input var1 var2 var3; datalines; 1 . . . 2 . . . ; run; data example2; merge a (in=a) b; if a; by var1; run;
Here I have forced B to have missing values. Then on the matching records in A those missing values REPLACE the values in A.
Note that the order of the data sets on the Merge statement controls this behavior. Consider:
data example3; merge b a (in=a); if a; by var1; run;
So items to consider: Which variables are in which sets, order of merge, actual values of the BY variables.
Which is why Proc contents information was requested.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This is key observation.
Are you doing a 1-to-many or many-to-many merge?
Does either of the dataset have multiple observations per value of VAR1? (If you did the counts I suggested before does the number of observations match the number of distinct values?)
When doing a 1 to many match (essentially a lookup) the variables you are adding from the "1" dataset must not exist in the "many" dataset. Otherwise only the values on the first instance of this by variable group is updated. When the second, third etc observation for that group is read from the "many" dataset then the values read for those variables will overwrite whatever was read from the "1" dataset.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you are using the Z5. format to display the numbers and they are not actually integers then they will LOOK the same, whether or not they are the same.
You could try rounding the value to integers and see if that changes anything. Use the ROUND() function with 1 for the second argument.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have to use the z5. format to have padding for dataset A (Example, dataset A originally had '4056' when it should be '04056').
Since the information in VAR1 are already integers (ZIP codes), I am not sure rounding to an integer value would help.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
And the log would have had errors if that was the case as well.
Please post the proc contents from the data sets you're trying to merge.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Unfortunately, there has been no documentation in my log that there was an error. I only was able to find out that it was an error by conducting a proc print check.
Here is further information from PROC CONTENTS I can provide:
Data set name | Work.a | Observations | 1500 |
Member type | Data | Variables | 450 |
Engine | V9 | Indexes | 0 |
Created | ------ | Observation length | ---- |
Last Modified | ------ | Deleted observations | 0 |
Protection |
| Compressed | NO |
Data set type |
| Sorted | YES |
Label |
|
|
|
Data Representation | ---------- |
|
|
Encoding | -------- |
|
|
Sort information | |
Sortedby | VAR1 |
Validated | YES |
Character Set | ANSI |
Dataset b:
Data set name | Work.b | Observations | 150000 |
Member type | Data | Variables | 8 |
Engine | V9 | Indexes | 0 |
Created | ------ | Observation length | ---- |
Last Modified | ------ | Deleted observations | 0 |
Protection |
| Compressed | NO |
Data set type |
| Sorted | YES |
Label |
|
|
|
Data Representation | ---------- |
|
|
Encoding | -------- |
|
|
Sort information | |
Sortedby | VAR1 |
Validated | YES |
Character Set | ANSI |
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If the values are ZIP codes then you should be storing them as strings not numbers. There is no meaning to the mean of a zip code. If someone has mistakenly stored them as a number you can use the Z format with the PUT() function to generate a new variable with the strings that include leading zeros.
If you did let SAS autoconvert numbers into strings, for example with code like:
length new_var $20;
new_var=old_var;
Then SAS will right align the digit string it generates to store into the character variable.
477 data test; 478 x=1234; 479 length y $20; 480 y=x; 481 len=length(y); 482 put x= y=:$quote. len=; 483 run; NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column). 480:5 x=1234 y=" 1234" len=20 NOTE: The data set WORK.TEST has 1 observations and 3 variables.
Check how many matches there are between your datasets to get a better sense of what your issue is.
proc sql;
select count(*) as nobs
, count(var1) as n_nonmissing
, count(distinct var1) as n_values
from A
;
select count(*) as nobs
, count(var1) as n_nonmissing
, count(distinct var1) as n_values
from B
;
select count(*) as nobs
, count(var1) as n_nonmissing
, count(distinct var1) as n_values
from B
where VAR1 in (select var1 from A)
;
select count(*) as nobs
, count(var1) as n_nonmissing
, count(distinct var1) as n_values
from A
where VAR1 in (select var1 from B)
;