Re: Issue with data merge, losing information from one dataset that I ...

lindst801 · Posted 06-23-2021 03:18 PM

Hello, I am having trouble with merging two dataset of unequal sample sizes but need to retain all variables of both datasets.

dataset a: N=1,500 and number of variables = 450

dataset b: N=150,000 and number of variables = 8

I want to do the following:

Do a one to one merge by a variable (VAR1)
I want all the information in dataset b (VAR2-VAR8) to merge with dataset a (Final N=1,500)

Code:

proc sort data=a;
by VAR1; run;
proc sort data=b;
by VAR1; run;

data final_data;
merge a(in=a) b;
if a;
by VAR1;
run;

Even though I was able to get the right N (1,500), my final dataset has VAR2-VAR8 variables missing from dataset b.
Is there a way to avoid this error? I need that information in my final dataset.

VAR1	VAR2	VAR3	VAR4	VAR5	VAR6	VAR7	VAR8
101
104
115
124
135
137
141
169

data_null__ · Posted 06-23-2021 03:25 PM

I'm guessing the BY variable VALUES in A and B are different even though they may LOOK the same. Perhaps leading spaces if character. Also I usually put the sub-setting IF after the BY but I don't' think that is the problem.

lindst801 · Posted 06-23-2021 03:30 PM

Is this referring to formatting, etc.? VAR1 in both dataset A and B are the same (Numerical, length=8, format z5.).

ballardw · Posted 06-23-2021 03:51 PM

@lindst801 wrote:

Is this referring to formatting, etc.? VAR1 in both dataset A and B are the same (Numerical, length=8, format z5.).

No. It is the values of Var1. If there are no values that actually match then the added variables appear but since no matches occur they are missing.

Example:

data a;
   input var1 varb varc;
datalines;
1  1 2 3 
2  2 4 6
;

data b;
   input var1 var2 var3;
datalines;
22  1 2 3
33  4 6 8
;
run;

data example1;
  merge a (in=a)  b;
  if a;
  by var1;
run;

This example is small enough that you can tell by eye that no Var1 values are the same in both sets.

Remove the "if a;" to see a different result.

lindst801 · Posted 06-23-2021 03:57 PM

That is not how the dataset I am working with is set up. This is a case where it is the actual following (pulling from your example)

data a;
   input var1 varb varc;
datalines;
1  1 2 3 
2  2 4 6
;

data b;
   input var1 var2 var3;
datalines;
2  1 2 3
3  4 6 8
;
run;

The final result from the program that they values did match (VAR1), the issue is that the reminder of the dataset b data is missing (VAR2-VAR8 is missing).

Once I had removed 'if a', not only do I not get matches (I do not get a final desired n of 1,500 but instead 150,000) but VAR2-VAR8 is set to missing as well.

Reeza · Posted 06-23-2021 04:03 PM

Rather than hypotheticals, if you can show your full code and log that would be easier. Are there any warnings or notes in the log?

For the values to be empty it means one of two things is happening:

There is no match found in the other table so they get assigned empty
The match is found on VAR1 but all the variables in the data set are empty for some reason (data collection issues?)

It's usually a variant on 1, where the data won't match because one is "A" versus "a", where the case is different.

I know you say you have numeric but given what you've stated these are the likely possibilities.

That being said, you could have not told us something that is affecting the results - therefore the request for the actual code and log to help diagnose the issue.

@lindst801 wrote:

That is not how the dataset I am working with is set up. This is a case where it is the actual following (pulling from your example)
data a;
   input var1 varb varc;
datalines;
1  1 2 3 
2  2 4 6
;

data b;
   input var1 var2 var3;
datalines;
2  1 2 3
3  4 6 8
;
run;
The final result from the program that they values did match (VAR1), the issue is that the reminder of the dataset b data is missing (VAR2-VAR8 is missing).

Once I had removed 'if a', not only do I not get matches (I do not get a final desired n of 1,500 but instead 150,000) but VAR2-VAR8 is set to missing as well.

ballardw · Posted 06-23-2021 04:36 PM

@lindst801 wrote:

That is not how the dataset I am working with is set up. This is a case where it is the actual following (pulling from your example)
data a;
   input var1 varb varc;
datalines;
1  1 2 3 
2  2 4 6
;

data b;
   input var1 var2 var3;
datalines;
2  1 2 3
3  4 6 8
;
run;
The final result from the program that they values did match (VAR1), the issue is that the reminder of the dataset b data is missing (VAR2-VAR8 is missing).

Once I had removed 'if a', not only do I not get matches (I do not get a final desired n of 1,500 but instead 150,000) but VAR2-VAR8 is set to missing as well.

You have shown exactly 0 records of your actual data.

I will bet a short stack of $$$ that your data actual values do not match.

Or that var2-var8 are missing on the ones that do match. If the same variables appear in both sets the values in B would overwrite the ones from A. See this example:

data a;
   input var1 var2 var3;
datalines;
1  1 2 3 
2  2 4 6
;

data b;
   input var1 var2 var3;
datalines;
1 . . .
2 . . . 
;
run;

data example2;
   merge a (in=a) b;
   if a;
   by var1;
run;

Here I have forced B to have missing values. Then on the matching records in A those missing values REPLACE the values in A.

Note that the order of the data sets on the Merge statement controls this behavior. Consider:

data example3;
   merge b  a (in=a);
   if a;
   by var1;
run;

So items to consider: Which variables are in which sets, order of merge, actual values of the BY variables.

Which is why Proc contents information was requested.

Tom · Posted 06-23-2021 04:59 PM

This is key observation.

Are you doing a 1-to-many or many-to-many merge?

Does either of the dataset have multiple observations per value of VAR1? (If you did the counts I suggested before does the number of observations match the number of distinct values?)

When doing a 1 to many match (essentially a lookup) the variables you are adding from the "1" dataset must not exist in the "many" dataset. Otherwise only the values on the first instance of this by variable group is updated. When the second, third etc observation for that group is read from the "many" dataset then the values read for those variables will overwrite whatever was read from the "1" dataset.

Tom · Posted 06-23-2021 03:53 PM

If you are using the Z5. format to display the numbers and they are not actually integers then they will LOOK the same, whether or not they are the same.

You could try rounding the value to integers and see if that changes anything. Use the ROUND() function with 1 for the second argument.

lindst801 · Posted 06-23-2021 04:04 PM

I have to use the z5. format to have padding for dataset A (Example, dataset A originally had '4056' when it should be '04056').

Since the information in VAR1 are already integers (ZIP codes), I am not sure rounding to an integer value would help.

Reeza · Posted 06-23-2021 04:10 PM

If both were numbers that wouldn't matter and formats don't affect merging.
And the log would have had errors if that was the case as well.

Please post the proc contents from the data sets you're trying to merge.

lindst801 · Posted 06-23-2021 04:25 PM

Unfortunately, there has been no documentation in my log that there was an error. I only was able to find out that it was an error by conducting a proc print check.

Here is further information from PROC CONTENTS I can provide:

Data set name	Work.a	Observations	1500
Member type	Data	Variables	450
Engine	V9	Indexes	0
Created	------	Observation length	----
Last Modified	------	Deleted observations	0
Protection		Compressed	NO
Data set type		Sorted	YES
Label
Data Representation	----------
Encoding	--------

Sort information
Sortedby	VAR1
Validated	YES
Character Set	ANSI

Dataset b:

Data set name	Work.b	Observations	150000
Member type	Data	Variables	8
Engine	V9	Indexes	0
Created	------	Observation length	----
Last Modified	------	Deleted observations	0
Protection		Compressed	NO
Data set type		Sorted	YES
Label
Data Representation	----------
Encoding	--------

Sort information
Sortedby	VAR1
Validated	YES
Character Set	ANSI

Reeza · Posted 06-23-2021 04:32 PM

You need to provide the format type and label information for VAR1 in each data set as well.

Reeza · Posted 06-23-2021 04:33 PM

Or work with Tech Support (if you have a paid license) as they can take and handle private data, anything posted here is public.

Tom · Posted 06-23-2021 04:22 PM

If the values are ZIP codes then you should be storing them as strings not numbers. There is no meaning to the mean of a zip code. If someone has mistakenly stored them as a number you can use the Z format with the PUT() function to generate a new variable with the strings that include leading zeros.

If you did let SAS autoconvert numbers into strings, for example with code like:

length new_var $20;
new_var=old_var;

Then SAS will right align the digit string it generates to store into the character variable.

477   data test;
478     x=1234;
479     length y $20;
480     y=x;
481     len=length(y);
482     put x= y=:$quote. len=;
483   run;

NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column).
      480:5
x=1234 y="                1234" len=20
NOTE: The data set WORK.TEST has 1 observations and 3 variables.

Check how many matches there are between your datasets to get a better sense of what your issue is.

proc sql;
select count(*) as nobs
     , count(var1) as n_nonmissing
     , count(distinct var1) as n_values
from A
;
select count(*) as nobs
     , count(var1) as n_nonmissing
     , count(distinct var1) as n_values
from B
;
select count(*) as nobs
     , count(var1) as n_nonmissing
     , count(distinct var1) as n_values
from B
where VAR1 in (select var1 from A)
;
select count(*) as nobs
     , count(var1) as n_nonmissing
     , count(distinct var1) as n_values
from A
where VAR1 in (select var1 from B)
;

Issue with data merge, losing information from one dataset that I needed

Re: Issue with data merge, losing information from one dataset that I needed

Re: Issue with data merge, losing information from one dataset that I needed

Re: Issue with data merge, losing information from one dataset that I needed

Re: Issue with data merge, losing information from one dataset that I needed

Re: Issue with data merge, losing information from one dataset that I needed

Re: Issue with data merge, losing information from one dataset that I needed

Re: Issue with data merge, losing information from one dataset that I needed

Re: Issue with data merge, losing information from one dataset that I needed

Re: Issue with data merge, losing information from one dataset that I needed

Re: Issue with data merge, losing information from one dataset that I needed

Re: Issue with data merge, losing information from one dataset that I needed

Re: Issue with data merge, losing information from one dataset that I needed

Re: Issue with data merge, losing information from one dataset that I needed

Re: Issue with data merge, losing information from one dataset that I needed

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!