BookmarkSubscribeRSS Feed
lindst801
Fluorite | Level 6

Hello, I am having trouble with merging two dataset of unequal sample sizes but need to retain all variables of both datasets. 

 

dataset a: N=1,500 and number of variables = 450

dataset b: N=150,000 and number of variables = 8

 

I want to do the following:

 

  • Do a one to one merge by a variable (VAR1)
  • I want all the information in dataset b (VAR2-VAR8) to merge with dataset a (Final N=1,500)

Code:

proc sort data=a;
by VAR1; run;
proc sort data=b;
by VAR1; run;

data final_data;
merge a(in=a) b;
if a;
by VAR1;
run;

Even though I was able to get the right N (1,500), my final dataset has VAR2-VAR8 variables missing from dataset b.
Is there a way to avoid this error? I need that information in my final dataset.

 

 

VAR1

VAR2

VAR3

VAR4

VAR5

VAR6

VAR7

VAR8

101

 

 

 

 

 

 

 

104

 

 

 

 

 

 

 

115

 

 

 

 

 

 

 

124

 

 

 

 

 

 

 

135

 

 

 

 

 

 

 

137

 

 

 

 

 

 

 

141

 

 

 

 

 

 

 

169

 

 

 

 

 

 

 

16 REPLIES 16
data_null__
Jade | Level 19
I'm guessing the BY variable VALUES in A and B are different even though they may LOOK the same. Perhaps leading spaces if character. Also I usually put the sub-setting IF after the BY but I don't' think that is the problem.
lindst801
Fluorite | Level 6

Is this referring to formatting, etc.? VAR1 in both dataset A and B are the same (Numerical, length=8, format z5.). 

ballardw
Super User

@lindst801 wrote:

Is this referring to formatting, etc.? VAR1 in both dataset A and B are the same (Numerical, length=8, format z5.). 


No. It is the values of Var1. If there are no values that actually match then the added variables appear but since no matches occur they are missing.

Example:

data a;
   input var1 varb varc;
datalines;
1  1 2 3 
2  2 4 6
;

data b;
   input var1 var2 var3;
datalines;
22  1 2 3
33  4 6 8
;
run;

data example1;
  merge a (in=a)  b;
  if a;
  by var1;
run;

This example is small enough that you can tell by eye that no Var1 values are the same in both sets.

Remove the "if a;" to see a different result.

 

lindst801
Fluorite | Level 6

That is not how the dataset I am working with is set up. This is a case where it is the actual following (pulling from your example) 

 

data a;
   input var1 varb varc;
datalines;
1  1 2 3 
2  2 4 6
;

data b;
   input var1 var2 var3;
datalines;
2  1 2 3
3  4 6 8
;
run;

 The final result from the program that they values did match (VAR1), the issue is that the reminder of the dataset b data is missing (VAR2-VAR8 is missing).

Once I had removed 'if a', not only do I not get matches (I do not get a final desired n of 1,500 but instead 150,000) but VAR2-VAR8 is set to missing as well. 

Reeza
Super User

Rather than hypotheticals, if you can show your full code and log that would be easier. Are there any warnings or notes in the log?

 

For the values to be empty it means one of two things is happening:

  1. There is no match found in the other table so they get assigned empty
  2. The match is found on VAR1 but all the variables in the data set are empty for some reason (data collection issues?)

It's usually a variant on 1, where the data won't match because one is  "A" versus "a", where the case is different. 

I know you say  you have numeric but given what you've stated these are the likely possibilities. 

 

That being said, you could have not told us something that is affecting the results - therefore the request for the actual code and log to help diagnose the issue.

 


@lindst801 wrote:

That is not how the dataset I am working with is set up. This is a case where it is the actual following (pulling from your example) 

 

data a;
   input var1 varb varc;
datalines;
1  1 2 3 
2  2 4 6
;

data b;
   input var1 var2 var3;
datalines;
2  1 2 3
3  4 6 8
;
run;

 The final result from the program that they values did match (VAR1), the issue is that the reminder of the dataset b data is missing (VAR2-VAR8 is missing).

Once I had removed 'if a', not only do I not get matches (I do not get a final desired n of 1,500 but instead 150,000) but VAR2-VAR8 is set to missing as well. 


 

ballardw
Super User

@lindst801 wrote:

That is not how the dataset I am working with is set up. This is a case where it is the actual following (pulling from your example) 

 

data a;
   input var1 varb varc;
datalines;
1  1 2 3 
2  2 4 6
;

data b;
   input var1 var2 var3;
datalines;
2  1 2 3
3  4 6 8
;
run;

 The final result from the program that they values did match (VAR1), the issue is that the reminder of the dataset b data is missing (VAR2-VAR8 is missing).

Once I had removed 'if a', not only do I not get matches (I do not get a final desired n of 1,500 but instead 150,000) but VAR2-VAR8 is set to missing as well. 


You have shown exactly 0 records of your actual data.

I will bet a short stack of $$$ that your data actual values do not match.

Or that var2-var8 are missing on the ones that do match. If the same variables appear in both sets the values in B would overwrite the ones from A. See this example:

data a;
   input var1 var2 var3;
datalines;
1  1 2 3 
2  2 4 6
;

data b;
   input var1 var2 var3;
datalines;
1 . . .
2 . . . 
;
run;

data example2;
   merge a (in=a) b;
   if a;
   by var1;
run;

Here I have forced B to have missing values. Then on the matching records in A those missing values REPLACE the values in A.

Note that the order of the data sets on the Merge statement controls this behavior. Consider:

data example3;
   merge b  a (in=a);
   if a;
   by var1;
run;

So items to consider: Which variables are in which sets, order of merge, actual values of the BY variables.

Which is why Proc contents information was requested.

Tom
Super User Tom
Super User

This is key observation.

Are you doing a 1-to-many or many-to-many merge? 

Does either of the dataset have multiple observations per value of VAR1?  (If you did the counts I suggested before does the number of observations match the number of distinct values?)

 

When doing a 1 to many match (essentially a lookup) the variables you are adding from the "1" dataset must not exist in the "many" dataset.  Otherwise only the values on the first instance of this by variable group is updated.  When the second, third etc observation for that group is read from the "many" dataset then the values read for those variables will overwrite whatever was read from the "1" dataset.

 

 

Tom
Super User Tom
Super User

If you are using the Z5. format to display the numbers and they are not actually integers then they will LOOK the same, whether or not they are the same.

 

You could try rounding the value to integers and see if that changes anything.  Use the ROUND() function with 1 for the second argument.

lindst801
Fluorite | Level 6

I have to use the z5. format to have padding for dataset A (Example, dataset A originally had '4056' when it should be '04056'). 

Since the information in VAR1 are already integers (ZIP codes), I am not sure rounding to an integer value would help. 

Reeza
Super User
If both were numbers that wouldn't matter and formats don't affect merging.
And the log would have had errors if that was the case as well.

Please post the proc contents from the data sets you're trying to merge.
lindst801
Fluorite | Level 6

Unfortunately, there has been no documentation in my log that there was an error. I only was able to find out that it was an error by conducting a proc print check. 

 

Here is further information from PROC CONTENTS I can provide:

 

Data set name

Work.a

Observations

1500

Member type

Data

Variables

450

Engine

V9

Indexes

0

Created

------

Observation length

----

Last Modified

------

Deleted observations

0

Protection

 

Compressed

NO

Data set type

 

Sorted

YES

Label

 

 

 

Data Representation

----------

 

 

Encoding

--------

 

 

 

Sort information

Sortedby

VAR1

Validated

YES

Character Set

ANSI

 

Dataset b: 

Data set name

Work.b

Observations

150000

Member type

Data

Variables

8

Engine

V9

Indexes

0

Created

------

Observation length

----

Last Modified

------

Deleted observations

0

Protection

 

Compressed

NO

Data set type

 

Sorted

YES

Label

 

 

 

Data Representation

----------

 

 

Encoding

--------

 

 

 

Sort information

Sortedby

VAR1

Validated

YES

Character Set

ANSI

 

Reeza
Super User
You need to provide the format type and label information for VAR1 in each data set as well.

Reeza
Super User
Or work with Tech Support (if you have a paid license) as they can take and handle private data, anything posted here is public.
Tom
Super User Tom
Super User

If the values are ZIP codes then you should be storing them as strings not numbers.  There is no meaning to the mean of a zip code.  If someone has mistakenly stored them as a number you can use the Z format with the PUT() function to generate a new variable with the strings that include leading zeros. 

 

If you did let SAS autoconvert numbers into strings, for example with code like:

length new_var $20;
new_var=old_var;

Then SAS will right align the digit string it generates to store into the character variable.

477   data test;
478     x=1234;
479     length y $20;
480     y=x;
481     len=length(y);
482     put x= y=:$quote. len=;
483   run;

NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column).
      480:5
x=1234 y="                1234" len=20
NOTE: The data set WORK.TEST has 1 observations and 3 variables.

 

Check how many matches there are between your datasets to get a better sense of what your issue is.

proc sql;
select count(*) as nobs
     , count(var1) as n_nonmissing
     , count(distinct var1) as n_values
from A
;
select count(*) as nobs
     , count(var1) as n_nonmissing
     , count(distinct var1) as n_values
from B
;
select count(*) as nobs
     , count(var1) as n_nonmissing
     , count(distinct var1) as n_values
from B
where VAR1 in (select var1 from A)
;
select count(*) as nobs
     , count(var1) as n_nonmissing
     , count(distinct var1) as n_values
from A
where VAR1 in (select var1 from B)
;

 

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 16 replies
  • 3276 views
  • 0 likes
  • 5 in conversation