Solved: How to Compare variables of two datasets and Merge

hjjijkkl · Posted 05-06-2021 03:07 PM

I have two big datasets and I wanted to compare their variables and merge them. I know there are one or more variables that might be in one or the other. So, I was wondering what is the best way to identify which variables are in common and which variables are in one or the other. After identifying this, I want to merge the two datasets into one dataset. what is the best way to follow and the merging. If someone can show me an example that would be wonderful. Please help! Thank you!

ballardw · Posted 05-06-2021 03:18 PM

Proc compare is likely the easiest way to get a summary of the variable names/ types and differences that are involved.

The question next becomes do your mean to "stack" the data, one set after the other, or align values side-by-side such as on a common identifier?

SAS has a very particular meaning for MERGE and we have many people new to SAS using "merge" for combining data when they actually mean Append (stack vertically).

Which you are doing makes a difference because the side-by-side MERGE will replace values of commonly named variables. So if you need "sales" variable value from both sets there would be additional work needed.

Best is provide small examples of your data and what you expect the result to look like. This can be done in text with 3 or 4 variables from each set, with at least one variable different, for each and then show the expected result.

Example of two data sets.

Data one;
   input id $  x y;
datalines;
abc 1 2
pdq 22 33
;

data two;
   input id $ x z;
datalines;
abc 2 55
zzz  33 888
;

What would the result you are looking for be with these two examples?

View solution in original post

Reeza · Posted 05-06-2021 03:17 PM

Merge is dependent on first step. Have you looked into PROC COMPARE?
The first part of the output will have what you need - ie which variables are in which data set and which are missing. It will also compare types and formats.

proc compare data=data1 compare=data2;
run;

See Part one of the output illustrated here:
https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/p1k00d45g03uv8n1bfx3d20breg6.htm

ballardw · Posted 05-06-2021 03:18 PM

Proc compare is likely the easiest way to get a summary of the variable names/ types and differences that are involved.

The question next becomes do your mean to "stack" the data, one set after the other, or align values side-by-side such as on a common identifier?

SAS has a very particular meaning for MERGE and we have many people new to SAS using "merge" for combining data when they actually mean Append (stack vertically).

Which you are doing makes a difference because the side-by-side MERGE will replace values of commonly named variables. So if you need "sales" variable value from both sets there would be additional work needed.

Best is provide small examples of your data and what you expect the result to look like. This can be done in text with 3 or 4 variables from each set, with at least one variable different, for each and then show the expected result.

Example of two data sets.

Data one;
   input id $  x y;
datalines;
abc 1 2
pdq 22 33
;

data two;
   input id $ x z;
datalines;
abc 2 55
zzz  33 888
;

What would the result you are looking for be with these two examples?

hjjijkkl · Posted 05-06-2021 03:36 PM

Actually you are rights. I need to append(stack vertically) than merging the two data sets. This an example I can give.
ID Sex age
1 F 26
2 F 43

ID Sex age Education
4 M 33 BA
6 F 67 PHD

ID Sex age Education
1 F 26
2 F 43
4 M 33 BA
6 F 67 PHD

Reeza · Posted 05-06-2021 05:12 PM

Then this is all you need.
You can use the INDSNAME option to track which record comes from which data set, if that's needed.

Data want;
length _source source $50.;
Set Table1 Table2 INDSNAME = _source;
source = source;

run;

Sajid01 · Posted 05-07-2021 08:57 AM

As I understand it, your question has two parts:
1.It is desired to identify the variables which are common and which are unique to each of the two datasets.

2.Merge the two datasets.

While there are many approaches, I would may use the following:

Part 1 :Identifying common and unique variables

The following code identifies which variable is present in each dataset. You will have yes if the variable is present in the data set .
You can modify the code to suite your needs. I am giving a proof of concept or demo.

I am using sashelp.class and sashelp.shoes datasets. Again it is for demo purpose. (different people may have different ideas)

proc contents data=sashelp.class out=class (keep=Name);
run;
proc contents data=sashelp.shoes out=shoes (keep=Name);
run;

data var_comparison (keep=Name class shoes);
merge class (in=a) shoes(in=b);
by Name;
if a then Class="Yes" ;
if b then Shoes="Yes";
run;

The output will be something like this. If the variables are common to both, you will have two "YES" in a row. In the example there are no common variables, hence only one yes.

Part 2: Merging the two datasets
This question has been already answered and there is a lot in the literature.
There is one brief review I would suggest you have a look.

https://www.lexjansen.com/nesug/nesug11/ds/ds03.pdf

How to Compare variables of two datasets and Merge

Re: How to Compare variables of two datasets and Merge

Re: How to Compare variables of two datasets and Merge

Re: How to Compare variables of two datasets and Merge

Re: How to Compare variables of two datasets and Merge

Re: How to Compare variables of two datasets and Merge

Re: How to Compare variables of two datasets and Merge

How to Compare variables of two datasets and Merge

Re: How to Compare variables of two datasets and Merge

Re: How to Compare variables of two datasets and Merge

Re: How to Compare variables of two datasets and Merge

Re: How to Compare variables of two datasets and Merge

Re: How to Compare variables of two datasets and Merge

Re: How to Compare variables of two datasets and Merge

Ready to join fellow brilliant minds for the SAS Hackathon?