Re: Merge two data sets with multiple by variables

ph7see · Posted 09-24-2021 12:31 PM

I have two datasets.

Dataset A:

id2000 vhouse2000 id2001 vhouse2001

1 100 3 500

2 200 4 600

3 300

7 700

5 400 8 800

6 500

Dataset B:

id2000 vhouse2000 vstock2000 id2001 vhouse2001 vstock2001

1 10 3 50

2 20 4 60

3 30 5 70

4 40 6 80

5 50 7 90

6 60 8 100

I want to merge two datasets like this:

Merged dataset:

id2000 vhouse2000 vstock2000 id2001 vhouse2001 vstock2001

1 100 10 3 500 50

2 200 20 4 600 60

3 300 30 5 70

4 40 6 80

5 400 50 7 700 90

6 500 60 8 800 100

Please help me solve this problem. Thanks!

ballardw · Posted 09-24-2021 01:24 PM

@ph7see wrote:

I have two datasets.

Dataset A:

id2000 vhouse2000 id2001 vhouse2001

1 100 3 500

2 200 4 600

3 300

7 700

5 400 8 800

6 500

Dataset B:

id2000 vhouse2000 vstock2000 id2001 vhouse2001 vstock2001

1 10 3 50

2 20 4 60

3 30 5 70

4 40 6 80

5 50 7 90

6 60 8 100

I want to merge two datasets like this:

Merged dataset:

id2000 vhouse2000 vstock2000 id2001 vhouse2001 vstock2001

1 100 10 3 500 50

2 200 20 4 600 60

3 300 30 5 70

4 40 6 80

5 400 50 7 700 90

6 500 60 8 800 100

Please help me solve this problem. Thanks!

From what you show this has a chance of working:

data merged;
   merge data_b data_a (drop=Id2000);
run;

You don't say anything about which "by variables" you might want to use from your subject line. If one of them is ID2000 you can't because it is MISSING. If you do not have a by variable merge will match the first row of the sets, then the second and so forth.

If your datasets are not in a matching order and you want to use any of those ID variables with missing values you need to fix the missing values before attempting to merge as you would need to sort them and the missing values would make the order almost certainly incorrect for a simple merge.

The order of datasets on the MERGE statement is important because the variables with the same names will have the values of the LAST ordered data set on the merge. Which is why I drop the ID2000 variable, the missing value would overwrite the one from data set b.

BTW, having what appear to be Year values in variable names for multiple similar named variables, vhouse2000 and vhouse2001 for example, often indicates a poor data structure. It would generally be better in the long run to have a separate variable with the Year and then just plain id, vhouse, vstock,

ph7see · Posted 09-24-2021 02:00 PM

Thank you so much for your help.

Id2000 variables in both data A and data B should be matched and update Vhouse2000 variable in data B based on data A.

Similarly, Vhouse2001 variable needs to be updated from data A.

The problem is that even though id2000=1 and id2001=3 are in the first low together, they are different persons.

I ran your code to merge two datasets.

The result looks like this:

id2000 vhouse2000 vstock2000 id2001 vhouse2001 vstock2001

1 100 10 3 500 50

2 200 20 4 600 60

3 300 30 70

4 40 7 700 80

5 400 50 8 800 90

6 500 60 100

Everything for year2000 (id2000, vhouse2000, and vstock2000) is good. However, two observations in id2001 are deleted (id2001 = 5 and 6) and as a result, the locations of id2001 =7 and 8 are wrong.

ballardw · Posted 09-24-2021 04:20 PM

@ph7see wrote:

Thank you so much for your help.

Id2000 variables in both data A and data B should be matched and update Vhouse2000 variable in data B based on data A.

Can't be done. Your data A did not have a value of 4 for Id2000. So there is not a way to match on the value of a variable that is missing. A variable to match on would have to be present to get expected results because to match BY value the values are going to need to be sorted and missing will not sort in the order you would want from the given example data.

Or fix your posted example.

Your data is so poor I wasn't interested in writing a data step from scratch to recreate it, so the code is not tested and was a close I could get some of the values.

Better yet provide data step that replicates your data set. Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.

ph7see · Posted 09-24-2021 04:28 PM

Thank you for suggestions. I will I think I forgot writing a value of 4 in the column of id2000. Thank you for pointing out.

Data one;
input id2000 vhouse2000 id2001 vhouse2001;
datalines;
1 100 3 500
2 200 4 600
3 300 . .
4 . 7 700
5 400 8 800
6 500 . .
;
run;

Data two;
input id2000 vhouse2000 vstock2000 id2001 vhouse2001 vstock2001;
datalines;
1 . 10 3 . 50
2 . 20 4 . 60
3 . 30 5 . 70
4 . 40 6 . 80
5 . 50 7 . 90
6 . 60 8 . 100
;
run;

data merged;
merge two one (drop=id2000);
run;

ballardw · Posted 09-24-2021 05:07 PM

For this exact data:

data want;
   update two one (drop=id2001);
   by id2000;
run;

Update is not the same as merge. The default behavior is that only values present in the update data set (the second on the Update statement) replace values in the base data set (the first one on the update statement).

In your example two has the values of id2001 you appear to want in the output so drop them from set One to avoid overwriting the data.

Caution: This approach will not work if data set Two has repeats of values for Id2000. Update is intended to change values for only a master data set with one record for each value of the BY variable(s).

If you get third data set with 2002 values I pray for you as this kind of structure gets harder to work with when there are more similar values.

ph7see · Posted 09-24-2021 06:37 PM

I appreciate your advice. The original data's structure is poor like you mentioned. I will try to change the formatting. Thanks again.

Kurt_Bremser · Posted 09-26-2021 03:33 AM

Before you proceed any further, transpose the datasets so that the years have their own column alongside id, vhouse and vstock. Then it's a simple join or merge by id and year.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

SAS Innovate 2025: Call for Content

Classroom Training Available!