I have two datasets.
Dataset A:
id2000 vhouse2000 id2001 vhouse2001
1 100 3 500
2 200 4 600
3 300
7 700
5 400 8 800
6 500
Dataset B:
id2000 vhouse2000 vstock2000 id2001 vhouse2001 vstock2001
1 10 3 50
2 20 4 60
3 30 5 70
4 40 6 80
5 50 7 90
6 60 8 100
I want to merge two datasets like this:
Merged dataset:
id2000 vhouse2000 vstock2000 id2001 vhouse2001 vstock2001
1 100 10 3 500 50
2 200 20 4 600 60
3 300 30 5 70
4 40 6 80
5 400 50 7 700 90
6 500 60 8 800 100
Please help me solve this problem. Thanks!
@ph7see wrote:
I have two datasets.
Dataset A:
id2000 vhouse2000 id2001 vhouse2001
1 100 3 500
2 200 4 600
3 300
7 700
5 400 8 800
6 500
Dataset B:
id2000 vhouse2000 vstock2000 id2001 vhouse2001 vstock2001
1 10 3 50
2 20 4 60
3 30 5 70
4 40 6 80
5 50 7 90
6 60 8 100
I want to merge two datasets like this:
Merged dataset:
id2000 vhouse2000 vstock2000 id2001 vhouse2001 vstock2001
1 100 10 3 500 50
2 200 20 4 600 60
3 300 30 5 70
4 40 6 80
5 400 50 7 700 90
6 500 60 8 800 100
Please help me solve this problem. Thanks!
From what you show this has a chance of working:
data merged; merge data_b data_a (drop=Id2000); run;
You don't say anything about which "by variables" you might want to use from your subject line. If one of them is ID2000 you can't because it is MISSING. If you do not have a by variable merge will match the first row of the sets, then the second and so forth.
If your datasets are not in a matching order and you want to use any of those ID variables with missing values you need to fix the missing values before attempting to merge as you would need to sort them and the missing values would make the order almost certainly incorrect for a simple merge.
The order of datasets on the MERGE statement is important because the variables with the same names will have the values of the LAST ordered data set on the merge. Which is why I drop the ID2000 variable, the missing value would overwrite the one from data set b.
BTW, having what appear to be Year values in variable names for multiple similar named variables, vhouse2000 and vhouse2001 for example, often indicates a poor data structure. It would generally be better in the long run to have a separate variable with the Year and then just plain id, vhouse, vstock,
Thank you so much for your help.
Id2000 variables in both data A and data B should be matched and update Vhouse2000 variable in data B based on data A.
Similarly, Vhouse2001 variable needs to be updated from data A.
The problem is that even though id2000=1 and id2001=3 are in the first low together, they are different persons.
I ran your code to merge two datasets.
The result looks like this:
id2000 vhouse2000 vstock2000 id2001 vhouse2001 vstock2001
1 100 10 3 500 50
2 200 20 4 600 60
3 300 30 70
4 40 7 700 80
5 400 50 8 800 90
6 500 60 100
Everything for year2000 (id2000, vhouse2000, and vstock2000) is good. However, two observations in id2001 are deleted (id2001 = 5 and 6) and as a result, the locations of id2001 =7 and 8 are wrong.
@ph7see wrote:
Thank you so much for your help.
Id2000 variables in both data A and data B should be matched and update Vhouse2000 variable in data B based on data A.
Can't be done. Your data A did not have a value of 4 for Id2000. So there is not a way to match on the value of a variable that is missing. A variable to match on would have to be present to get expected results because to match BY value the values are going to need to be sorted and missing will not sort in the order you would want from the given example data.
Or fix your posted example.
Your data is so poor I wasn't interested in writing a data step from scratch to recreate it, so the code is not tested and was a close I could get some of the values.
Better yet provide data step that replicates your data set. Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.
Thank you for suggestions. I will I think I forgot writing a value of 4 in the column of id2000. Thank you for pointing out.
Data one;
input id2000 vhouse2000 id2001 vhouse2001;
datalines;
1 100 3 500
2 200 4 600
3 300 . .
4 . 7 700
5 400 8 800
6 500 . .
;
run;
Data two;
input id2000 vhouse2000 vstock2000 id2001 vhouse2001 vstock2001;
datalines;
1 . 10 3 . 50
2 . 20 4 . 60
3 . 30 5 . 70
4 . 40 6 . 80
5 . 50 7 . 90
6 . 60 8 . 100
;
run;
data merged;
merge two one (drop=id2000);
run;
For this exact data:
data want; update two one (drop=id2001); by id2000; run;
Update is not the same as merge. The default behavior is that only values present in the update data set (the second on the Update statement) replace values in the base data set (the first one on the update statement).
In your example two has the values of id2001 you appear to want in the output so drop them from set One to avoid overwriting the data.
Caution: This approach will not work if data set Two has repeats of values for Id2000. Update is intended to change values for only a master data set with one record for each value of the BY variable(s).
If you get third data set with 2002 values I pray for you as this kind of structure gets harder to work with when there are more similar values.
I appreciate your advice. The original data's structure is poor like you mentioned. I will try to change the formatting. Thanks again.
Before you proceed any further, transpose the datasets so that the years have their own column alongside id, vhouse and vstock. Then it's a simple join or merge by id and year.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.