- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
How to use proc sql to merge 2 datasets with similar variables (4 different variables in one dataset) but want to match it by a Unique_ID ?
I have 2 datasets dataset1 has about 20 variables including a variabled called Unique_ID.
dataset2 has about 24 variables also including Unique_ID but has 4 other variables that I want to merge into a table with.
I want to merge these tables and match it by Unique_ID.
I've tried the code below but when I do, it will multiply my observations which juristically increases my rows and I do not want that. What am I doing wrong?
proc sql;
create table test as
select *
from dataset1, dataset2
where dataset1.Unique_ID=dataset2.Unique_ID;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If the number of rows increase in TEST, that means that either the column dataset1.Unique_ID or column dataset2.Unique_ID or both do NOT contain unique values for each row. Do you expect them to be unique?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you are expecting your row IDs to be unique, but they are not then you need to investigate why and then decide on a process to make them unique. This is a data understanding issue. You may find PROC FREQ useful to understanding where your duplicates are:
proc freq data = test;
table uniqueID1 uniqueID2 / missing;
run;
Count will be greater than 1 for any duplicates.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
To define the appropriate join you need to know two things:
- What is the relationship between the two tables?
- 1:1, 1:many, many:many
- What do you want as result?
- only the rows where the keys match (inner join)
- all rows from the one table but only matching rows from the left table (left or right join)
- all rows from both tables (full join)
- ....
Below two links are hopefully useful to you.
https://documentation.sas.com/doc/en/vdmmlcdc/8.1/sqlproc/p1bk7i6jqseje7n1lifcip8kzhpp.htm
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@rebelde52 wrote:
This is helpful! Question though. I checked my data freq by the unique_id on a specific id #. For example dateset1 had 8 observations for that id#. Dataset2 had 16 for that specific id#. Joineddataset had 128 observations for that specific id#. Therefor it seemed like it multiplied? Why is that? Do I just deduplicate it?
That means you've got a many:many relationship between your tables and unique_id is not unique.
Even a SQL inner join will return 8X16 rows which is what you observe.
How would you deduplicate? You would need for at least one of the tables all variable values of interest to be identical for a specific "unique" id. Is that the case? If not then which values would you pick in order to "dedup"?