Re: How to use proc sql to merge and match by unique ID?

rebelde52 · Posted 08-31-2023 05:00 PM

How to use proc sql to merge 2 datasets with similar variables (4 different variables in one dataset) but want to match it by a Unique_ID ?

I have 2 datasets dataset1 has about 20 variables including a variabled called Unique_ID.

dataset2 has about 24 variables also including Unique_ID but has 4 other variables that I want to merge into a table with.

I want to merge these tables and match it by Unique_ID.

I've tried the code below but when I do, it will multiply my observations which juristically increases my rows and I do not want that. What am I doing wrong?

proc sql;
    create table test as
    select *
    from dataset1, dataset2
    where dataset1.Unique_ID=dataset2.Unique_ID;
    quit;

SASKiwi · Posted 08-31-2023 05:25 PM

If the number of rows increase in TEST, that means that either the column dataset1.Unique_ID or column dataset2.Unique_ID or both do NOT contain unique values for each row. Do you expect them to be unique?

rebelde52 · Posted 08-31-2023 05:53 PM

I excpect for both of them to have it. But if one or neither have it then how should i approach?

SASKiwi · Posted 08-31-2023 06:09 PM

If you are expecting your row IDs to be unique, but they are not then you need to investigate why and then decide on a process to make them unique. This is a data understanding issue. You may find PROC FREQ useful to understanding where your duplicates are:

proc freq data = test;
  table uniqueID1 uniqueID2 / missing;
run;

Count will be greater than 1 for any duplicates.

Patrick · Posted 08-31-2023 07:36 PM

To define the appropriate join you need to know two things:

What is the relationship between the two tables?
1. 1:1, 1:many, many:many
What do you want as result?
1. only the rows where the keys match (inner join)
2. all rows from the one table but only matching rows from the left table (left or right join)
3. all rows from both tables (full join)
4. ....

Below two links are hopefully useful to you.

https://blogs.sas.com/content/sastraining/2015/05/27/life-saver-tip-for-comparing-proc-sql-join-with...

https://documentation.sas.com/doc/en/vdmmlcdc/8.1/sqlproc/p1bk7i6jqseje7n1lifcip8kzhpp.htm

rebelde52 · Posted 08-31-2023 07:53 PM

This is helpful! Question though. I checked my data freq by the unique_id on a specific id #. For example dateset1 had 8 observations for that id#. Dataset2 had 16 for that specific id#. Joineddataset had 128 observations for that specific id#. Therefor it seemed like it multiplied? Why is that? Do I just deduplicate it?

Patrick · Posted 08-31-2023 08:17 PM

@rebelde52 wrote:
This is helpful! Question though. I checked my data freq by the unique_id on a specific id #. For example dateset1 had 8 observations for that id#. Dataset2 had 16 for that specific id#. Joineddataset had 128 observations for that specific id#. Therefor it seemed like it multiplied? Why is that? Do I just deduplicate it?

That means you've got a many:many relationship between your tables and unique_id is not unique.

Even a SQL inner join will return 8X16 rows which is what you observe.

How would you deduplicate? You would need for at least one of the tables all variable values of interest to be identical for a specific "unique" id. Is that the case? If not then which values would you pick in order to "dedup"?

How to use proc sql to merge and match by unique ID?