BookmarkSubscribeRSS Feed
rebelde52
Fluorite | Level 6

How to use proc sql to merge 2 datasets with similar variables (4 different variables in one dataset) but want to match it by a Unique_ID ?

 

I have 2 datasets dataset1 has about 20 variables including a variabled called Unique_ID. 

 

dataset2 has about 24 variables also including Unique_ID  but has 4 other variables that I want to merge into a table with. 

 

I want to merge these tables and match it by Unique_ID. 

 

I've tried the code below but when I do, it will multiply my observations which juristically increases my rows and I do not want that. What am I doing wrong?

proc sql;
    create table test as
    select *
    from dataset1, dataset2
    where dataset1.Unique_ID=dataset2.Unique_ID;
    quit;
6 REPLIES 6
SASKiwi
PROC Star

If the number of rows increase in TEST, that means that either the column dataset1.Unique_ID or column dataset2.Unique_ID or both do NOT contain unique values for each row. Do you expect them to be unique?

rebelde52
Fluorite | Level 6
I excpect for both of them to have it. But if one or neither have it then how should i approach?
SASKiwi
PROC Star

If you are expecting your row IDs to be unique, but they are not then you need to investigate why and then decide on a process to make them unique. This is a data understanding issue. You may find PROC FREQ  useful to understanding where your duplicates are:

proc freq data = test;
  table uniqueID1 uniqueID2 / missing;
run;

Count will be greater than 1 for any duplicates.

Patrick
Opal | Level 21

To define the appropriate join you need to know two things:

  1. What is the relationship between the two tables?
    1. 1:1, 1:many, many:many
  2. What do you want as result?
    1. only the rows where the keys match (inner join)
    2. all rows from the one table but only matching rows from the left table (left or right join)
    3. all rows from both tables (full join)
    4. ....

Below two links are hopefully useful to you.

https://blogs.sas.com/content/sastraining/2015/05/27/life-saver-tip-for-comparing-proc-sql-join-with... 

https://documentation.sas.com/doc/en/vdmmlcdc/8.1/sqlproc/p1bk7i6jqseje7n1lifcip8kzhpp.htm 

rebelde52
Fluorite | Level 6
This is helpful! Question though. I checked my data freq by the unique_id on a specific id #. For example dateset1 had 8 observations for that id#. Dataset2 had 16 for that specific id#. Joineddataset had 128 observations for that specific id#. Therefor it seemed like it multiplied? Why is that? Do I just deduplicate it?
Patrick
Opal | Level 21

@rebelde52 wrote:
This is helpful! Question though. I checked my data freq by the unique_id on a specific id #. For example dateset1 had 8 observations for that id#. Dataset2 had 16 for that specific id#. Joineddataset had 128 observations for that specific id#. Therefor it seemed like it multiplied? Why is that? Do I just deduplicate it?

That means you've got a many:many relationship between your tables and unique_id is not unique. 

Even a SQL inner join will return 8X16 rows which is what you observe.

 

How would you deduplicate? You would need for at least one of the tables all variable values of interest to be identical for a specific "unique" id. Is that the case? If not then which values would you pick in order to "dedup"?

 

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1018 views
  • 1 like
  • 3 in conversation