Solved: Re: Proc sql joins: the number of rows doesn't add up!

Haemoglobin17 · Posted 01-09-2023 11:42 AM

Hi everyone!

I have a dataset (A) of approximately 2120000, I wanted to left join a couple of variables from a dataset B with a simple proc sql, when I realized that the number of rows in the resulting dataset (C) was 2140000.

I just wanted to add the two columns to the first rows keeping all the records from the first column. Is there a way?

I also tried a inner join between the first dataset 2120000 (A) rows and the table that I produced (2140000 rows) (C) and I obtained a dataset with 4 million of records! It is so strange! With a inner join the number of rows should drop, isn't it?

Is there any way I can solve this issue?

Thanks

Reeza · Posted 01-09-2023 11:51 AM

You have duplicates on your key joining variables in one or both data sets. You likely need to modify your join criteria somehow.

Table 1

ID	Year
ABC	2001
ABC	2001
DEF	2002

Table 2

ID	Year
ABC	2001
DEF	2002
DEF	2002

Join (left or inner)

ID	YEAR
ABC	2001 (table 1)
ABC	2001 (table 1)
DEF	2002 (table 2)
DEF	2002 (table 2)

@Haemoglobin17 wrote:

Hi everyone!

I have a dataset (A) of approximately 2120000, I wanted to left join a couple of variables from a dataset B with a simple proc sql, when I realized that the number of rows in the resulting dataset (C) was 2140000.

I just wanted to add the two columns to the first rows keeping all the records from the first column. Is there a way?

I also tried a inner join between the first dataset 2120000 (A) rows and the table that I produced (2140000 rows) (C) and I obtained a dataset with 4 million of records! It is so strange! With a inner join the number of rows should drop, isn't it?

Is there any way I can solve this issue?

Thanks

View solution in original post

PaigeMiller · Posted 01-09-2023 11:49 AM

A possible explanation is that dataset B has multiple records for some values of the joined variables.

If that is the case, then using select distinct instead of select might fix the problem. Or it might not fix the problem if those multiple records in data set B have different values.

--
Paige Miller

Haemoglobin17 · Posted 01-09-2023 11:58 AM

Hi Paige,
Thank you for you message! I put a distinct option after the select and the number are nearer to what I wanted: when left joining A to B, C is 2130000 and when inner joining C with A, I obtain a number of rows that is just a bit more than A.
I gave for granted the proc sql joins and considered basic skills, but maybe I didn't understand very much!!

Reeza · Posted 01-09-2023 11:51 AM

You have duplicates on your key joining variables in one or both data sets. You likely need to modify your join criteria somehow.

Table 1

ID	Year
ABC	2001
ABC	2001
DEF	2002

Table 2

ID	Year
ABC	2001
DEF	2002
DEF	2002

Join (left or inner)

ID	YEAR
ABC	2001 (table 1)
ABC	2001 (table 1)
DEF	2002 (table 2)
DEF	2002 (table 2)

@Haemoglobin17 wrote:

Hi everyone!

I have a dataset (A) of approximately 2120000, I wanted to left join a couple of variables from a dataset B with a simple proc sql, when I realized that the number of rows in the resulting dataset (C) was 2140000.

I just wanted to add the two columns to the first rows keeping all the records from the first column. Is there a way?

I also tried a inner join between the first dataset 2120000 (A) rows and the table that I produced (2140000 rows) (C) and I obtained a dataset with 4 million of records! It is so strange! With a inner join the number of rows should drop, isn't it?

Is there any way I can solve this issue?

Thanks

Haemoglobin17 · Posted 01-09-2023 12:28 PM

Hi Reeza!
Thank you for your message!
I added a second key in the joining and a distinct in the select.
Now I loose only 42 cases. It's a big improvement, but I still a bit troubled for these numbers that doesn't match 😮

PaigeMiller · Posted 01-09-2023 12:30 PM

The reason the numbers don't match is that your data doesn't allow them to match. Data set B has replicates that interfere with what you are trying to do (which may not be possible). If you examine data set B, you will see these replicates. Or you could run PROC FREQ on the joining variables.

--
Paige Miller

Reeza · Posted 01-09-2023 01:15 PM

Find the ones that don't match and look at each record to identify the difference manually then update your code to account for it.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away