Solved: Left join doubling rows

bajtan · Posted 08-04-2022 09:17 AM

Hello,

I am trying to create a table using proc sql and I have a problem. Some but not all rows are getting doubled in the process. The code looks as follows:

PROC SQL;
  CREATE TABLE WORK.STEP AS
  SELECT
    t1.group,
    t1.time,
    t1.profit
  FROM WORK.INPUT_FILE t1
  LEFT JOIN WORK.GROUPS t2
    ON (t2.g0 = "*"
        OR (find(t1.group, strip(t2.g0))
            AND (missing(t2.g1) OR find(t1.group, strip(t2.g1)))
            AND (missing(t2.g2) OR find(t1.group, strip(t2.g2)))));
QUIT;

The main point is that the t1.group variable has a lot of different values, and I want to use the summary procedure later to aggregate some of the profit numbers. The groups I want to use are stored in three variables in table two (g0, g1, g2). For some groups, though, the SQL procedure duplicates rows. Not all of them, just a few that have nothing particular in common. Does anyone know why this might be happening?

Thanks in advance,

Bajtan

Tom · Posted 08-04-2022 09:44 AM

An SQL join will create all combinations of matching records. So if one dataset contributes N observations and the other contributes M observations that match those N observations, the result is N x M observations. Assuming there are no duplicates in the input "T1" dataset, more than one observation from "T2" must meet your ON condition for some of the observations from "T1".
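
To see the effect in isolation, here is a small sketch on made-up data. The tables WORK.DEMO_A and WORK.DEMO_B and their columns are hypothetical and only serve as an illustration; they are not the tables from the question.

```sas
/* Hypothetical toy data: a single input row ... */
data work.demo_a;
  input id $ profit;
  datalines;
x1 100
;
run;

/* ... and two patterns that both occur in the id "x1" */
data work.demo_b;
  input pattern $;
  datalines;
x
1
;
run;

proc sql;
  /* FIND returns a position > 0 for both patterns, so both rows of
     DEMO_B satisfy the ON condition and the one row of DEMO_A is
     returned twice, once per matching pattern. */
  select a.id, a.profit, b.pattern
  from work.demo_a a
  left join work.demo_b b
    on find(a.id, strip(b.pattern)) > 0;
quit;
```

The result has two rows for x1, one for each pattern that matched.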

For that query, since you are only selecting variables from the T1 alias, why not just add the DISTINCT keyword?

CREATE TABLE WORK.STEP AS
SELECT  distinct
 t1.group
,t1.time
...
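
If you want to see which input rows are affected, something along these lines could help. It is only a sketch that reuses the table names, column names and ON condition from the question; WORK.MATCH_COUNTS and n_rows are names made up here for illustration.

```sas
proc sql;
  /* For each input row, count how many rows the join produces.
     A count above 1 means more than one WORK.GROUPS row satisfied
     the ON condition (or, if WORK.INPUT_FILE itself contains
     identical rows, that those were grouped together here). */
  create table work.match_counts as
  select t1.group,
         t1.time,
         t1.profit,
         count(*) as n_rows
  from work.input_file t1
  left join work.groups t2
    on (t2.g0 = "*"
        or (find(t1.group, strip(t2.g0))
            and (missing(t2.g1) or find(t1.group, strip(t2.g1)))
            and (missing(t2.g2) or find(t1.group, strip(t2.g2)))))
  group by t1.group, t1.time, t1.profit
  having count(*) > 1;
quit;
```

Any group that shows up in WORK.MATCH_COUNTS is matched by more than one row of WORK.GROUPS, which is where the extra rows come from.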

bajtan · Posted 08-16-2022 08:17 AM

I tried the DISTINCT keyword and it looked like it helped. But once I got to a table with a group that has more than one part (t2.g1 wasn't missing), the problem was back. There is probably a problem with the condition after the ON keyword, but I can't figure it out. Any ideas?
