This topic is solved and locked.
Renoux
Obsidian | Level 7

Hi.

 

Would you please help me figure out why I have missing values where there should not be any?

 

I work with SAS 9.4 TS Level 1M3, X64_SRV12 platform (French-language installation).

 

My table A.sas7bdat contains 5,746,903,189 rows and is sorted by the column "key". It also contains a column "id" with a unique value in each row and no missing values.

 

My table B.sas7bdat contains 83 rows, sorted by the column "key", with a different value of "key" in each row. It also contains a column "target".

 

I observe missing values for "C2.target" where there should not be any when I run:

proc sql;
  create table C1 as
    select B.target, A.id
    from A left join B
    on A.key = B.key;
  create table C2 as
    select target, count(id) as my_count
    from C1
    group by target;
  create table C3 as
    select sum(my_count) as my_count
    from C2;
quit;

 

There are no missing values for “D2.target” when I run:

proc sql;
  create table D1 as
    select key, count(id) as my_count
    from A
    group by key;
  create table D2 as
    select B.target, sum(my_count) as my_count
    from D1 left join B
    on D1.key = B.key
    group by target;
  create table D3 as
    select sum(my_count) as my_count
    from D2;
quit;

 

I checked that C3 is identical to D3. C2 should be identical to D2, but it is not.

 

The only table I really need is C1.

 

No error or warning appears in my log. The bug is still there when I re-run my program several times after coding a few details differently. This is not my actual code, but a simplification for clarity's sake.

 

I guess this bug could be caused by the way SAS handles very large tables. How can I get my SAS software to handle this correctly?

 

Best regards.

 

Axel Renoux, statistician


9 REPLIES
Astounding
PROC Star

It sounds like some values of KEY appear in Table B, but not in Table A.  You will have to verify whether that is the case.

Renoux
Obsidian | Level 7
It is "A left join B". It takes all values in A but not necessarily all values in B.

My guess is that the joining procedure may screen through values of "A.key" without reading all 5 billion rows and hence miss some values. But this is just a wild guess. Is that even possible?
Renoux
Obsidian | Level 7
Thank you so much for helping me! But I don't think this is the cause of my problem.
SASKiwi
PROC Star

Is target in both tables or just in B? If it is in both, try changing your GROUP BY to B.target. Is your joining KEY variable character or numeric? If it is numeric and close to 15 digits, you may get precision errors and joins not working as expected.
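To rule out that kind of precision loss, a quick check along these lines may help (a sketch, assuming KEY is numeric; run it on the small table B first, since the same check on A would scan 5.7 billion rows):

```
/* Sketch: count numeric key values that are not exact integers.
   A nonzero count would hint at floating-point precision problems. */
proc sql;
  select count(*) as n_imprecise
  from B
  where key ne round(key);
quit;
```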

Renoux
Obsidian | Level 7
Thank you very much for your reply.
Table A has two columns ("id" and "key") x 5,746,903,189 rows. Table B has two columns ("key" and "target") x 83 rows. The format is 12. for "id", 3. for "key" and 3. for "target". Table C1 is supposed to have two columns ("id" and "target") x 5,746,903,189 rows.
Tom
Super User Tom
Super User

I cannot figure out what you think those various steps are testing.

Can you describe what you are trying to do with each of those steps?

 

What is the actual overall goal of this exercise?

 

Why is A sorted by KEY if ID is the variable that uniquely identifies the rows?

 

Why would you expect C2 not to have missing values of TARGET? You explicitly told SQL in the first step that it was OK to have missing values of B.TARGET by forcing in all observations from A whether or not there was a corresponding observation in B. Plus, TARGET might have missing values in table B even on the observations of A that did match something from B.

Renoux
Obsidian | Level 7
Thank you very much for your reply.
What I have are tables A and B. What I want is table C1.
Both tables A and B are sorted by "key" because the plan is to join them and A has many rows, so this sort order should make the join run quicker.
Further use of table C1 has shown that some of the data in C1 are incorrect. I create C2, C3, D1, D2 and D3 only to prove that the problem comes from joining A and B to create C1. I think 5,746,903,189 rows may be too many for the join. This is why I compare the results of the full join (C1 to C3) to the results of a join on fewer rows (D1 to D3). I see that C2 is different from D2 (there are missing values in C2 but not in D2), while C3 is identical to D3.
If the join on 5,746,903,189 rows worked properly, then counting "id" from C1 grouped by "target" (=C2) would give the same result as first counting "id" from A grouped by "key" (=D1) and only then joining on the 83 remaining rows (=D2); C2 is supposed to be identical to D2. Creating C3 by summing C2 and creating D3 by summing D2 is just a check that I have neither lost nor added any rows in the process. C3=D3, hence the missing values in C2 correspond to non-missing values in D2 that mysteriously vanished when I created C1.
This is supposed to prove that something failed when I joined on 5,746,903,189 rows.
Tom
Super User Tom
Super User

The number of observations should not matter.

If you are really interested in merging the two datasets by KEY then use data step instead of SQL.

So a "left join" would look something like this; testing the IN= variable from the A dataset ensures that only observations contributed to by A are output to WANT.

data want;
  merge A(in=in1) B(in=in2);
  by key;
  if in1;
run;

The number of observations in WANT should be equal to or larger than the number of observations in A.  It will be larger if there are some values of KEY that appear in both A and B but appear multiple times in B.

 

If you are interested in performance and dataset B is really that small, then convert the relationship between KEY and TARGET from dataset B into a FORMAT (or perhaps an INFORMAT, if KEY is a character string and TARGET is numeric) and you will not have to merge them at all, or even sort the large dataset A by KEY.

Say you created a format named TARGET that converted values of KEY into the values in the variable TARGET.  Then your code is just:

data want;
  set a;
  target = put(key, target.);
run;
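Since the thread's tables share the column KEY, the format would map KEY to TARGET, and it can be built from dataset B itself rather than typed out 83 times. A sketch using CNTLIN= (this assumes KEY is numeric; the format name target_fmt and the 8. label width are arbitrary choices):

```
/* Sketch: turn dataset B (key -> target) into a numeric format. */
data target_cntlin;
  set B(rename=(key=start));
  retain fmtname 'target' type 'n';  /* 'n' = numeric format */
  label = put(target, 8.);           /* format labels are character */
  keep fmtname type start label;
run;

proc format cntlin=target_cntlin;
run;
```

Any KEY value not present in B would then come out as its default formatted value rather than a match, so an OTHER range can be added to the CNTLIN dataset if an explicit "no match" label is wanted.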

 

If, when you do the left join between A and B, you get fewer observations, then it is because you have reduced the set of unique observations.  You said before that ID was unique (so has over 5 billion distinct values).  Are you sure?  What do you get from this query:

proc sql;
  create table dups as
    select *, count(*) as nobs
    from A
    group by id
    having count(*) > 1;
quit;

If you get zero observations then ID is unique (at least within A) otherwise look at the observations with the same values of ID and try to figure out what is going on.
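The same uniqueness question can also be answered without materializing a duplicates table (a sketch; counting distinct values over 5.7 billion rows will still take a while):

```
/* Sketch: ID is unique within A exactly when n_rows = n_ids */
proc sql;
  select count(*)          as n_rows,
         count(distinct id) as n_ids
  from A;
quit;
```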

 

 

Renoux
Obsidian | Level 7
The LEFT JOIN is supposed to force in all observations from A whether or not there is a corresponding observation in B.
