Hi.
Would you please help me figure why I have missing values where there should not be?
I work with the software SAS 9.4 TS Level 1M3 X64_SRV12 plate-forme (in french).
My table A.sas7bdat contains 5,746,903,189 rows. This table A.sas7bdat is sorted by the column “key”. My table A.sas7bdat also contains a column “id” with a unique value for each row, no missing value.
My table B.sas7bdat contains 83 rows, sorted by the column “key”, with a different value of “key” at each row. My table B.sas7bdat also contains a column “target”.
I observe missing values for “C2.target” where there should not be when I run:
proc sql;
create table C1 as select B.target, A.id from A left join B on A.key=B.key;
create table C2 as select target, count(id) as my_count from C1 group by target;
create table C3 as select sum(my_count) as my_count from C2;
quit;
There are no missing values for “D2.target” when I run:
proc sql;
create table D1 as select key, count(id) as my_count from A group by key;
create table D2 as select B.target, sum(my_count) as my_count from D1 left join B on D1.key=B.key group by target;
create table D3 as select sum(my_count) as my_count from D2;
quit;
I check C3 is identical to D3. C2 should be identical to D2 but is not.
The only table I really need is C1.
No error warning appears in my log. The bug is still there when I re-run several times my program after coding a few details differently. This is not my actual code, but a simplification for clarity's sake.
I guess this bug could be caused by the way SAS treats very big table. How can I force my SAS software to resolve this bug?
Best regards.
Axel Renoux, statistician
The number of observations should not matter.
If you are really interested in merging the two datasets by KEY then use data step instead of SQL.
So a "left join" would look something like this as testing the IN= variable from the A dataset will insure that only observations contributed to by A will be output to WANT.
data want;
merge A(in=in1) B(in=in2);
by id;
if in1;
run;
The number of observations in WANT should be equal to or larger than the number of observations in A. It will be larger if there are some values of ID that appear in both A and B but appear multiple times in B.
If you are interested in performance and dataset B is really that small then convert the relationship between ID and TARGET from dataset B into a FORMAT (or an INFORMAT perhaps if ID is a character string and TARGET is numeric) and you will not have to merge them at all or even sort the large dataset A by ID.
Say you created a format named TARGET that converted values of ID into the strings in the variable TARGET. Then you code is just:
data want;
set a;
target = put(id,target.);
run;
If when you do the left join between A and B you get fewer observations then it is because you have reduced the set of unique observations. You said before that ID was unique (so has 5M+ distinct values). Are you sure? What do you get from this query:
create table dups as
select *,count(*) as nobs
from A
group by id
having count(*) > 1
;
If you get zero observations then ID is unique (at least within A) otherwise look at the observations with the same values of ID and try to figure out what is going on.
It sounds like some values of KEY appear in Table B, but not in Table A. You will have to verify whether that is the case.
Is target in both tables or just in B? If in both tables, try changing your GROUP BY to B.target. Is your joining KEY variable character or numeric? If it is numeric and close to 15 digits then you may get precision errors and joins not working as expected.
I cannot figure out want you think those various things are testing.
Can you describe what you are trying to do with each of those steps?
What is the actual overall goal of this exercise?
Why is A sorted by KEY if ID is the variable that uniquely identifies the rows?
Why would you expect C2 to not have missing values of TARGET? You explicitly told SQL in the first step that it was ok to have missing values on B.TARGET by forcing in all observations from A whether or not there was a corresponding observation in B. Plus TARGET might having missing values in table B even on the observations of A that did match something from B.
The number of observations should not matter.
If you are really interested in merging the two datasets by KEY then use data step instead of SQL.
So a "left join" would look something like this as testing the IN= variable from the A dataset will insure that only observations contributed to by A will be output to WANT.
data want;
merge A(in=in1) B(in=in2);
by id;
if in1;
run;
The number of observations in WANT should be equal to or larger than the number of observations in A. It will be larger if there are some values of ID that appear in both A and B but appear multiple times in B.
If you are interested in performance and dataset B is really that small then convert the relationship between ID and TARGET from dataset B into a FORMAT (or an INFORMAT perhaps if ID is a character string and TARGET is numeric) and you will not have to merge them at all or even sort the large dataset A by ID.
Say you created a format named TARGET that converted values of ID into the strings in the variable TARGET. Then you code is just:
data want;
set a;
target = put(id,target.);
run;
If when you do the left join between A and B you get fewer observations then it is because you have reduced the set of unique observations. You said before that ID was unique (so has 5M+ distinct values). Are you sure? What do you get from this query:
create table dups as
select *,count(*) as nobs
from A
group by id
having count(*) > 1
;
If you get zero observations then ID is unique (at least within A) otherwise look at the observations with the same values of ID and try to figure out what is going on.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.