Hi!
I am asked to make sure a bug won’t happen again elsewhere in my code, but I have spent the last two weeks trying to figure out why it happens and I still don’t understand it. Could you please help me figure it out?
(A) Context
I work on SAS 9.4 TS Level 1M3, X64_SRV12 platform, in French.
My table “My_huge_in_table” has 5,746,903,189 rows and 26 columns.
(B) What is supposed to happen
I created the table “STEP_1_FIRST” by joining this table “My_huge_in_table” with my smaller table “my_small_in_table” to add its variable “my_filter_variable”. Then I created the table “THEN_STEP_2” by filtering “STEP_1_FIRST” on my_filter_variable = 1. Hence I obtain all 17,772,627 rows that are supposed to be there.
(C) What is not supposed to happen
Now if I do both operations simultaneously, i.e. if I join and filter at the same time, I only obtain 5,720,957 rows in table “BOTH_STEPS_1_AND_2” instead of the 17,772,627 rows that are supposed to be there. This is the bug I am asked to make sure won't happen again elsewhere.
(D) Help required
What is happening? How can I make sure this won’t happen elsewhere? Splitting the operation in two worked here, but as I don't understand why, it could be sheer luck, so I am not even sure that splitting in the same way elsewhere would prevent such a bug from happening again.
My log is below. If you could please enlighten me, it would be very helpful!
Best regards.
Axel Renoux
499  proc sql;
500  create table STEP_1_FIRST as
501  select t.*
502  ,f.my_filter_variable
503  from my_huge_in_table t
504  inner join my_small_in_table f
505  on t.id = f.id
506  ;
NOTE: Table WORK.STEP_1_FIRST created, with 22430607 rows and 27 columns.
507  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           17:32.70
      cpu time            17:32.76
508  proc sql;
509  create table THEN_STEP_2 as
510  select *
511  from STEP_1_FIRST
512  where my_filter_variable = 1
513  ;
NOTE: Table WORK.THEN_STEP_2 created, with 17772627 rows and 27 columns.
514  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           5.38 seconds
      cpu time            5.39 seconds
515  proc sql;
516  create table BOTH_STEPS_1_AND_2 as
517  select t.*
518  from my_huge_in_table t
519  inner join my_small_in_table f
520  on t.id = f.id
521  where f.my_filter_variable = 1
522  ;
NOTE: Table WORK.BOTH_STEPS_1_AND_2 created, with 5720957 rows and 26 columns.
523  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           5:41:30.52
      cpu time            1:23:56.15
First, run a few checks. See what kind of relationship with regard to ID you have (one-to-many, many-to-one, many-to-many).
Count the results in both datasets per ID, then compare these two to see if IDs are missing in the smaller one, or where certain IDs have less results in it. Then look at the source observations for non-matching IDs to see a pattern.
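The per-ID comparison described above can be sketched roughly as follows. This is plain Python over made-up toy records standing in for the two SAS datasets, not the OP's real data:

```python
from collections import Counter

# Hypothetical stand-ins for the two datasets (toy records, not the real data)
huge  = [{"id": "a"}, {"id": "a"}, {"id": "b"}, {"id": "c"}]
small = [{"id": "a"}, {"id": "b"}]

# Count rows per ID in each dataset
counts_huge  = Counter(r["id"] for r in huge)
counts_small = Counter(r["id"] for r in small)

# IDs present in the big table but missing from the small one
missing_in_small = sorted(set(counts_huge) - set(counts_small))
print(missing_in_small)  # ['c']
```

In SAS the same counts would come from a `group by id` query or PROC FREQ on each table; the point is simply to compare the two per-ID tallies.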
What do you get when you run this:
data check;
  set my_huge_in_table;
  /* on the first observation, load the IDs with my_filter_variable = 1 into a hash */
  if _n_ = 1 then do;
    declare hash f (dataset:"my_small_in_table (where=(my_filter_variable = 1))");
    f.definekey("id");
    f.definedone();
  end;
  /* keep only observations whose id is found in the hash */
  if f.check() = 0;
run;
?
Thank you very much @Kurt_Bremser for helping me!
See what kind of relationship with regard to ID you have (one-to-many, many-to-one, many-to-many).
It is many-to-one: ID is a foreign key in my_huge_in_table referencing the primary key of my_small_in_table.
Count the results in both datasets per ID, then compare these two to see if IDs are missing in the smaller one, or where certain IDs have less results in it. Then look at the source observations for non-matching IDs to see a pattern.
There are 21,789 distinct IDs in my_small_in_table, with 1 row each (21,789 rows). No missing IDs.
There are 21,564 distinct IDs in my_huge_in_table for 22,430,607 rows. No missing IDs. All 21,564 IDs of my_huge_in_table are also in my_small_in_table.
16,641 of those 21,564 IDs represented in my_huge_in_table have (my_filter_variable = 1) in my_small_in_table. This represents 17,772,627 rows of my_huge_in_table.
The other 4,923 IDs have (my_filter_variable = 0) in my_small_in_table and represent 4,657,980 rows of my_huge_in_table.
In my table BOTH_STEPS_1_AND_2 there are 5,720,957 of the 17,772,627 rows supposed to be there, belonging to 13,575 of the 16,641 IDs supposed to be there.
The missing 12,051,670 rows belong to 15,131 of the 16,641 IDs supposed to be there.
I haven't found yet any pattern.
What do you get when you run this:
data check;
set my_huge_in_table;
if _n_ = 1
then do;
declare hash f (dataset:"my_small_in_table (where=(my_filter_variable = 1))");
f.definekey("id");
f.definedone();
end;
if f.check() = 0;
run;
?
NOTE: There were 16641 observations read from the data set WORK.MY_SMALL_IN_TABLE.
      WHERE my_filter_variable=1;
NOTE: There were 22430607 observations read from the data set WORK.MY_HUGE_IN_TABLE.
NOTE: The data set WORK.CHECK has 17772627 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 9.25 seconds
cpu time 5.78 seconds
I shall confess I have not yet followed your advice to read SAS® Hash Object Programming Made Easy by Michele M. Burlew, hence I can't do hash programming on my own.
So my hash check reveals that the larger number is correct. From my understanding, your second SQL should work like the first and lead to the same result (I did some experiments with fake data).
I also see now that your second, all-in-one attempt takes VASTLY more time than the first, two-step method. This is VERY suspicious.
Is your data actually coming from an external database (and you masked the names for privacy reasons)? This could mean that your query is translated and sent to the DB, which acts funny under certain circumstances.
If all this really happens in WORK, then something very fishy (IMO) is going on here. A 20-fold time increase while basically performing the same join is alarming.
Anyway, since nobody else chimed in who could shed light on this, I suggest you open a track with SAS technical support. Given their NDA with regards to customer data, you can share all details with them.
The hash objects are a VERY useful, rather recent addition to the data step language. I can only recommend in the warmest way that you acquaint yourself with them. I have been able to get significant performance gains and simpler, shorter code in production jobs in my professional work. Just look at the time the hash step took for the same operation. Less than 10 seconds vs. more than 17 minutes with SQL. Also see my Maxim 10.
There was (and still is) a reason why I trust DATA steps more than SQL, and use SQL in production jobs only when absolutely necessary, or when they simply CAN'T do something wrong. But I never did complex things in SQL, it performs like wading through molasses.
Hi @Renoux,
I would try to narrow down the possible reason of the strange result by reducing the datasets to (ideally) only a few IDs, hoping that these small samples behave similarly as the original datasets. (If they don't, this will be another hint.)
For example, you wrote
In my table BOTH_STEPS_1_AND_2 there are 5,720,957 of the 17,772,627 rows supposed to be there, belonging to 13,575 of the 16,641 IDs supposed to be there.
The missing 12,051,670 rows belong to 15,131 of the 16,641 IDs supposed to be there.
This implies that there are at least (13,575+15,131-16,641=) 12,065 IDs for which, strangely enough, only some, but not all observations from my_huge_in_table make it into both_steps_1_and_2. It would be interesting to see if this still happens if my_huge_in_table and my_small_in_table are replaced by their restrictions to just one of those IDs.
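The bound above is plain inclusion-exclusion over the 16,641 expected IDs; here is a one-off arithmetic check in Python using the numbers from the thread:

```python
# IDs with at least one row present / at least one row missing in
# BOTH_STEPS_1_AND_2, out of the 16,641 IDs that should all be fully there.
ids_partly_present = 13_575
ids_partly_missing = 15_131
ids_expected       = 16_641

# |present ∩ missing| >= |present| + |missing| - |universe|
min_both = ids_partly_present + ids_partly_missing - ids_expected
print(min_both)  # 12065
```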
Even a reduction of the dataset sizes by a factor of 1000 or so (e.g., by drawing a random sample from the set of IDs) as an intermediate step would likely facilitate further investigations.
What puzzles me is that I can see no logical difference between the two-step SQL, the one-step SQL and the data step/hash (for relationships like the OP posted). I was actually hoping that a real SAS SQL expert (can you get one, @ChrisHemedinger ?) chimes in and provides a solid reason why the one-step SQL does not work.
@Kurt_Bremser wrote:
What puzzles me is that I can see no logical difference between the two-step SQL, the one-step SQL and the data step/hash (for relationships like the OP posted).
The only scenario where I saw a decrease in the number of observations (like the OP did) in my test cases was if the "huge" table already contained a variable named my_filter_variable. But this would have caused a warning message in the log, unless the OP had used the NOWARN option of the PROC SQL statement, neither of which appears to be the case. (Edit: Also the number of variables wouldn't have dropped then.)
I am not an SQL expert, but maybe using the "feedback" option in the PROC SQL statement reveals what's going on here. If everything is in WORK, then the impressive runtime increase (the first two steps compared with the third) is indeed alarming; contacting your local SAS admins is recommended so they can check how WORK is set up (OS compression should be disabled). Checking the memory setup (proc options group="memory"; run;) may reveal another bottleneck. Maybe the third SQL call needs too much memory, so that swapping takes place, ruining the runtime.
In the end, I would contact tech support; they won't guess like I did 😉
Thank you very much for your answer. This is so kind of you all! Sorry I didn't answer yesterday, it was my day off.
@Kurt_Bremser wrote:
Is your data actually coming from an external database (and you masked the names for privacy reasons)? This could mean that your query is translated and sent to the DB, which acts funny under certain circumstances.
Yes, I did mask the names for privacy reasons. The data are in tables I create myself (with data steps or SQL procedures) in libraries I create myself by stating libname itsmylib "D:/blablabla/mylib"; without any extra options.
@FreelanceReinh wrote:
I would try to narrow down the possible reason of the strange result by reducing the datasets to (ideally) only a few IDs, hoping that these small samples behave similarly as the original datasets. (If they don't, this will be another hint.)
[...]
It would be interesting to see if this still happens if my_huge_in_table and my_small_in_table are replaced by their restrictions to just one of those IDs.
Even a reduction of the dataset sizes by a factor of 1000 or so (e.g., by drawing a random sample from the set of IDs) as an intermediate step would likely facilitate further investigations.
Actually, those datasets were already reduced by a factor of 10.
Let's select 3 IDs in "my_small_in_table": id1, id2 and id3. Each has 4 rows in table "my_huge_in_table". id1 has all 4 rows present in table "both_steps_1_and_2"; id2 has all 4 rows missing from table "both_steps_1_and_2"; id3 has 3 rows missing and 1 row present in table "both_steps_1_and_2".
I create the table "my_not_so_huge_subset" with those 4+4+4=12 rows. I run my SQL procedure again. The problem disappears and new notes appear:
NOTE: This SAS session is using a registry in WORK. All changes will be lost at the end of this session.
3211  proc sql;
3212  create table my_not_so_huge_subset as select * from my_huge_in_table where num_enq in
3212! ("id1","id2","id3");
NOTE: Table my_not_so_huge_subset created, with 12 rows and 5 columns.
3213  quit;
NOTE: At least one W.D format was too small for the number to be printed. The decimal may be shifted by the "BEST" format.
NOTE: PROCEDURE SQL used (Total process time):
      real time           2.92 seconds
      cpu time            1.23 seconds
3214  proc sql;
3215  create table both_steps_1_and_2_in_subset as
3216  select distinct t.*
3217  from my_not_so_huge_subset t
3218  inner join my_small_in_table f
3219  on t.id = f.id
3229  order by id
3232  ;
NOTE: Table both_steps_1_and_2_in_subset created, with 12 rows and 5 columns.
3233  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.05 seconds
      cpu time            0.04 seconds
@andreas_lds wrote:
I am not a sql-expert, but maybe using "feedback" option in the proc sql statement reveals what's going on here. If everything is in WORK, that the impressive increased runtime (first two compared with third) is indeed alarming, contacting your local sas admins recommended to let them check, how work is setup (os-compression should be disabled). Checking memory setup (proc options group="memory";run;) may reveal another bottleneck. Maybe the third sql-call needs to much memory, so that swapping takes place, ruining the runtime.
In the end, i would contact tech support, they won't guess like i did 😉
I should probably contact tech support.
I tried the feedback option in the SQL procedure. It's still running. For now it only tells me "NOTE: Statement transforms to:" and then writes my SQL request as before, except that it expands all the variables contained in "t.*". Maybe there will be more information when it finishes running.
567
568  proc options group=memory; run;

SAS (r) Proprietary Software Version 9.4  TS1M3

Group=MEMORY
 SORTSIZE=1073741824   Specifies the amount of memory that is available to the SORT procedure.
 SUMSIZE=0             Specifies a limit on the amount of memory that is available for data summarization procedures when class variables are active.
 MAXMEMQUERY=0         Specifies the maximum amount of memory that is allocated for procedures.
 MEMBLKSZ=16777216     Specifies the memory block size for Windows memory-based libraries.
 MEMMAXSZ=2147483648   Specifies the maximum amount of memory to allocate for using memory-based libraries.
 LOADMEMSIZE=0         Specifies a suggested amount of memory that is needed for executable programs loaded by SAS.
 MEMSIZE=2147483648    Specifies the limit on the amount of virtual memory that can be used during a SAS session.
 REALMEMSIZE=0         Specifies the amount of real memory SAS can expect to allocate.
NOTE: PROCEDURE OPTIONS used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds
So you do not use any external database, everything happens in SAS. This is important, as it strengthens my argument that this looks like a bug in SAS SQL.
Thanks for implementing the suggestions.
@Renoux wrote:
3214  proc sql;
3215  create table both_steps_1_and_2_in_subset as
3216  select distinct t.*
3217  from my_not_so_huge_subset t
3218  inner join my_small_in_table f
3219  on t.id = f.id
3229  order by id
3232  ;
NOTE: Table both_steps_1_and_2_in_subset created, with 12 rows and 5 columns.
3233  quit;
Isn't the WHERE condition f.my_filter_variable=1 essential for creating the "both_steps..." dataset?
I see two differences.
First you have eliminated one of the variables from the output dataset.
Second, you are using WHERE instead of HAVING in the second query. WHERE applies to the data on the way in; HAVING applies to the data on the way out.
What happens when you apply the WHERE to the first query?
create table STEP_1_FIRST_subset as
select t.*
, f.my_filter_variable
from my_huge_in_table t
inner join
(select id,my_filter_variable from my_small_in_table
where my_filter_variable=1
) f
on t.id = f.id
;
To see if your lookup table has duplicates do:
select max(nobs) from
(select id,count(*) as nobs from my_small_in_table group by id)
;
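The max-count query above can be mimicked in a couple of lines of Python over hypothetical records; if the maximum is 1, the lookup key is unique:

```python
from collections import Counter

# Hypothetical lookup rows; in the real case this is my_small_in_table
small = [{"id": "a"}, {"id": "b"}, {"id": "c"}]

# Largest number of rows sharing one id (SQL: max of count(*) group by id)
max_nobs = max(Counter(r["id"] for r in small).values())
print(max_nobs)  # 1 -> no duplicate keys
```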
If ID is indeed a primary key in my_small_in_table, then this should be an equivalent query:
proc sql;
create table filtered_table as
select *
from my_huge_in_table
where ID in
(select ID from my_small_in_table where my_filter_variable = 1 );
quit;
and might shed some light (or confusion!) on what's going on here.
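For what it's worth, the reformulations suggested above all rest on the same relational fact: when the lookup table has exactly one row per ID, filtering before the join, filtering after the join, and an IN subquery (semi-join) should all select the same rows. A toy sketch of that logic in plain Python (hypothetical data, not the OP's SAS tables):

```python
# Toy data: many rows per id in "huge", exactly one row per id in "small".
huge  = [{"id": i % 3, "val": i} for i in range(9)]
small = [{"id": 0, "flag": 1}, {"id": 1, "flag": 0}, {"id": 2, "flag": 1}]
flag  = {r["id"]: r["flag"] for r in small}     # one row per id

# 1) Join first, then filter (the two-step method)
joined   = [{**t, "flag": flag[t["id"]]} for t in huge if t["id"] in flag]
two_step = [t["val"] for t in joined if t["flag"] == 1]

# 2) Filter inside the join (the one-step method)
one_step = [t["val"] for t in huge
            if t["id"] in flag and flag[t["id"]] == 1]

# 3) IN-subquery style semi-join
keep     = {r["id"] for r in small if r["flag"] == 1}
subquery = [t["val"] for t in huge if t["id"] in keep]

# All three must agree; the posted SAS results violate this equivalence.
assert two_step == one_step == subquery
print(two_step)  # [0, 2, 3, 5, 6, 8]
```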
Hi!
Thank you for your many useful suggestions.
The time my SQL procedure takes to run, often more than an hour, makes it hard for me to test every lead you give me. To speed things up, I am first trying to reduce my sample size, checking that the issue remains in the smaller sample. Sometimes it does, but often it doesn't, depending on how I create my subsample. To make sure I have a subsample in which the issue remains, I am now taking the long way and recreating my subsample from the source. This will take all weekend to run, so I will only be able to test the leads you gave me on Monday.
One thing I noticed, though: the issue disappears if I add "proc sort data=my_huge_in_table; by id; run;" before my SQL procedure.
Regards.
Axel Renoux
One thing I noticed, though: the issue disappears if I add "proc sort data=my_huge_in_table; by id; run;" before my SQL procedure.
That might indicate that the method chosen by PROC SQL to implement the query is having an impact. Whether or not a source is sorted is information that PROC SQL will use to decide which way to process the data.
If you do not find an aspect of the data that is the actual cause of your issue then this type of observation is something that you give to SAS support to help them help you.