☑ This topic is solved.
Renoux
Obsidian | Level 7

Hi!


I am asked to make sure a bug won’t happen again elsewhere in my code. But I have spent the last two weeks trying to figure out why it happens and I still don’t understand it. Could you please help me figure it out?


(A) Context
I work on SAS 9.4 TS Level 1M3, X64_SRV12 platform, in French.
My table “My_huge_in_table” has 5,746,903,189 rows and 26 columns.


(B) What is supposed to happen
I created the table “STEP_1_FIRST” by joining this table “My_huge_in_table” with my smaller table “my_small_in_table” to add its variable “my_filter_variable”. Then I create the table “THEN_STEP_2” by filtering “STEP_1_FIRST” on my_filter_variable = 1. I thus obtain all 17,772,627 rows that are supposed to be there.


(C) What is not supposed to happen
Now if I do both operations simultaneously, i.e. if I join and filter at the same time, I obtain only 5,720,957 rows in table “BOTH_STEPS_1_AND_2” instead of the 17,772,627 rows that are supposed to be there. This is the bug I am asked to make sure won't happen again elsewhere.

 

(D) Help required

What happens? How can I make sure this won’t happen elsewhere? Splitting the operation in two worked here, but as I don't understand why, it could be sheer luck, so I am not even sure that splitting the same way elsewhere would prevent such a bug from happening again.

 

My log is below. If you could enlighten me, it would be very helpful!

 

Best regards.

 

Axel Renoux

 

499  proc sql;
500  create table STEP_1_FIRST as
501      select   t.*
502              ,f.my_filter_variable
503      from        my_huge_in_table         t
504      inner join  my_small_in_table        f
505          on  t.id           =   f.id
506      ;
NOTE: Table WORK.STEP_1_FIRST created, with 22430607 rows and 27 columns.

507  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           17:32.70
      cpu time            17:32.76


508  proc sql;
509  create table THEN_STEP_2 as
510      select   *
511      from    STEP_1_FIRST
512      where   my_filter_variable   =   1
513      ;
NOTE: Table WORK.THEN_STEP_2 created, with 17772627 rows and 27 columns.

514  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           5.38 seconds
      cpu time            5.39 seconds


515  proc sql;
516  create table BOTH_STEPS_1_AND_2 as
517      select t.*
518      from        my_huge_in_table         t
519      inner join  my_small_in_table        f
520          on  t.id           =   f.id
521      where   f.my_filter_variable =   1
522      ;
NOTE: Table WORK.BOTH_STEPS_1_AND_2 created, with 5720957 rows and 26 columns.

523  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           5:41:30.52
      cpu time            1:23:56.15

 

1 ACCEPTED SOLUTION (by Tom, Super User): see the reply below.

14 REPLIES
Kurt_Bremser
Super User

First, run a few checks. See what kind of relationship with regard to ID you have (one-to-many, many-to-one, many-to-many).

Count the results in both datasets per ID, then compare the two to see if IDs are missing from the smaller one, or where certain IDs have fewer results in it. Then look at the source observations for non-matching IDs to see a pattern.
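A minimal sketch of those counting checks, using the table and column names from the post above (untested here, since I don't have the data):

```sas
/* Per-ID row counts in both result tables, then the IDs that are
   missing from the one-step result or lost rows in it. */
proc sql;
  create table counts_two_step as
    select id, count(*) as n_two_step
    from then_step_2
    group by id;

  create table counts_one_step as
    select id, count(*) as n_one_step
    from both_steps_1_and_2
    group by id;

  create table id_diffs as
    select a.id,
           a.n_two_step,
           coalesce(b.n_one_step, 0) as n_one_step
    from counts_two_step a
    left join counts_one_step b
      on a.id = b.id
    where coalesce(b.n_one_step, 0) < a.n_two_step;
quit;
```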

 

What do you get when you run this:

data check;
set my_huge_in_table;
if _n_ = 1
then do;
  /* load the IDs that pass the filter into a hash table */
  declare hash f (dataset:"my_small_in_table (where=(my_filter_variable = 1))");
  f.definekey("id");
  f.definedone();
end;
/* keep only rows whose id is found in the hash (semi-join) */
if f.check() = 0;
run;

?

 

Renoux
Obsidian | Level 7

Thank you very much @Kurt_Bremser for helping me!

 

 

See what kind of relationship with regard to ID you have (one-to-many, many-to-one, many-to-many).

It is many-to-one: ID is a foreign key in my_huge_in_table referencing the primary key of my_small_in_table.

 

 

 


Count the results in both datasets per ID, then compare these two to see if IDs are missing in the smaller one, or where certain IDs have less results in it. Then look at the source observations for non-matching IDs to see a pattern.

There are 21,789 distinct IDs in my_small_in_table, with 1 row each (21,789 rows). No missing ID.


There are 21,564 distinct IDs in my_huge_in_table for 22,430,607 rows. No missing ID. All 21,564 IDs of my_huge_in_table are also in my_small_in_table.

16,641 of those 21,564 IDs represented in my_huge_in_table have (my_filter_variable = 1) in my_small_in_table. This represents 17,772,627 rows of my_huge_in_table.
The other 4,923 IDs have (my_filter_variable = 0) in my_small_in_table and represent 4,657,980 rows of my_huge_in_table.

 

In my table BOTH_STEPS_1_AND_2 there are 5,720,957 of the 17,772,627 rows supposed to be there, belonging to 13,575 of the 16,641 IDs supposed to be there.
The missing 12,051,670 rows belong to 15,131 of the 16,641 IDs supposed to be there.

 

I haven't found any pattern yet.
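One way to hunt for a pattern would be to pull the missing rows themselves into a table for inspection. A sketch with the table names above: EXCEPT CORR aligns columns by name and keeps only those common to both tables, so the extra my_filter_variable column in THEN_STEP_2 is ignored (add ALL after EXCEPT to keep duplicate rows).

```sas
/* Rows present in the two-step result but absent from the one-step one */
proc sql;
  create table missing_rows as
    select * from then_step_2
    except corr
    select * from both_steps_1_and_2;
quit;
```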

 

 

 


What do you get when you run this:
data check;
set my_huge_in_table;
if _n_ = 1
then do;
  declare hash f (dataset:"my_small_in_table (where=(my_filter_variable = 1))");
  f.definekey("id");
  f.definedone();
end;
if f.check() = 0;
run;

?


 

NOTE: There were 16641 observations read from the data set WORK.MY_SMALL_IN_TABLE.
      WHERE my_filter_variable=1;
NOTE: There were 22430607 observations read from the data set WORK.MY_HUGE_IN_TABLE.
NOTE: The data set WORK.CHECK has 17772627 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 9.25 seconds
cpu time 5.78 seconds

 

 

I must confess I have not yet followed your advice to read SAS® Hash Object Programming Made Easy by Michele M. Burlew, so I can't do hash programming on my own.

Kurt_Bremser
Super User

So my hash check reveals that the larger number is correct. From my understanding, your second SQL should work like the first and lead to the same result (I did some experiments with fake data).

I also see now that your second, all-in-one attempt takes EXTREMELY more time than the first, two-step method. This is VERY suspicious.

Is your data actually coming from an external database (and you masked the names for privacy reasons)? This could mean that your query is translated and sent to the DB, which acts funny under certain circumstances.

If all this really happens in WORK, then something very fishy (IMO) is going on here. A 20-fold time increase while basically performing the same join is alarming.

Anyway, since nobody else chimed in who could shed light on this, I suggest you open a track with SAS technical support. Given their NDA with regards to customer data, you can share all details with them.

 

The hash objects are a VERY useful, rather recent addition to the data step language. I can only recommend in the warmest way that you acquaint yourself with them. I have been able to get significant performance gains and simpler, shorter code in production jobs in my professional work. Just look at the time the hash step took for the same operation. Less than 10 seconds vs. more than 17 minutes with SQL. Also see my Maxim 10.

There was (and still is) a reason why I trust DATA steps more than SQL, and use SQL in production jobs only when absolutely necessary, or when it simply CAN'T do something wrong. But I never do complex things in SQL; it performs like wading through molasses.

FreelanceReinh
Jade | Level 19

Hi @Renoux,

 

I would try to narrow down the possible reason for the strange result by reducing the datasets to (ideally) only a few IDs, hoping that these small samples behave similarly to the original datasets. (If they don't, this will be another hint.)

 

For example, you wrote


In my table BOTH_STEPS_1_AND_2 there are 5,720,957 of the 17,772,627 rows supposed to be there, belonging to 13,575 of the 16,641 ID supposed to be there.
The missing 12,051,670 rows belong to 15,131 of the 16,641 ID supposed to be there.


This implies that there are at least (13,575+15,131-16,641=) 12,065 IDs for which, strangely enough, only some, but not all observations from my_huge_in_table make it into both_steps_1_and_2. It would be interesting to see if this still happens if my_huge_in_table and my_small_in_table are replaced by their restrictions to just one of those IDs.

 

Even a reduction of the dataset sizes by a factor of 1000 or so (e.g., by drawing a random sample from the set of IDs) as an intermediate step would likely facilitate further investigations.
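A sketch of such a reduction, drawing a random ~0.1% of the IDs with a fixed seed so the sample is reproducible (table names are the ones from this thread):

```sas
/* Sample the IDs, then restrict both tables to that sample */
proc sql;
  create table id_sample as
    select id
    from my_small_in_table
    where ranuni(12345) < 0.001;  /* fixed seed for reproducibility */

  create table small_sample as
    select * from my_small_in_table
    where id in (select id from id_sample);

  create table huge_sample as
    select * from my_huge_in_table
    where id in (select id from id_sample);
quit;
```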

Kurt_Bremser
Super User

What puzzles me is that I can see no logical difference between the two-step SQL, the one-step SQL and the data step/hash (for relationships like the OP posted). I was actually hoping that a real SAS SQL expert (can you get one, @ChrisHemedinger ?) chimes in and provides a solid reason why the one-step SQL does not work.

FreelanceReinh
Jade | Level 19

@Kurt_Bremser wrote:

What puzzles me is that I can see no logical difference between the two-step SQL, the one-step SQL and the data step/hash (for relationships like the OP posted).


The only scenario where I saw a decrease in the number of observations (like the OP did) in my test cases was if the "huge" table already contained a variable named my_filter_variable. But this would have caused a warning message in the log, unless the OP had used the NOWARN option of the PROC SQL statement, neither of which appears to be the case. (Edit: Also the number of variables wouldn't have dropped then.)

andreas_lds
Jade | Level 19

I am not an SQL expert, but maybe using the FEEDBACK option in the PROC SQL statement reveals what's going on here. If everything is in WORK, then the impressive runtime increase (the first two steps compared with the third) is indeed alarming; contacting your local SAS admins is recommended, to let them check how WORK is set up (OS compression should be disabled). Checking the memory setup (proc options group="memory"; run;) may reveal another bottleneck. Maybe the third SQL call needs too much memory, so that swapping takes place, ruining the runtime.

In the end, i would contact tech support, they won't guess like i did 😉
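For reference, a sketch of those two checks on the query from the original post (FEEDBACK is a documented PROC SQL statement option that writes the expanded query, with t.* spelled out, to the log):

```sas
proc sql feedback;
  create table both_steps_1_and_2 as
    select t.*
    from my_huge_in_table t
    inner join my_small_in_table f
      on t.id = f.id
    where f.my_filter_variable = 1;
quit;

/* Memory-related system options currently in effect */
proc options group=memory;
run;
```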

Renoux
Obsidian | Level 7

Thank you very much for your answer. This is so kind of you all! Sorry I didn't answer yesterday, it was my day off.

 

 

@Kurt_Bremser  wrote:
Is your data actually coming from an external database (and you masked the names for privacy reasons)? This could mean that your query is translated and sent to the DB, which acts funny under certain circumstances.

Yes, I did mask the names for privacy reasons. The data are in tables I create myself (via DATA steps or SQL procedures), in libraries I create myself by stating libname itsmylib "D:/blablabla/mylib"; without any extra option.

 


@FreelanceReinh wrote:

I would try to narrow down the possible reason of the strange result by reducing the datasets to (ideally) only a few IDs, hoping that these small samples behave similarly as the original datasets. (If they don't, this will be another hint.)

[...]
It would be interesting to see if this still happens if my_huge_in_table and my_small_in_table are replaced by their restrictions to just one of those IDs.

 

Even a reduction of the dataset sizes by a factor of 1000 or so (e.g., by drawing a random sample from the set of IDs) as an intermediate step would likely facilitate further investigations.



Actually, those were already datasets reduced by a factor of 10.

 

Let's select 3 IDs in "my_small_in_table": id1, id2 and id3. Each has 4 rows in table "my_huge_in_table". id1 has all 4 rows present in table "both_steps_1_and_2"; id2 has all 4 rows missing from table "both_steps_1_and_2"; id3 has 3 rows missing and 1 row present in table "both_steps_1_and_2".

 

I create the table "my_not_so_huge_subset" with those 4+4+4=12 rows. I run my SQL procedure again. The problem disappears, and new notes appear:

 

NOTE: This SAS session is using a registry in WORK.  All changes will be lost at the end of
      this session.
3211  proc sql;
3212  create table my_not_so_huge_subset as select * from my_huge_in_table where num_enq in
3212! ("id1","id2","id3");
NOTE: Table my_not_so_huge_subset created, with 12 rows and 5 columns.

3213  quit;
NOTE: At least one W.D format was too small for the number to be printed. The decimal may be
      shifted by the "BEST" format.
NOTE: PROCEDURE SQL used (Total process time):
      real time           2.92 seconds
      cpu time            1.23 seconds

3214  proc sql;
3215  create table both_steps_1_and_2_in_subset as
3216      select distinct t.*
3217      from        my_not_so_huge_subset                   t
3218      inner join  my_small_in_table        f
3219          on  t.id           =   f.id
3229  order by id
3232      ;
NOTE: Table both_steps_1_and_2_in_subset created, with 12 rows and 5 columns.

3233  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.05 seconds
      cpu time            0.04 seconds

 


@andreas_lds wrote:

I am not a sql-expert, but maybe using "feedback" option in the proc sql statement reveals what's going on here. If everything is in WORK, that the impressive increased runtime (first two compared with third) is indeed alarming, contacting your local sas admins recommended to let them check, how work is setup (os-compression should be disabled). Checking memory setup (proc options group="memory";run;) may reveal another bottleneck. Maybe the third sql-call needs to much memory, so that swapping takes place, ruining the runtime.

In the end, i would contact tech support, they won't guess like i did 😉

I should probably contact tech support.

I tried the FEEDBACK option in the SQL procedure. It's still running. For now it only tells me "NOTE: Statement transforms to:" and then writes my SQL query as before, except that it expands all the variables contained in "t.*". Maybe there will be more information when it finishes running.


567
568  proc options group=memory; run;

    SAS (r) Proprietary Software Version 9.4  TS1M3


Group=MEMORY
 SORTSIZE=1073741824
                   Specifies the amount of memory that is available to the SORT procedure.
 SUMSIZE=0         Specifies a limit on the amount of memory that is available for data
                   summarization procedures when class variables are active.
 MAXMEMQUERY=0     Specifies the maximum amount of memory that is allocated for procedures.
 MEMBLKSZ=16777216 Specifies the memory block size for Windows memory-based libraries.
 MEMMAXSZ=2147483648
                   Specifies the maximum amount of memory to allocate for using memory-based
                   libraries.
 LOADMEMSIZE=0     Specifies a suggested amount of memory that is needed for executable
                   programs loaded by SAS.
 MEMSIZE=2147483648
                   Specifies the limit on the amount of virtual memory that can be used during
                   a SAS session.
 REALMEMSIZE=0     Specifies the amount of real memory SAS can expect to allocate.
NOTE: PROCEDURE OPTIONS used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds
Kurt_Bremser
Super User

So you do not use any external database, everything happens in SAS. This is important, as it strengthens my argument that this looks like a bug in SAS SQL.

FreelanceReinh
Jade | Level 19

Thanks for implementing the suggestions.


@Renoux wrote:
3214  proc sql;
3215  create table both_steps_1_and_2_in_subset as
3216      select distinct t.*
3217      from        my_not_so_huge_subset                   t
3218      inner join  my_small_in_table        f
3219          on  t.id           =   f.id
3229  order by id
3232      ;
NOTE: Table both_steps_1_and_2_in_subset created, with 12 rows and 5 columns.

3233  quit;

Isn't the WHERE condition f.my_filter_variable=1 essential for creating the "both_steps..." dataset?

Tom
Super User

I see two differences.

First, you have eliminated one of the variables from the output dataset.

Second, you are using WHERE instead of HAVING in the second query.  WHERE applies to the data on the way in. HAVING applies to the data on the way out.

 

What happens when you apply the WHERE to the first query?

create table STEP_1_FIRST_subset as
  select t.*
       , f.my_filter_variable
  from my_huge_in_table t
  inner join 
    (select id,my_filter_variable from  my_small_in_table
     where my_filter_variable=1
    ) f
  on  t.id = f.id
;

To see if your lookup table has duplicates do:

select max(nobs) from
  (select id,count(*) as nobs from my_small_in_table group by id)
;
PGStats
Opal | Level 21

If ID is indeed a primary key in my_small_in_table, then this should be an equivalent query:

 

proc sql;
create table filtered_table as
select * 
from my_huge_in_table
where ID in 
	(select ID from my_small_in_table where my_filter_variable = 1 );
quit;

and might shed some light (or confusion!) on what's going on here.

PG
Renoux
Obsidian | Level 7

Hi!

Thank you for all your useful advice.

The time my SQL procedure takes to run, often more than an hour, makes it hard for me to test every lead you give me. To speed things up, I first try to reduce my sample size, checking that the issue remains in the smaller sample. Sometimes it does, but often it doesn't, depending on how I create the subsample. To make sure I have a subsample on which the issue remains, I am now taking the long way and recreating my subsample from the source. This will take all weekend to run, so I will only be able to test the leads you gave me on Monday.

One thing I noticed though: the issue disappears if I add "proc sort data=my_huge_in_table; by id; run;" before my SQL procedure.

Regards.

Axel Renoux

Tom
Super User

One thing I noticed though : the issue disappears if I add "proc sort data=my_huge_in_table; by id; run;" before my sql procedure.

That might indicate that the method chosen by PROC SQL to implement the query is having an impact.  Whether or not a source is sorted is information that PROC SQL will use to decide which way to process the data.

 

If you do not find an aspect of the data that is the actual cause of your issue then this type of observation is something that you give to SAS support to help them help you.
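A sketch of how to surface that method choice on the query from this thread: _METHOD and _TREE are PROC SQL diagnostic options that print the chosen access plan to the log (e.g. sqxjm = sort-merge join, sqxjhsh = hash join, sqxjndx = index join). Running it on the unsorted and on the pre-sorted table should show whether the plan changes.

```sas
proc sql _method _tree;
  create table both_steps_1_and_2 as
    select t.*
    from my_huge_in_table t
    inner join my_small_in_table f
      on t.id = f.id
    where f.my_filter_variable = 1;
quit;
```

The resulting plan, together with the row counts, is exactly the kind of detail worth attaching to a tech support track.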
