I just started working as an analyst within a bank. Since they work in databases on different platforms and since the leading analyst has a huge workload, I am tasked with exploring SAS and its possibilities.
One of these databases is a java webbased database build in a program called QIT. This database contains every dossier that has been opened for our customers. This database contains information such as:
- kind_of_document (AVE, DOK, FAK, TRIGGER,...)
- document_handler (Name of the employee of the bank who is handling this dossier.)
- r_creation_date (The date on which the dossier has been created.)
- document_state (CANCELED, HOLD, TODO, TREATED)
The exercise is to create a table in SAS that checks all AVE related dossiers with a document_state of TODO or HOLD, sorted by document_handler. (See results below.) Since I have no access to the QIT database yet, they have given me an Excel file containing said information for the month February. At the end of my exercise, I can easily double-check my results through a similar query in QIT. I noticed that my SAS query yields 45 rows as a result, whereas my QIT query yields 67 rows as a result. I have to be doing something wrong.
QIT query:
select document_handler, document_state, count (*) as amount_of_documents, min (r_creation_Date) as oldest_document
from atp_document
where document_state in ('TODO', 'HOLD')
and kind_of_document like '%AVE%'
and r_creation_date < date ('01/03/2016')
and r_creation_date > date ('31/01/2016')
group by document_handler, document_state
SAS query:
PROC SQL;
CREATE TABLE Atropos.Open_Doc AS
SELECT document_handler, document_state, COUNT(*) AS number_of_documents, MIN(r_creation_date) AS oldest_document FORMAT=DATE7.
FROM Atropos.Documents (=imported Excel file containing the information for the month February)
WHERE document_state IN ('TODO','HOLD')
AND kind_of_document LIKE '%AVE%'
GROUP BY document_handler, document_state;
QUIT;
Result:
Comparing both tables with each other tells me a few things.
- I am missing 22 document_handlers in my SAS table
- The number_of_documents isn't always correct for every document_handler (for example, my query yields 2 documents for handler A and in QIT my query yields 4 documents for handler A
This is the result from QIT and this is CORRECT! (First 4 rows only.)
document_handler | document_state | amount_of_documents | oldest_document |
Handler A | HOLD | 1.0 | 23/02/2016 10:14:31 |
Handler B | HOLD | 4.0 | 19/02/2016 16:39:50 |
Handler C | TODO | 2.0 | 24/02/2016 11:12:46 |
Handler C | HOLD | 4.0 | 2/02/2016 15:43:21 |
This is the result from SAS and this is WRONG! (First 4 rows only.)
document_handler | document_state | amount_of_documents | oldest_document |
Handler B | HOLD | 2.0 | 19/02/2016 |
Handler C | TODO | 1.0 | 29/02/2016 |
Handler D | HOLD | 2.0 | 19/11/2016 |
Handler E | HOLD | 2.0 | 06/02/2016 |
I think something is wrong with the COUNT command, but I cannot seem to put my finger on it. Any help is greatly appreciated.
When you are missing group values, your data can't be correct. Just check the Excel if the missing document_handler values are there at all.
If a missing document_handler is there, look at the other criteria.
Since Excel is NOT suited for data transfer, you may have introduced some glitches during import.
I'd much prefer to get test data in a reasonable data transfer format like CSV.
I don't see as anything has gone "wrong" as such. SQL is passed through an SQL compiler, it is processing your commands slightly different, as there is probably other assumptions in the other SQL compiler. If you write your SQL code such that it is explicit in its execution order then you should get the same result:
proc sql; create table ATROPOS.OPEN_DOC as select DOCUMENT_HANDLER, DOCUMENT_STATE, count(*) as NUMBER_OF_DOCUMENTS, min(R_CREATION_DATE) as OLDEST_DOCUMENT format=date7. from (select * from ATROPOS.DOCUMENTS where DOCUMENT_STATE in ("TODO","HOLD") and KIND_OF_DOCUMENT like '%AVE%') group by DOCUMENT_HANDLER, DOCUMENT_STATE; quit;
You will see from the above that I have moved the whole where clause bit into a sub-query. This means that that gets run first, and the results of that step get fed into the outer SQL which in turn generates your aggregates. Otherwise your up to the whim of the compiler as to when the where gets executed, as with a group by statement you don't use where, but having. So it is likely taking unexpected data, grouping and then where clausing rather than the other way round.
When you are missing group values, your data can't be correct. Just check the Excel if the missing document_handler values are there at all.
If a missing document_handler is there, look at the other criteria.
Since Excel is NOT suited for data transfer, you may have introduced some glitches during import.
I'd much prefer to get test data in a reasonable data transfer format like CSV.
As expected they proivded me with an updated Excel file and forgot to mention this. I ended up comparing apples with pears and you guys noticed that much quicker than me.
The query was indeed correct. Thanks for the feedback!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.