Yves_Boonen
Quartz | Level 8

I want to select all our documents within a date range.

 

I import a CSV file:

(I already changed the dates in the CSV file to the ddMMMyyyy format, e.g. 6APR2016.)

PROC IMPORT DATAFILE="O:\...\documents.csv"
			DBMS=csv replace
			OUT=atropos.ave_docs;
			GETNAMES=yes;
			DATAROW=2;
			DELIMITER=';';
RUN;

 

Next up I start making a query:

PROC SQL;
CREATE TABLE atropos.ave_rap AS
SELECT document_handler, document_state, COUNT(*) AS amount_of_documents, MIN(r_creation_date) AS oldest_document FORMAT=DATE9.
FROM atropos.ave_docs
WHERE document_state IN ('HOLD','TODO')
AND kind_of_document LIKE '%AVE%'
AND r_creation_date BETWEEN '29FEB2016'd AND '01APR2016'd
GROUP BY document_handler, document_state;
QUIT;

 

This is the log:

1152  PROC SQL;
1153  CREATE TABLE atropos.ave_rap AS
1154  SELECT document_handler, document_state, COUNT(*) AS amount_of_documents,
1154! MIN(r_creation_date) AS oldest_document FORMAT=DDMMYY10.
1155  FROM atropos.ave_docs
1156  WHERE document_state IN ('HOLD','TODO')
1157  AND kind_of_document LIKE '%AVE%'
1158  AND r_creation_date BETWEEN '29FEB2016'd AND '01APR2016'd
1159  GROUP BY document_handler, document_state;
NOTE: Table ATROPOS.AVE_RAP created, with 9 rows and 4 columns.

1160  QUIT;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.13 seconds
      cpu time            0.12 seconds


 

Knowing that the CSV file has 696,911 rows, it seems very unlikely that the query would return only 9 rows.

 

I need all the documents created in March 2016. In other words, the r_creation_date has to be greater than 29FEB2016 and less than 01APR2016. I have tried different approaches, like BETWEEN, DATE, FORMAT, %LET, SET,... I even changed all my dates to the default SAS format ddMMMyyyy and it still doesn't work. I can't figure this one out, simply because there is no error in the log. Do I need to define the format of the column in the import? Any ideas?

9 REPLIES
LinusH
Tourmaline | Level 20

Check your data.

It would be quite simple to select those dates in Excel.

Open the intermediate data sets in SAS and see if your date values have been imported properly.
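
For instance, something along these lines would show whether R_CREATION_DATE came in as a numeric SAS date (just a sketch; dataset and variable names taken from the code in the original post):

proc contents data=atropos.ave_docs;
run;

proc print data=atropos.ave_docs (obs=10);
  var r_creation_date document_state kind_of_document;
run;

If PROC CONTENTS shows R_CREATION_DATE as a character variable rather than a numeric variable with a date format, the comparison against date literals cannot work as intended.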

Data never sleeps
jklaverstijn
Rhodochrosite | Level 12

I suggest you peel down the query to determine what part of the WHERE is causing this. Most likely the dates are at play. A sample of your data would help us help you.

 

Something to check is how the dates are imported from the CSV. Many things can go wrong here. More likely there is a problem with the INFORMAT than the FORMAT of your columns. Examine your table; run a PROC FREQ and/or MEANS on it. This will help you understand the cardinalities of your columns.
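
For example, a quick sketch (names taken from your query):

proc freq data=atropos.ave_docs;
  tables document_state kind_of_document / missing;
run;

proc means data=atropos.ave_docs n nmiss min max;
  var r_creation_date;
run;

If PROC MEANS complains that r_creation_date is not numeric, the dates were read as text, and the informat used during import is the likely culprit.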

 

Regards,

- Jan.

Reeza
Super User

You also have a GROUP BY, so the result may well be only 9 rows after the summary functions are applied.

 

Test your code by including only the date filter, then add the other filters back one by one until you find the issue.
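
Something like this, for instance (a sketch with the date filter on its own):

proc sql;
  select count(*) as march_rows
  from atropos.ave_docs
  where r_creation_date between '29feb2016'd and '01apr2016'd;
quit;

If that count is already tiny, the date comparison is the problem; if it is large, add the next filter and re-run.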

RW9
Diamond | Level 26

Hi,

 

I think I mentioned in the other post on this: put your restriction on the data within a sub-query (also note the consistent casing and indentation):

proc sql;
  create table ATROPOS.AVE_RAP as
  select  DOCUMENT_HANDLER,
          DOCUMENT_STATE,
          count(*) as AMOUNT_OF_DOCUMENTS,
          min(R_CREATION_DATE) as OLDEST_DOCUMENT format=date9.
  from    (
          select  *
          from    ATROPOS.AVE_DOCS
          where   DOCUMENT_STATE in ('HOLD','TODO')
            and   KIND_OF_DOCUMENT like '%AVE%'
            and   R_CREATION_DATE between '29feb2016'd and '01apr2016'd
          )
  group by DOCUMENT_HANDLER, DOCUMENT_STATE;
quit;

It is then quite clear what data is feeding into the outer query, and you can copy the part between the brackets into another proc sql to see what data is being fed in - very useful for debugging, and clearer to read.  Once you know what data is going into the outer query, you will be able to QC whether the grouping is indeed what you think it should be.  I would guess that from the inner query you end up with 9 distinct values of document_handler + document_state, which you can test - as I said above, by copying out the inner query to:

proc sql;
  create table TMP as
  select  distinct DOCUMENT_HANDLER,
          DOCUMENT_STATE
  from    ATROPOS.AVE_DOCS
  where   DOCUMENT_STATE in ('HOLD','TODO')
    and   KIND_OF_DOCUMENT like '%AVE%'
    and   R_CREATION_DATE BETWEEN '29feb2016'd and '01apr2016'd;
quit;
Kurt_Bremser
Super User (Accepted Solution)

The fine thing about data steps is that the log is much more verbose.

So I'd start with

data test1;
set atropos.ave_docs (
  keep=document_handler document_state r_creation_date
  kind_of_document
);
if document_state in ('HOLD','TODO');
run;

By using the subsetting IF instead of a where condition, I get the total number of observations in the source dataset and the number of observations in the output dataset written to the log. This may provide a clue about which condition removes too many observations.

Now I'd refine further

data test2;
set test1;
if index(kind_of_document,'AVE') > 0;
run;

and finally

data test3;
set test2;
if '29feb2016'd <= r_creation_date <= '01apr2016'd;
run;

Now you can run PROC FREQ on all the intermediate datasets to verify where the document_handler/document_state combinations are lost.
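
For example (a sketch; the same crossing on the first and last intermediate dataset):

proc freq data=test1;
  tables document_handler*document_state / list missing;
run;

proc freq data=test3;
  tables document_handler*document_state / list missing;
run;

Comparing the two outputs shows which combinations disappear along the way.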

Yves_Boonen
Quartz | Level 8

Thanks for the assistance. I used your script to pinpoint the issue, and it turned out to be a data issue. I am new to this program, so I easily doubt myself. Maybe I should start doubting the people who give me the data sheets. 🙂

 

Thanks! Much appreciated.

RW9
Diamond | Level 26

I have a simple rule of thumb - and one piece of advice.  

Rule of thumb: if someone gives you Excel files and can't supply a proper data transfer format (e.g. CSV, XML, datasets, XPT, a database file, etc.), then don't trust the files for a minute.  Excel is a very poor medium for data (though very good as a user interface), as it has a lot of hidden "features" that make a programmer's life difficult, and it is also unstructured.

 

Advice: for any data transfer, always set up a data transfer agreement and have it signed by both parties.  This agreement details what you will get, the structure, etc.  From that, your data step import writes itself, and any deviation from the agreement is immediately highlighted and blame can be assigned; so if it takes you a day to re-write the program to fit data that doesn't match the agreement, it is not your fault.  With any transaction there has to be description, action and acceptance, and too often the programmer is just left with the acceptance.
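
As a minimal sketch of such a data step import for the CSV in the original post (the column order, lengths and informat here are assumptions; in a real project they would come straight from the transfer agreement):

data atropos.ave_docs;
  infile "O:\...\documents.csv" dlm=';' dsd firstobs=2 truncover;
  length document_handler $40 document_state $10 kind_of_document $40;
  informat r_creation_date date9.;  /* the ddMMMyyyy values from the post, e.g. 6APR2016 */
  format r_creation_date date9.;
  input document_handler $ document_state $ kind_of_document $ r_creation_date;
run;

Reading the file this way makes the informat explicit, so a value that doesn't match it produces an "invalid data" note and a missing value in the log instead of silently being imported as something else.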

Kurt_Bremser
Super User

Anytime (especially during development/testing) someone hands you data, have them also hand you the query code they used, the date and time the query was run, and the parameters, if the query has any. Otherwise you won't be able to run a valid compare against production data.

 

I, for instance, keep all the logs of all the batch runs (now for more than 15 years) online or in a tape archive. The global init include that is used in all batch programs has a part that writes all parameters (excluding sensitive info like passwords) to the log. So if the question arises "how and when was that particular dataset built", I always have a clear answer.
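
Not the actual include, of course, but the core of the idea can be as simple as this sketch:

%put NOTE: batch job &sysjobid started on &sysdate9. at &systime.;
%put _user_;

%PUT _USER_ dumps all user-defined macro variables (i.e. the run's parameters) into the log; the real include would of course filter out anything sensitive first.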

Yves_Boonen
Quartz | Level 8

Yeah, I have noticed the struggle in the past few days. A co-worker from another department, who has access to the database I'm supposed to analyze, creates these Excel sheets without giving me any additional information. Without any access to said database, I have very little but the data to go on. If I base my analysis on an Excel sheet filled with faulty data, my report will also be faulty. I have learned that the hard way.

 

Anyway, thanks for sharing your ideas. I will write an e-mail to my sheet provider and ask him to provide the query and the parameters.

