Danglytics
Calcite | Level 5
Hi,

Every month I receive a data dump of indicators and counts by id.
The tables are loaded in the following manner:
src.indicators_20110131
src.indicators_20110228
src.indicators_20110331
etc.
The latest month should always have more records than the previous one.

The tables look like this:
id count1 ind1 count2 ind2
1 3 1 0 0
2 0 0 2 1
..
Essentially, if the count exists the indicator is 1, else 0.
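For illustration, that indicator rule can be sketched in a short data step (variable names are taken from the sample layout above; a sketch, not the actual load code):

```sas
/* Sketch of the stated rule: indicator is 1 when the count exists, else 0 */
data sample;
  input id count1 count2;
  ind1 = (count1 > 0);
  ind2 = (count2 > 0);
  datalines;
1 3 0
2 0 2
;
run;
```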

What I want to do is create a checking program that looks at the % change in the number of records from the previous month.

My current logic/checking program is quite manual, and I hope to alter it to be more automatic with some sort of iterative processing.

My program now looks like:
%let month1 = 20110131;
%let month2 = 20110228;
etc.

proc sql;
create table src.check_&month1 as
select count(*) as numrows_&month1
from src.indicators_&month1;
quit;

The code above is repeated for every month's data, then a data step merges the check tables.

I haven't completed the calculation step yet.

This was just a first stab at it, and it is very manual (every month I would have to add a PROC SQL step above).

If anyone has any suggestions that would be great.
Thanks for your help!
11 REPLIES
Reeza
Super User
Do the old files change with each data dump?

If not, you can write a macro to just get the count from the latest file... if they do, then you'll want to do it a different way.

What type of files are you dealing with, i.e. text or SAS files?

You may want to look at PROC DATASETS/CONTENTS with the lib._all_ keyword.

It also looks like your files are labelled with the last day of the month, so you could write a macro around that date to loop through the files.
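That looping idea could be sketched like this (the date range, the counts table, and the macro name are assumptions for illustration; untested):

```sas
/* Empty table to hold one row per monthly file */
proc sql;
  create table work.counts (memname char(32), nobs num);
quit;

%macro count_all(start=31jan2011, end=30apr2011);
  %local d dt;
  %let d = %sysfunc(inputn(&start, date9.));
  %do %while (&d <= %sysfunc(inputn(&end, date9.)));
    %let dt = %sysfunc(putn(&d, yymmddn8.)); /* e.g. 20110131 */
    proc sql;
      insert into work.counts
      select "INDICATORS_&dt", count(*)
      from src.indicators_&dt;
    quit;
    %let d = %sysfunc(intnx(month, &d, 1, end)); /* last day of next month */
  %end;
%mend count_all;

%count_all()
```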

A simple way, if everything is SAS, is to look at the dictionary.tables table:

proc sql;
create table counts as
select libname, memname, nobs, nlobs
from dictionary.tables
where libname='SRC';
quit;
Danglytics
Calcite | Level 5
Old files do not change.

All files are SAS files.

I'm not sure if dictionary.tables will work, as there are many other tables within that library.
Reeza
Super User
You could expand the code to include the name of the file in the WHERE clause, i.e. where libname='SRC' and memname like 'INDICATORS%';
Probably the easiest and most efficient solution.

If you have multiple files named indicators_something, then you could restrict with the year, or you could manually create the dataset the first time and automate the update procedure...

Create a macro variable that contains the last day of the month, and then update the dataset each time you have a new file. Then run your checks on that file, which should be the same every time, I'm guessing.

HTH,
Reeza
May need to tweak for case sensitivity.

%let ddate=20110331;

proc sql;
insert into (dataset)
select count(*), &ddate
from src.indicators_&ddate;
quit;

OR

proc sql;
insert into (dataset)
select memname, nobs, nlobs
from dictionary.tables
where libname='SRC' and memname="INDICATORS_&ddate";
quit;
Peter_C
Rhodochrosite | Level 12
Danglytics

Reeza has provided the foundation of your most efficient solution.
Having saved the results from last month, you would just need to add the result for the new month.
There are very few items of information required of each new supply (data set name, date received, number of rows). There is no need to keep adding columns, just add rows:
proc sql noprint ;
select memname, modate format= dtdate9., nobs
into :latest_mem, :latestdt, :nobs_new
from dictionary.tables
where libname='SRC'
and memname like 'INDICATORS_2%'
having memname = max(memname)
;
quit ;
* that collects the details for the latest SAS table in the SRC library named as required ;

data collection ;
modify collection ;
set collection point= nobs nobs=nobs ;
latest_name = data_set_name ;
if latest_name = "&latest_mem" then stop ; * new is already on file;
latest_size = number_of_rows ;
date_received = "&latestdt"d ;
data_set_name = "&latest_mem" ;
number_of_rows = &nobs_new ;
pct_change = &nobs_new / latest_size -1 ;
format pct_change percent8.1 ;
output ;
stop ;
run ;
* what that step does is :
open the last row, to collect the latest file size and name,
if there is no change of name, stop processing,
otherwise
set the values of the basic columns (name, date, nobs) to the values for the new file supplied,
derive the percentage change,
add the new row to the collection;

You need to have "collection" in a permanent library.
It is updated "in-place" by adding a new row each time you have new data.
You will need to take occasional backups!

(although this code is untested it should be approximately correct)
PeterC. OK, just a data step version of what Reeza posted while I was writing that.
Message was edited by: Peter.C
Danglytics
Calcite | Level 5
Thank you both for your help.

I see the logic behind using modate as the identifier for the file date, but unfortunately the date does not match.
For example, my 20110430 file has a date modified of 25May2011.
ballardw
Super User
> I see the logic behind using modate as the identifier
> for the file date, but unfortunately the date does
> not match.
> for example my 20110430 file has date modified of
> 25May2011

Then parse the MEMNAME for the date string. Or, if the modified date is NEVER EVER more than one month later, then use date functions to find the last day of the preceding month.
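Parsing the MEMNAME for the date could look like this (a sketch, using a hard-coded name for illustration):

```sas
/* Pull the trailing date string off the member name and make it a SAS date */
data _null_;
  memname = 'INDICATORS_20110430';
  datestr = scan(memname, -1, '_');      /* '20110430' */
  filedate = input(datestr, yymmdd8.);   /* SAS date value */
  put filedate= date9.;                  /* FILEDATE=30APR2011 */
run;
```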

Since you say these are existing SAS data sets, an option involving SET with the INDSNAME= option, to create a variable with the name of the dataset contributing each observation, might also be in order.
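A minimal sketch of that INDSNAME= idea (table names taken from the thread; untested):

```sas
/* Stack the monthly tables, keeping each row's source table name;
   the INDSNAME= variable receives libname.memname of the contributing dataset */
data stacked;
  length dsname source $41; /* libname.memname can be up to 41 characters */
  set src.indicators_20110131
      src.indicators_20110228 indsname=dsname;
  source = dsname; /* copy it: the INDSNAME= variable is dropped automatically */
run;

/* Row counts per source table, ready for month-on-month comparison */
proc freq data=stacked;
  tables source / out=counts(keep=source count);
run;
```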
Peter_C
Rhodochrosite | Level 12
The suggestion from ballardw is a straightforward adaptation of the code I suggested earlier:

proc sql noprint ;
select memname, scan( memname,-1, '_' ), nobs
into :latest_mem, :latestdt, :nobs_new
from dictionary.tables
where libname='SRC'
and memname like 'INDICATORS_2%'
having memname = max(memname)
;
quit ;
* that collects the details for the latest SAS table in the SRC library named as required ;

data collection ;
modify collection ;
set collection point= nobs nobs=nobs ;
latest_name = data_set_name ;
if latest_name = "&latest_mem" then stop ; * new is already on file;
latest_size= number_of_rows ;
date_received = "&latestdt" ;
data_set_name = "&latest_mem" ;
number_of_rows = &nobs_new ;
pct_change = &nobs_new / latest_size -1 ;
output ;
stop ;
run ;

Almost nothing else needs to change, because the date string has order.
The remaining change is in the modify step:
date_received = "&latestdt"d ;
no longer assigns a date constant but a string:
date_received = "&latestdt" ;

I established an initial "collection" with
data collection ;
length data_set_name $32 number_of_rows 8 date_received $10 pct_change 8 ;
retain data_set_name ' ' number_of_rows 1 date_received ' ' pct_change . ;
format pct_change percent8.1 ;
run ;

after which it was easy to prepare, one after another, tables like src.indicators_20110331, src.indicators_20110430 and src.indicators_20110531,
which appear in this "collection":[pre]
+FSVIEW: WORK.COLLECTION (B)-----------------------------------------------------------+
| Obs  data_set_name         number_of_rows  date_received  pct_change                  |
|                                                                                       |
|  1                                      1                          .                  |
|  2   INDICATORS_20110331                2  20110331           100.0%                  |
|  3   INDICATORS_20110430                3  20110430            50.0%                  |
|  4   INDICATORS_20110531                3  20110531             0.0%                  |
|                                                                                       |
+---------------------------------------------------------------------------------------+[/pre]
Danglytics
Calcite | Level 5

After making the adjustments to fit my actual datasets, the code ran without problem, but when I look at the 'collection' table I get the initial record, which is blank except for number_of_rows = 1,

and a second record, which is the information of my last data set (20110430).

I can't seem to figure out why it is not outputting the datasets in between.
I should have 7 datasets in there, from 20101031 to 20110430.

Thanks again for your replies.

Danglytics
Calcite | Level 5

I've triple-checked the code, and I'm still not sure why it only outputs the first and last record.

Peter_C
Rhodochrosite | Level 12

It will add new entries, and only the latest.

That seemed to be the intention going forward: add rows only for new data.

Once the initial load of history is complete, the routine I offered would be appropriate.

Perhaps instead of loading the initial row of Collection with just that 1, I should have offered code to load the existing history, a bit like:

proc sql ; create table collect_history as
select memname as data_set_name, nobs as number_of_rows
, scan( memname,-1, '_' ) as date_received
from dictionary.tables
where libname='SRC'
and memname like 'INDICATORS_2%'
order by memname
;
quit ;

* build "proper" collection ;

data collection ;
oldone = number_of_rows ; * value retained from the row read by SET on the previous iteration ;
set collect_history ;
format pct_change percent8.1 ;
pct_change = ifn( oldone > 0, number_of_rows / oldone - 1, . ) ;
drop oldone ;
run ;
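Since the original requirement was that each month should have more records than the last, a small follow-up check on this collection could flag any month that shrank or stayed flat (a sketch; untested):

```sas
/* Report months whose row count failed to grow versus the prior month */
proc print data=collection noobs;
  where pct_change <= 0;
  var data_set_name date_received number_of_rows pct_change;
  title 'Monthly loads that did not increase in size';
run;
```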

peterC

(discovering the "investment" needed for "improvements" still to be discovered in the Forum - I might have to start a thread seeking some explanations! )

Danglytics
Calcite | Level 5

Thank you for your help! This solution worked perfectly.


Discussion stats
  • 11 replies
  • 2010 views
  • 1 like
  • 4 in conversation