08-15-2016 11:48 AM - edited 08-17-2016 04:38 AM
I have an input dataset with 4 important variables: ID, A1, A2, A3.
After processing it, I've split it into 5 datasets: ERR1, ERR2, ERR3, MODEL, NEWINPUT, with the same inputs.
I need to validate the results so I'm trying to do a query to see if:
SUM(input.A1) is equal to SUM( SUM(ERR1.A1), SUM(ERR2.A1) ....., SUM(NEWINPUT.A1) )
Also I need to group it by ID.
My output should look like this:
ID INPUT_A1 OUTPUT_A1 INPUT_A2 OUTPUT_A2 INPUT_A3 OUTPUT_A3 1 SUM(input.A1) SUM( SUM(ERR1.A1), SUM(ERR2.A1) .....) ... ... ... ...
My current code, which is not working, looks like this:
proc sql; select coalesce(a.Id, b.Id, c.Id, d.Id, e.Id, f.Id) as ID, sum(a.A1) as IN_A1, sum( sum(b.A1), sum(c.A1), sum(d.A1), sum(e.A1), sum(f.A1) ) as OUT_A1, sum(a.A2) as IN_A2, sum( sum(b.A2), sum(c.A2), sum(d.A2), sum(e.A2), sum(f.A2) ) as OUT_A2, sum(a.A3) as IN_A3, sum( sum(b.A3), sum(c.A3), sum(d.A3), sum(e.A3), sum(f.A3) ) as OUT_A3 from Input as a, err.err1 as b, err.err2 as c, err.err3 as d, err.model as e, err.newinput as f where a.Id=b.Id=c.Id=d.Id=e.Id=f.Id group by calculated ID; quit;
08-17-2016 01:15 AM
08-15-2016 12:50 PM
What is not working...?
First, since you are using inner join, no need to do the coalesce() and the calculated thing.
Second, what does uniquely identify an observation in the tables?
If it isn't ID, you will end up with M-M in your result set.
If it,s ID, no need for group by.
Sample data always help input and output.
08-17-2016 04:38 AM
The INPUT dataset contains information about customer purchases.
ERR1 - is supposed to contain certain rows from the INPUT that fall under ERROR1 (EMPTY at the moment)
ERR2 - rows from INPUT that fall under ERROR2 (EMPTY at the moment)
ERR3 - rows from INPUT that fall under ERROR3 (Has about 10 rows)
MODEL - rows from INPUT that are VALID (Has the bulk of the data)
NEWINPUT - rows from INPUT that will be taken to the next time period where extra data is added and at one point it will become the new input file for the exact same analysis (contains about 5 rows at this point)
All of these tables are mutually exclusive and have the same structure so it would make sense to validate if the split is done correctly by simply verifying the sums of the columns for each variable:
VAR A: SUM(A) for INPUT = SUM (SUM(A) for ERR1, SUM(A) for ERR2, SUM(A) for ERR3, SUM(A) for MODEL, SUM(A) for NEWINPUT)
But all I get with my query is a table with the right header and missing values for each column
Hope it's more clear. Cannot share data as it's sensitive.
PS: ID is unique key
08-16-2016 05:08 AM
Why split the dataset into 5 separate datasets in the first place? As @LinusH has mentioned, provide example test data (form of a datastep) and what the output should look like. I would imagine there are better techniques, using by group for instance.
08-17-2016 04:39 AM
I've identified the problem: SUM function doesn't operate on empty datasets. Something with its argument needing to be numeric.
Does anyone know a way around this? The 5 datasets are quite interactive and, depending on the INPUT data, each one of them can be empty at any time.
08-17-2016 01:15 AM
08-17-2016 04:43 AM
The 5 tables are delivered to my supervisor that wants to validate my work.
I'm validating it beforehand to make sure he doesn't get any strange results.
Thank you for the input. I think I'll use macro variables to check the sums.
Need further help from the community? Please ask a new question.