Solved: PROC SQL: Sum column totals over multiple tables

Elle · Posted 08-15-2016 11:48 AM

Hello,

I have an input dataset with 4 important variables: ID, A1, A2, A3.

After processing it, I've split it into 5 datasets: ERR1, ERR2, ERR3, MODEL, NEWINPUT, with the same inputs.

I need to validate the results so I'm trying to do a query to see if:

SUM(input.A1) is equal to SUM( SUM(ERR1.A1), SUM(ERR2.A1) ....., SUM(NEWINPUT.A1) )

Also I need to group it by ID.

My output should look like this:

ID        INPUT_A1             OUTPUT_A1                                  INPUT_A2  OUTPUT_A2  INPUT_A3  OUTPUT_A3          
1        SUM(input.A1)         SUM( SUM(ERR1.A1), SUM(ERR2.A1) .....)       ...        ...        ...       ...

My current code, which is not working, looks like this:

proc sql;
	select coalesce(a.Id, b.Id, c.Id, d.Id, e.Id, f.Id)  as ID,
		   sum(a.A1) as IN_A1, 
		   sum( sum(b.A1), sum(c.A1), sum(d.A1), sum(e.A1), sum(f.A1) ) as OUT_A1,
		   sum(a.A2) as IN_A2, 
		   sum( sum(b.A2), sum(c.A2), sum(d.A2), sum(e.A2), sum(f.A2) ) as OUT_A2,
		   sum(a.A3) as IN_A3,
		   sum( sum(b.A3), sum(c.A3), sum(d.A3), sum(e.A3), sum(f.A3) ) as OUT_A3
	from Input as a, 
		 err.err1 as b,
		 err.err2 as c,
		 err.err3 as d,
		 err.model as e,
		 err.newinput as f
	where a.Id=b.Id=c.Id=d.Id=e.Id=f.Id
	group by calculated ID;
quit;

Any help?

LinusH · Posted 08-17-2016 01:15 AM

Hi
I think it's better if you add information in a separate post (not thread!), it's easier to follow the dialogue.

You could create empty template data sets which you append your results to, ensuring you consistent layout.
Another option is to make this logic into a macro, will let you set parameters and macro logic can do checks on the input data.

But, why do you need this at all, doesn't sound like a one time thing. Don't you trust the logic in the preceeding steps?

Data never sleeps

View solution in original post

LinusH · Posted 08-15-2016 12:50 PM

What is not working...?

First, since you are using inner join, no need to do the coalesce() and the calculated thing.

Second, what does uniquely identify an observation in the tables?

If it isn't ID, you will end up with M-M in your result set.

If it,s ID, no need for group by.

Sample data always help input and output.

Data never sleeps

Elle · Posted 08-17-2016 04:38 AM

UPDATE1:

The INPUT dataset contains information about customer purchases.

ERR1 - is supposed to contain certain rows from the INPUT that fall under ERROR1 (EMPTY at the moment)

ERR2 - rows from INPUT that fall under ERROR2 (EMPTY at the moment)

ERR3 - rows from INPUT that fall under ERROR3 (Has about 10 rows)

MODEL - rows from INPUT that are VALID (Has the bulk of the data)

NEWINPUT - rows from INPUT that will be taken to the next time period where extra data is added and at one point it will become the new input file for the exact same analysis (contains about 5 rows at this point)

All of these tables are mutually exclusive and have the same structure so it would make sense to validate if the split is done correctly by simply verifying the sums of the columns for each variable:

VAR A: SUM(A) for INPUT = SUM (SUM(A) for ERR1, SUM(A) for ERR2, SUM(A) for ERR3, SUM(A) for MODEL, SUM(A) for NEWINPUT)

But all I get with my query is a table with the right header and missing values for each column

Hope it's more clear. Cannot share data as it's sensitive.

PS: ID is unique key

RW9 · Posted 08-16-2016 05:08 AM

Why split the dataset into 5 separate datasets in the first place? As @LinusH has mentioned, provide example test data (form of a datastep) and what the output should look like. I would imagine there are better techniques, using by group for instance.

Elle · Posted 08-17-2016 04:39 AM

UPDATE2:

I've identified the problem: SUM function doesn't operate on empty datasets. Something with its argument needing to be numeric.

Does anyone know a way around this? The 5 datasets are quite interactive and, depending on the INPUT data, each one of them can be empty at any time.

LinusH · Posted 08-17-2016 01:15 AM

Hi
I think it's better if you add information in a separate post (not thread!), it's easier to follow the dialogue.

You could create empty template data sets which you append your results to, ensuring you consistent layout.
Another option is to make this logic into a macro, will let you set parameters and macro logic can do checks on the input data.

But, why do you need this at all, doesn't sound like a one time thing. Don't you trust the logic in the preceeding steps?

Data never sleeps

Elle · Posted 08-17-2016 04:43 AM

The 5 tables are delivered to my supervisor that wants to validate my work.

I'm validating it beforehand to make sure he doesn't get any strange results.

Thank you for the input. I think I'll use macro variables to check the sums.

PROC SQL: Sum column totals over multiple tables

Re: PROC SQL: Sum column totals over multiple tables

Re: PROC SQL: Sum column totals over multiple tables

Re: PROC SQL: Sum column totals over multiple tables

Re: PROC SQL: Sum column totals over multiple tables

Re: PROC SQL: Sum column totals over multiple tables

Re: PROC SQL: Sum column totals over multiple tables

Re: PROC SQL: Sum column totals over multiple tables

Registration is open

SAS Training: Just a Click Away