The query requires remerging summary statistics back with the original data.

Accepted Solution Solved
Reply
Contributor
Posts: 23
Accepted Solution

The query requires remerging summary statistics back with the original data.

When I run the proc sql sas code below, my program keep running and it does not stop

In note, it says  the query requires remerging summary statistics back with the original data.

Does anybody know why it keep doing this?

proc sql;

create table comp3 as select a.*, abs(a.roa-b.roa) as diff_roa, (a.r-b.r) as pamjone

from DAPAMJ_1 as a left join DAPAMJ_1 as b

on a.sic2=b.sic2 and a.fyear=b.fyear and a.gvkey^=b.gvkey

group by a.gvkey, a.fyear, a.sic2

having abs(a.roa-b.roa)=min(abs(a.roa-b.roa));

quit;


Accepted Solutions
Solution
‎11-22-2014 05:47 PM
Super User
Posts: 19,157

Re: The query requires remerging summary statistics back with the original data.

Because you've selected a.* in your select statement.

In standard SQL you can only have the group by variables plus the aggregated values, but SAS SQL is different and allows you include other variables and remerges the data together. 

View solution in original post


All Replies
Solution
‎11-22-2014 05:47 PM
Super User
Posts: 19,157

Re: The query requires remerging summary statistics back with the original data.

Because you've selected a.* in your select statement.

In standard SQL you can only have the group by variables plus the aggregated values, but SAS SQL is different and allows you include other variables and remerges the data together. 

Contributor
Posts: 23

Re: The query requires remerging summary statistics back with the original data.

oh I see~ thank you!

Contributor
Posts: 23

Re: The query requires remerging summary statistics back with the original data.

It still run forever. I changed my code like below. are there still a problem in my code?

proc sql;

create table comp3 as select
a.gvkey, a.sic2, a.fyear, a.roa, a.r, abs(a.roa-b.roa)
as diff_roa, (a.r-b.r) as pamjone

from DAPAMJ_1 as
a
left join DAPAMJ_1 as
b on a.sic2=b.sic2 and
a.fyear=b.fyear
and a.gvkey^=b.gvkey

group by
a.gvkey, a.fyear, a.sic2

having abs(a.roa-b.roa)=min(abs(a.roa-b.roa));

quit;

Super User
Posts: 19,157

Re: The query requires remerging summary statistics back with the original data.

How big is your data? 

Try it in steps, first join without having clause and see if it runs.

Respected Advisor
Posts: 3,156

Re: The query requires remerging summary statistics back with the original data.

Looks like a deja vu to me and I strongly concur with Reeza.  There are two possible places that may cause long time running:

1. Reeze pointed out: the having clause.

2. The unequal look up: a.gvkey^=b.gvkey

Break them into multiple steps to see how it goes. If it is still unbearable slow, then we need to know more specs about your tables, a Hash look up may be available.

Haikuo

Respected Advisor
Posts: 4,817

Re: The query requires remerging summary statistics back with the original data.

You could also try the equivalent operation (as far as I can tell) as a datastep using random access to your sorted data :

proc sort data=comp3; by sic2 fyear roa; run;

data comp3;

set DAPAMJ_1; by sic2 fyear;

p = _n_ - 1;

if not first.fyear then

  set DAPAMJ_1(keep=gvkey roa rename=(gvkey=pgvkey roa=proa)) point=p;

else call missing(pgvkey, proa);

n = _n_ + 1;

if not last.fyear then

  set DAPAMJ_1(keep=gvkey roa rename=(gvkey=ngvkey roa=nroa)) point=n;

else call missing(ngvkey, nroa);

if n(ngvkey, pgvkey) = 2 then

    if abs(roa-proa) < abs(roa-nroa) then do;

        _gvkey = pgvkey;

        _roa = proa;

        end;

    else do;

        _gvkey = ngvkey;

        _roa = nroa;

        end;

else do;

    _gvkey = coalesce(ngvkey, pgvkey);

    _roa = coalesce(nroa, proa);

    end;

if not missing(_roa) then diff_roa = abs(roa-_roa);

drop ngvkey pgvkey nroa proa;

run;

PG

PG
Super User
Super User
Posts: 6,845

Re: The query requires remerging summary statistics back with the original data.

You are still including THREE non group by variables.

, a.roa

, a.r

, (a.r-b.r) as pamjone

How many observations in each of your input datasets.

Contributor
Posts: 23

Re: The query requires remerging summary statistics back with the original data.

Thank you~

Should I put those variables in group by?

I have 48,229 observations

Respected Advisor
Posts: 4,817

Re: The query requires remerging summary statistics back with the original data.

48229 is a moderate size dataset... it shouldn't take forever to process. One problem might come from the combination of the left join and the ABS function. When there is no record on the right side of the join, i.e. when only one gvkey is present for a given sic2 and fyear, b.roa is missing and computing ABS(a.roa-b.roa) generates a note to the LOG. You could try avoiding those with the changes:

proc sql;

create table comp3 as select
a.gvkey, a.sic2, a.fyear, a.roa, a.r,

case b.roa when . then . else abs(a.roa-b.roa) end as diff_roa,

(a.r-b.r) as pamjone

from DAPAMJ_1 as
a
left join DAPAMJ_1 as
b on a.sic2=b.sic2 and
a.fyear=b.fyear
and a.gvkey^=b.gvkey

group by
a.gvkey, a.fyear, a.sic2

having (a.roa-b.roa)**2 = min((a.roa-b.roa)**2);

quit;

PG

PG
🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 9 replies
  • 2223 views
  • 3 likes
  • 5 in conversation