Solved: Re: The query requires remerging summary statistics back with the orig...

JPARK · Posted 11-22-2014 05:43 PM

When I run the proc sql sas code below, my program keep running and it does not stop

In note, it says the query requires remerging summary statistics back with the original data.

Does anybody know why it keep doing this?

proc sql;

create table comp3 as select a.*, abs(a.roa-b.roa) as diff_roa, (a.r-b.r) as pamjone

from DAPAMJ_1 as a left join DAPAMJ_1 as b

on a.sic2=b.sic2 and a.fyear=b.fyear and a.gvkey^=b.gvkey

group by a.gvkey, a.fyear, a.sic2

having abs(a.roa-b.roa)=min(abs(a.roa-b.roa));

quit;

Reeza · Posted 11-22-2014 05:47 PM

Because you've selected a.* in your select statement.

In standard SQL you can only have the group by variables plus the aggregated values, but SAS SQL is different and allows you include other variables and remerges the data together.

View solution in original post

Reeza · Posted 11-22-2014 05:47 PM

Because you've selected a.* in your select statement.

In standard SQL you can only have the group by variables plus the aggregated values, but SAS SQL is different and allows you include other variables and remerges the data together.

JPARK · Posted 11-22-2014 06:00 PM

oh I see~ thank you!

JPARK · Posted 11-22-2014 06:15 PM

It still run forever. I changed my code like below. are there still a problem in my code?

proc sql;

create table comp3 as select
a.gvkey, a.sic2, a.fyear, a.roa, a.r, abs(a.roa-b.roa) as diff_roa, (a.r-b.r) as pamjone

from DAPAMJ_1 as
a left join DAPAMJ_1 as
b on a.sic2=b.sic2 and
a.fyear=b.fyear and a.gvkey^=b.gvkey

group by
a.gvkey, a.fyear, a.sic2

having abs(a.roa-b.roa)=min(abs(a.roa-b.roa));

quit;

Reeza · Posted 11-22-2014 09:56 PM

How big is your data?

Try it in steps, first join without having clause and see if it runs.

Haikuo · Posted 11-22-2014 10:09 PM

Looks like a deja vu to me and I strongly concur with Reeza. There are two possible places that may cause long time running:

1. Reeze pointed out: the having clause.

2. The unequal look up: a.gvkey^=b.gvkey

Break them into multiple steps to see how it goes. If it is still unbearable slow, then we need to know more specs about your tables, a Hash look up may be available.

Haikuo

PGStats · Posted 11-22-2014 11:15 PM

You could also try the equivalent operation (as far as I can tell) as a datastep using random access to your sorted data :

proc sort data=comp3; by sic2 fyear roa; run;

data comp3;

set DAPAMJ_1; by sic2 fyear;

p = _n_ - 1;

if not first.fyear then

set DAPAMJ_1(keep=gvkey roa rename=(gvkey=pgvkey roa=proa)) point=p;

else call missing(pgvkey, proa);

n = _n_ + 1;

if not last.fyear then

set DAPAMJ_1(keep=gvkey roa rename=(gvkey=ngvkey roa=nroa)) point=n;

else call missing(ngvkey, nroa);

if n(ngvkey, pgvkey) = 2 then

if abs(roa-proa) < abs(roa-nroa) then do;

_gvkey = pgvkey;

_roa = proa;

end;

else do;

_gvkey = ngvkey;

_roa = nroa;

end;

else do;

_gvkey = coalesce(ngvkey, pgvkey);

_roa = coalesce(nroa, proa);

end;

if not missing(_roa) then diff_roa = abs(roa-_roa);

drop ngvkey pgvkey nroa proa;

run;

PG

Tom · Posted 11-23-2014 09:47 AM

You are still including THREE non group by variables.

, a.roa

, a.r

, (a.r-b.r) as pamjone

How many observations in each of your input datasets.

JPARK · Posted 11-25-2014 02:57 PM

Thank you~

Should I put those variables in group by?

I have 48,229 observations

PGStats · Posted 11-25-2014 03:59 PM

48229 is a moderate size dataset... it shouldn't take forever to process. One problem might come from the combination of the left join and the ABS function. When there is no record on the right side of the join, i.e. when only one gvkey is present for a given sic2 and fyear, b.roa is missing and computing ABS(a.roa-b.roa) generates a note to the LOG. You could try avoiding those with the changes:

proc sql;

create table comp3 as select
a.gvkey, a.sic2, a.fyear, a.roa, a.r,

case b.roa when . then . else abs(a.roa-b.roa) end as diff_roa,

(a.r-b.r) as pamjone

from DAPAMJ_1 as
a left join DAPAMJ_1 as
b on a.sic2=b.sic2 and
a.fyear=b.fyear and a.gvkey^=b.gvkey

group by
a.gvkey, a.fyear, a.sic2

having (a.roa-b.roa)**2 = min((a.roa-b.roa)**2);

quit;

PG

The query requires remerging summary statistics back with the original data.

Re: The query requires remerging summary statistics back with the original data.

Re: The query requires remerging summary statistics back with the original data.

Re: The query requires remerging summary statistics back with the original data.

Re: The query requires remerging summary statistics back with the original data.

Re: The query requires remerging summary statistics back with the original data.

Re: The query requires remerging summary statistics back with the original data.

Re: The query requires remerging summary statistics back with the original data.

Re: The query requires remerging summary statistics back with the original data.

Re: The query requires remerging summary statistics back with the original data.

Re: The query requires remerging summary statistics back with the original data.

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away