When I run the proc sql sas code below, my program keep running and it does not stop
In note, it says the query requires remerging summary statistics back with the original data.
Does anybody know why it keep doing this?
proc sql;
create table comp3 as select a.*, abs(a.roa-b.roa) as diff_roa, (a.r-b.r) as pamjone
from DAPAMJ_1 as a left join DAPAMJ_1 as b
on a.sic2=b.sic2 and a.fyear=b.fyear and a.gvkey^=b.gvkey
group by a.gvkey, a.fyear, a.sic2
having abs(a.roa-b.roa)=min(abs(a.roa-b.roa));
quit;
Because you've selected a.* in your select statement.
In standard SQL you can only have the group by variables plus the aggregated values, but SAS SQL is different and allows you include other variables and remerges the data together.
Because you've selected a.* in your select statement.
In standard SQL you can only have the group by variables plus the aggregated values, but SAS SQL is different and allows you include other variables and remerges the data together.
oh I see~ thank you!
It still run forever. I changed my code like below. are there still a problem in my code?
proc sql;
create table comp3 as select
a.gvkey, a.sic2, a.fyear, a.roa, a.r, abs(a.roa-b.roa) as diff_roa, (a.r-b.r) as pamjone
from DAPAMJ_1 as
a left join DAPAMJ_1 as
b on a.sic2=b.sic2 and
a.fyear=b.fyear and a.gvkey^=b.gvkey
group by
a.gvkey, a.fyear, a.sic2
having abs(a.roa-b.roa)=min(abs(a.roa-b.roa));
quit;
How big is your data?
Try it in steps, first join without having clause and see if it runs.
Looks like a deja vu to me and I strongly concur with Reeza. There are two possible places that may cause long time running:
1. Reeze pointed out: the having clause.
2. The unequal look up: a.gvkey^=b.gvkey
Break them into multiple steps to see how it goes. If it is still unbearable slow, then we need to know more specs about your tables, a Hash look up may be available.
Haikuo
You could also try the equivalent operation (as far as I can tell) as a datastep using random access to your sorted data :
proc sort data=comp3; by sic2 fyear roa; run;
data comp3;
set DAPAMJ_1; by sic2 fyear;
p = _n_ - 1;
if not first.fyear then
set DAPAMJ_1(keep=gvkey roa rename=(gvkey=pgvkey roa=proa)) point=p;
else call missing(pgvkey, proa);
n = _n_ + 1;
if not last.fyear then
set DAPAMJ_1(keep=gvkey roa rename=(gvkey=ngvkey roa=nroa)) point=n;
else call missing(ngvkey, nroa);
if n(ngvkey, pgvkey) = 2 then
if abs(roa-proa) < abs(roa-nroa) then do;
_gvkey = pgvkey;
_roa = proa;
end;
else do;
_gvkey = ngvkey;
_roa = nroa;
end;
else do;
_gvkey = coalesce(ngvkey, pgvkey);
_roa = coalesce(nroa, proa);
end;
if not missing(_roa) then diff_roa = abs(roa-_roa);
drop ngvkey pgvkey nroa proa;
run;
PG
You are still including THREE non group by variables.
, a.roa
, a.r
, (a.r-b.r) as pamjone
How many observations in each of your input datasets.
Thank you~
Should I put those variables in group by?
I have 48,229 observations
48229 is a moderate size dataset... it shouldn't take forever to process. One problem might come from the combination of the left join and the ABS function. When there is no record on the right side of the join, i.e. when only one gvkey is present for a given sic2 and fyear, b.roa is missing and computing ABS(a.roa-b.roa) generates a note to the LOG. You could try avoiding those with the changes:
proc sql;
create table comp3 as select
a.gvkey, a.sic2, a.fyear, a.roa, a.r,
case b.roa when . then . else abs(a.roa-b.roa) end as diff_roa,
(a.r-b.r) as pamjone
from DAPAMJ_1 as
a left join DAPAMJ_1 as
b on a.sic2=b.sic2 and
a.fyear=b.fyear and a.gvkey^=b.gvkey
group by
a.gvkey, a.fyear, a.sic2
having (a.roa-b.roa)**2 = min((a.roa-b.roa)**2);
quit;
PG
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.