Hello,
When I use proc sql; to create tables, I sometimes get far too many observations. For instance, I used this code:
proc sql;
create table control_vars as select a.code,a.year,a.Total_assets,a.Recievables,a.total_liabilities,a.inventory,a.Equity,
b.Firm_Age,c.indusCode,d.boardsize,d.number_of_independent_directors,d.Number_of_executives,d._The_total_remuneration_of_the_f,
d.Total_Top_Three_Compensation,d.Duality,d.totalSalary,e.Number_of_Board_Meetings,f.LocationCode,g.roa,g.roe,h.soe,i.TOBINQ1,
i.BTM1,i.Enterpris__Value_Multiplier,j.big4
from balance as a, firm_age as b, industry as c, boards as d, meeting as e, location as f, roa_roe as g, soe as h,
indictors as i, big4 as j
where a.code=b.code=c.code=d.code=e.code=f.code=g.code=i.code=j.code and a.year=d.year=e.year=g.year=h.year2=i.year=j.year2;
quit;
It took around 10 minutes for the statement to finish, and then I got a table, as shown in this log:
79 proc sql;
80 create table control_vars as select
80 ! a.code,a.year,a.Total_assets,a.Recievables,a.total_liabilities,a.inventory,a.Equity,
81 b.Firm_Age,c.indusCode,d.boardsize,d.number_of_independent_directors,d.Number_of_exec
81 ! utives,d._The_total_remuneration_of_the_f,
82 d.Total_Top_Three_Compensation,d.Duality,d.totalSalary,e.Number_of_Board_Meetings,f.L
82 ! ocationCode,g.roa,g.roe,h.soe,i.TOBINQ1,
83 i.BTM1,i.Enterpris__Value_Multiplier,j.big4
84 from balance as a, firm_age as b, industry as c, boards as d, meeting as e, location
84 ! as f, roa_roe as g, soe as h,
85 indictors as i, big4 as j
86 where a.code=b.code=c.code=d.code=e.code=f.code=g.code=i.code=j.code and
86 ! a.year=d.year=e.year=g.year=h.year2=i.year=j.year2;
NOTE: Table WORK.CONTROL_VARS created, with 54165899 rows and 25 columns.
87 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 11:13.68
cpu time 3:26.12
The issue is that the program is so slow.
Please, I need a solution.
When SQL creates a much larger number of observations than expected, it is usually because the datasets contain more than one row per combination of the join key values, and then SQL creates a "cartesian join" within each matching group. See this example:
data have1;
input year code value1;
cards;
2019 1 1
2019 1 2
;
run;
data have2;
input year code value2;
cards;
2019 1 3
2019 1 4
;
run;
proc sql;
create table want as
select
a.year,
a.code,
a.value1,
b.value2
from have1 a, have2 b
where a.year = b.year and a.code=b.code
;
quit;
proc print data=want noobs;
run;
The result:
year    code    value1    value2
2019       1         1         3
2019       1         1         4
2019       1         2         3
2019       1         2         4
shows that you get 2 * 2 observations.
Start by implementing Maxim 3 (Know Your Data) and see how many repeats of year/code combinations you have in the datasets; see the sketch below for one way to count them.
Then decide how you want to treat multiple occurrences in each individual dataset. Maybe you have an additional column that identifies the correct matches?
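Not your exact check, just a minimal sketch of how such a count could look, using the balance table and the code/year keys from your query; run the same query against each of the other input tables:
proc sql;
  /* list key combinations that occur more than once in this table */
  select code, year, count(*) as n_rows
  from balance
  group by code, year
  having count(*) > 1;
quit;
Any key combination that shows up here will multiply rows in the join.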
Lastly, proc sql is a known resource hog when doing (large-scale) joins. All data from the tables is dumped into a utility file, from which the matches are read with very many random accesses; this will slow down your disks, especially if they are not SSDs but spinning metal.
Consider sorting your datasets and doing a data step merge, as sketched below, but only after you know your datasets (see above), as a data step merge behaves differently from SQL when multiple repeats are present in more than one input dataset (no cartesian join).
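As a rough illustration only (not a drop-in replacement), a sort-and-merge version for two of your tables, assuming balance and boards both join by code and year, might look like this; the remaining tables would be added the same way, and the ones that join on year2 (soe, big4) would need that variable renamed or handled separately:
/* sort both inputs by the join keys */
proc sort data=balance;
  by code year;
run;

proc sort data=boards;
  by code year;
run;

/* inner-join-style merge: keep only rows present in both tables */
data control_vars;
  merge
    balance (in=a)
    boards  (in=b)
  ;
  by code year;
  if a and b;
run;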
@CJac73 wrote:
Kurt Bremser, I like the way you code, easy to read.
That's covered in another Maxim (#12). Consistent code layout with clear visibility of functional blocks is a great help in understanding, debugging and maintaining code. I would not be able to handle my responsibilities (>1000 ETL and report batch jobs) if the programs were not easy to maintain (or to write in the first place, thanks to my collection of custom macros).
@Kurt_Bremser Thanks, I have checked my datasets and found repetitions as you said.
Now it is working well.
Regards