BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Radwan
Quartz | Level 8

Hello 

when i use proc sql; to create tables sometimes i get too many observations for instance 

i used this code 

proc sql;
create table control_vars as select a.code,a.year,a.Total_assets,a.Recievables,a.total_liabilities,a.inventory,a.Equity,
b.Firm_Age,c.indusCode,d.boardsize,d.number_of_independent_directors,d.Number_of_executives,d._The_total_remuneration_of_the_f,
d.Total_Top_Three_Compensation,d.Duality,d.totalSalary,e.Number_of_Board_Meetings,f.LocationCode,g.roa,g.roe,h.soe,i.TOBINQ1,
i.BTM1,i.Enterpris__Value_Multiplier,j.big4
from balance as a, firm_age as b, industry as c, boards as d, meeting as e, location as f, roa_roe as g, soe as h,
indictors as i, big4 as j
where a.code=b.code=c.code=d.code=e.code=f.code=g.code=i.code=j.code and a.year=d.year=e.year=g.year=h.year2=i.year=j.year2;
quit;

then i took around 10 minutes till finished the statement then i got the table that contains the following 

 

79 proc sql;
80 create table control_vars as select
80 ! a.code,a.year,a.Total_assets,a.Recievables,a.total_liabilities,a.inventory,a.Equity,
81 b.Firm_Age,c.indusCode,d.boardsize,d.number_of_independent_directors,d.Number_of_exec
81 ! utives,d._The_total_remuneration_of_the_f,
82 d.Total_Top_Three_Compensation,d.Duality,d.totalSalary,e.Number_of_Board_Meetings,f.L
82 ! ocationCode,g.roa,g.roe,h.soe,i.TOBINQ1,
83 i.BTM1,i.Enterpris__Value_Multiplier,j.big4
84 from balance as a, firm_age as b, industry as c, boards as d, meeting as e, location
84 ! as f, roa_roe as g, soe as h,
85 indictors as i, big4 as j
86 where a.code=b.code=c.code=d.code=e.code=f.code=g.code=i.code=j.code and
86 ! a.year=d.year=e.year=g.year=h.year2=i.year=j.year2;
NOTE: Table WORK.CONTROL_VARS created, with 54165899 rows and 25 columns.

87 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 11:13.68
cpu time 3:26.12

 

 

the issue the program be so slow 

please i need a solution

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

When SQL creates a much larger number of observations than expected, it is usually because the datasets have more than one repeat of the join condition, and then SQL creates a "cartesian join". See this example:

data have1;
input year code value1;
cards;
2019 1 1
2019 1 2
;
run;

data have2;
input year code value2;
cards;
2019 1 3
2019 1 4
;
run;

proc sql;
create table want as
select
  a.year,
  a.code,
  a.value1,
  b.value2
from have1 a, have2 b
where a.year = b.year and a.code=b.code
;
quit;

proc print data=want noobs;
run;

The result:

year    code    value1    value2

2019      1        1         3  
2019      1        1         4  
2019      1        2         3  
2019      1        2         4  

shows that you get 2 * 2 observations.

 

Start by implementing Maxim 3 (Know Your Data) and see how many repeats of year/code combinations you have in the datasets.

Then decide how you want to treat multiple occurences for each individual dataset. Maybe you have an additional columns that identfies matches?

Lastly, proc sql is a known resource hog when doing (large-scale) joins. All data from the tables is dumped into a utility file, from which the matches are read with very many random accesses, which will slow down your disks, especially if they're not SSD's but spinning metal.

Consider sorting your datasets and doing a data step merge, but only after you know your datasets (see above), as a data step merge behaves different than SQL when multiple repeats are present in more than 1 input dataset (no cartesian join).

View solution in original post

6 REPLIES 6
Kurt_Bremser
Super User

When SQL creates a much larger number of observations than expected, it is usually because the datasets have more than one repeat of the join condition, and then SQL creates a "cartesian join". See this example:

data have1;
input year code value1;
cards;
2019 1 1
2019 1 2
;
run;

data have2;
input year code value2;
cards;
2019 1 3
2019 1 4
;
run;

proc sql;
create table want as
select
  a.year,
  a.code,
  a.value1,
  b.value2
from have1 a, have2 b
where a.year = b.year and a.code=b.code
;
quit;

proc print data=want noobs;
run;

The result:

year    code    value1    value2

2019      1        1         3  
2019      1        1         4  
2019      1        2         3  
2019      1        2         4  

shows that you get 2 * 2 observations.

 

Start by implementing Maxim 3 (Know Your Data) and see how many repeats of year/code combinations you have in the datasets.

Then decide how you want to treat multiple occurences for each individual dataset. Maybe you have an additional columns that identfies matches?

Lastly, proc sql is a known resource hog when doing (large-scale) joins. All data from the tables is dumped into a utility file, from which the matches are read with very many random accesses, which will slow down your disks, especially if they're not SSD's but spinning metal.

Consider sorting your datasets and doing a data step merge, but only after you know your datasets (see above), as a data step merge behaves different than SQL when multiple repeats are present in more than 1 input dataset (no cartesian join).

CJac73
Obsidian | Level 7
KirkBresmer, I like the way you code, easy to read.
Kurt_Bremser
Super User

@CJac73 wrote:
KirkBresmer, I like the way you code, easy to read.

That's covered in another Maxim (#12). Consistent code layout with a clear visability of functional blocks is a great help in understanding, debugging and maintaining code. I would not be able to handle my responsibilities (>1000 ETL and report batch jobs) if the programs would not be easy to maintain (or write in the first place, thanks to my collection of custom macros).

Radwan
Quartz | Level 8

@Kurt_Bremser  thanks i have check my dataset and i found a repetition  as you said 

now it is working well 

regards 

Radwan
Quartz | Level 8
could you please have a look on my new question about how to match two datasets ?
i need help
Radwan
Quartz | Level 8
agreed

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 2035 views
  • 5 likes
  • 3 in conversation