Solved: Re: why i get too many rows

Radwan · Posted 03-11-2019 02:32 AM

Hello

when i use proc sql; to create tables sometimes i get too many observations for instance

i used this code

proc sql;
create table control_vars as select a.code,a.year,a.Total_assets,a.Recievables,a.total_liabilities,a.inventory,a.Equity,
b.Firm_Age,c.indusCode,d.boardsize,d.number_of_independent_directors,d.Number_of_executives,d._The_total_remuneration_of_the_f,
d.Total_Top_Three_Compensation,d.Duality,d.totalSalary,e.Number_of_Board_Meetings,f.LocationCode,g.roa,g.roe,h.soe,i.TOBINQ1,
i.BTM1,i.Enterpris__Value_Multiplier,j.big4
from balance as a, firm_age as b, industry as c, boards as d, meeting as e, location as f, roa_roe as g, soe as h,
indictors as i, big4 as j
where a.code=b.code=c.code=d.code=e.code=f.code=g.code=i.code=j.code and a.year=d.year=e.year=g.year=h.year2=i.year=j.year2;
quit;

then i took around 10 minutes till finished the statement then i got the table that contains the following

79 proc sql;
80 create table control_vars as select
80 ! a.code,a.year,a.Total_assets,a.Recievables,a.total_liabilities,a.inventory,a.Equity,
81 b.Firm_Age,c.indusCode,d.boardsize,d.number_of_independent_directors,d.Number_of_exec
81 ! utives,d._The_total_remuneration_of_the_f,
82 d.Total_Top_Three_Compensation,d.Duality,d.totalSalary,e.Number_of_Board_Meetings,f.L
82 ! ocationCode,g.roa,g.roe,h.soe,i.TOBINQ1,
83 i.BTM1,i.Enterpris__Value_Multiplier,j.big4
84 from balance as a, firm_age as b, industry as c, boards as d, meeting as e, location
84 ! as f, roa_roe as g, soe as h,
85 indictors as i, big4 as j
86 where a.code=b.code=c.code=d.code=e.code=f.code=g.code=i.code=j.code and
86 ! a.year=d.year=e.year=g.year=h.year2=i.year=j.year2;
NOTE: Table WORK.CONTROL_VARS created, with 54165899 rows and 25 columns.

87 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 11:13.68
cpu time 3:26.12

the issue the program be so slow

please i need a solution

Kurt_Bremser · Posted 03-11-2019 02:50 AM

When SQL creates a much larger number of observations than expected, it is usually because the datasets have more than one repeat of the join condition, and then SQL creates a "cartesian join". See this example:

data have1;
input year code value1;
cards;
2019 1 1
2019 1 2
;
run;

data have2;
input year code value2;
cards;
2019 1 3
2019 1 4
;
run;

proc sql;
create table want as
select
  a.year,
  a.code,
  a.value1,
  b.value2
from have1 a, have2 b
where a.year = b.year and a.code=b.code
;
quit;

proc print data=want noobs;
run;

The result:

year    code    value1    value2

2019      1        1         3  
2019      1        1         4  
2019      1        2         3  
2019      1        2         4

shows that you get 2 * 2 observations.

Start by implementing Maxim 3 (Know Your Data) and see how many repeats of year/code combinations you have in the datasets.

Then decide how you want to treat multiple occurences for each individual dataset. Maybe you have an additional columns that identfies matches?

Lastly, proc sql is a known resource hog when doing (large-scale) joins. All data from the tables is dumped into a utility file, from which the matches are read with very many random accesses, which will slow down your disks, especially if they're not SSD's but spinning metal.

Consider sorting your datasets and doing a data step merge, but only after you know your datasets (see above), as a data step merge behaves different than SQL when multiple repeats are present in more than 1 input dataset (no cartesian join).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Kurt_Bremser · Posted 03-11-2019 02:50 AM

When SQL creates a much larger number of observations than expected, it is usually because the datasets have more than one repeat of the join condition, and then SQL creates a "cartesian join". See this example:

data have1;
input year code value1;
cards;
2019 1 1
2019 1 2
;
run;

data have2;
input year code value2;
cards;
2019 1 3
2019 1 4
;
run;

proc sql;
create table want as
select
  a.year,
  a.code,
  a.value1,
  b.value2
from have1 a, have2 b
where a.year = b.year and a.code=b.code
;
quit;

proc print data=want noobs;
run;

The result:

year    code    value1    value2

2019      1        1         3  
2019      1        1         4  
2019      1        2         3  
2019      1        2         4

shows that you get 2 * 2 observations.

Start by implementing Maxim 3 (Know Your Data) and see how many repeats of year/code combinations you have in the datasets.

Then decide how you want to treat multiple occurences for each individual dataset. Maybe you have an additional columns that identfies matches?

Lastly, proc sql is a known resource hog when doing (large-scale) joins. All data from the tables is dumped into a utility file, from which the matches are read with very many random accesses, which will slow down your disks, especially if they're not SSD's but spinning metal.

Consider sorting your datasets and doing a data step merge, but only after you know your datasets (see above), as a data step merge behaves different than SQL when multiple repeats are present in more than 1 input dataset (no cartesian join).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

CJac73 · Posted 03-12-2019 06:42 PM

KirkBresmer, I like the way you code, easy to read.

Kurt_Bremser · Posted 03-13-2019 02:15 AM

@CJac73 wrote:
KirkBresmer, I like the way you code, easy to read.

That's covered in another Maxim (#12). Consistent code layout with a clear visability of functional blocks is a great help in understanding, debugging and maintaining code. I would not be able to handle my responsibilities (>1000 ETL and report batch jobs) if the programs would not be easy to maintain (or write in the first place, thanks to my collection of custom macros).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Radwan · Posted 03-13-2019 04:48 AM

@Kurt_Bremser thanks i have check my dataset and i found a repetition as you said

now it is working well

regards

Radwan · Posted 03-13-2019 04:51 AM

could you please have a look on my new question about how to match two datasets ?
i need help

Radwan · Posted 03-13-2019 04:44 AM

agreed

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away