Dear all,
The following SQL program is to calculate the market share difference between an audit firmwith its closest (in terms of distance) competitor in the audit market. The step 4 below - to find the closest audit firm is too time consuming. I would be much appreciate if anyone can find a solution that can save the processing time, possibly in hash object.
*step 1: calculate the total fee of each auditor in the market;
proc sql;
create table sum1 as
select *,sum(audit_fees) as fee_sum
from cluster.forum
group by auditor_fkey, fyear,sic2,cbsa;
quit;
*step 2: calculate the total fees charged in the market;
proc sql;
create table sum2 as
select *,sum(audit_fees) as market_fee
from sum1
group by fyear,sic2,cbsa;
quit;
*step 3: calculate the market share of each auditr in the market;
data sum3;
set sum2;
fee_shr = fee_sum/market_fee;
run;
*step 4: calculate the closest competitors in the market
(market size is depend on CBSA, SIC2 and fyear);
proc sql;
create table cluster.dist as
select a.*, min(zipcitydistance(a.zip1,b.zip1)) as min_dist
from sum3 a , sum3 b
group by a.fyear, a.sic2, a.cbsa
having not missing(calculated min_dist);
quit;
*step5: only output the data with closest competitors in the market;
Thank you in advance.
Regadrs,
MSPAK
Hi,
There is no test data given (datastep please) so not able to code. However if you find SQL is taking to long, consider converting it to a datastep. None of those proc sql steps are doing anything particularly difficult to in datastep, or in fact using the procedures supplied by Base SAS. For instance, your first two steps could be done in one datastep. The last step, it seems to be working on all records from sum3 * sum3, which could lead to lots of data merging together. Bit difficult to say without some test data to run it on.
data sum3; set cluster.forum; retain fee_sum market_fee fee_shr; by auditor_fkey fyear sic2 cbsa; /* Note I am assuming auditor_fkey is the main group here */ if first.auditor_fkey then fee_sum=audit_fees; else fee_sum=sum(fee_sum,audit_fees); if first.fyear then fee_shr=audit_fees; else fee_shr=sum(fee_shr,audit_fees); if last.auditor_fkey then do; fee_shr=fee_sum/market_fee; output; end; run;
Sorry, am not downloading attachments. Code converted for the first 3 steps is given, same can be applied to the last one, with your data. I would look specifically at this part:
from sum3 a , sum3 b
group by a.fyear, a.sic2, a.cbsa
having not missing(calculated min_dist);
This is creating a table of all rows from sum3 * all rows from sum3. This creates lots of data, and I don't know why it is programmed that way. Break it out into separate steps if you have to, i.e. create a dataset with the records you want to see, (just do the merge). How many records is that?
Hi again,
Thank you for your code. I can process the data using SQL for the first 3 steps, but not for the step 4.
The data sum3 should includes the market share for each audit firm in the market. The audit market boundary is defined by sic2 (industrry codes), fyear (years) and CBSA. CBSA is Core Based Statistical Area (refer https://en.wikipedia.org/wiki/Core-based_statistical_area).
I would like to identify the market share difference between an audit firm with its closest competitor in the market. Zip codes (zip1) is given the dataset. zipcitydistance function should be used (http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a003113156.htm) to be used for calculating the distance between 2 zipcodes in USA.
It is one to many matching, but I only wish to output the obs with closest competitor. The output should contain the unique auditor_fkey (auditor identification code) with their closest competitor in the market.
Thank you.
Regards,
mspak
When I perform the SQL on step 4, the following message recorded:
NOTE: The execution of this query involves performing one or more Cartesian product joins that
can not be optimized.
NOTE: The query requires remerging summary statistics back with the original data.
I am asking for helps in solving the problems of huge data as a result of Cartesian joins.
Thank you.
Regards,
MSPAK
The following perhaps can work:
*step 1: calculate the total fee of each auditor in the market;
proc sql;
create table sum1 as
select *,sum(audit_fees) as fee_sum
from cluster.forum
group by auditor_fkey, fyear,sic2,cbsa;
quit;
*step 2: calculate the total fees charged in the market;
proc sql;
create table sum2 as
select *,sum(audit_fees) as market_fee
from sum1
group by fyear,sic2,cbsa;
quit;
*step 3: calculate the market share of each auditr in the market;
data sum3;
set sum2;
fee_shr = fee_sum/market_fee;
run;
*step 4: calculate the absolute market share difference between current auditor & closest competitors in the market
(market size is depend on CBSA, SIC2 and fyear);
data cluster.dist (drop=_:);
if _n_=1 then do;
set sum3( rename=(zip1=_zip1) obs=1);
dcl hash h(dataset: 'sum3 (rename=(zip1=_zip1 fee_shr = _fee_shr))', multidata: 'yes');
h.definekey('fyear','sic2', 'cbsa');
h.definedata(all: 'YES');
h.definedone();
end;
set sum3;
_ct=0;
_rc=h.find();
do while (_rc=0);
distance = zipcitydistance(_zip1,zip1);
shr_diff = abs(sum(fee_shr, - _fee_shr));
if zipcitydistance(_zip1,zip1) =min(zipcitydistance(_zip1,zip1))
and abs(sum(fee_shr, - _fee_shr)) = min (abs(sum(fee_shr, - _fee_shr)))
then _ct1+1;
h.has_next(result: _r);
if _r ne 0 then _rc=h.find_next();
else leave;
end;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.