Help using Base SAS procedures

Hash object

Reply
Regular Contributor
Posts: 162

Hash object

Dear all,

 

The following SQL program is to calculate the market share difference between an audit firmwith its closest (in terms of distance) competitor in the audit market. The step 4 below - to find the closest audit firm is too time consuming. I would be much appreciate if anyone can find a solution that can save the processing time, possibly in hash object. 

 

*step 1: calculate the total fee of each auditor in the market;

proc sql;
create table sum1 as
select *,sum(audit_fees) as fee_sum
from cluster.forum
group by auditor_fkey, fyear,sic2,cbsa;
quit;

*step 2: calculate the total fees charged in the market;
proc sql;
create table sum2 as
select *,sum(audit_fees) as market_fee
from sum1
group by fyear,sic2,cbsa;
quit;

*step 3: calculate the market share of each auditr in the market;

data sum3;
set sum2;
fee_shr = fee_sum/market_fee;
run;

*step 4: calculate the closest competitors in the market
(market size is depend on CBSA, SIC2 and fyear);

proc sql;
create table cluster.dist as
select a.*, min(zipcitydistance(a.zip1,b.zip1)) as min_dist
from sum3 a , sum3 b
group by a.fyear, a.sic2, a.cbsa
having not missing(calculated min_dist);
quit;

*step5: only output the data with closest competitors in the market;

 

 

Thank you in advance.

 

Regadrs,

MSPAK

Attachment
Super User
Super User
Posts: 7,942

Re: Hash object

[ Edited ]

Hi,

 

There is no test data given (datastep please) so not able to code.  However if you find SQL is taking to long, consider converting it to a datastep.  None of those proc sql steps are doing anything particularly difficult to in datastep, or in fact using the procedures supplied by Base SAS.  For instance, your first two steps could be done in one datastep.  The last step, it seems to be working on all records from sum3 * sum3, which could lead to lots of data merging together.  Bit difficult to say without some test data to run it on.

data sum3;
  set cluster.forum;
  retain fee_sum market_fee fee_shr;
  by auditor_fkey fyear sic2 cbsa;
  /* Note I am assuming auditor_fkey is the main group here */
  if first.auditor_fkey then fee_sum=audit_fees;
  else fee_sum=sum(fee_sum,audit_fees);
  if first.fyear then fee_shr=audit_fees;
  else fee_shr=sum(fee_shr,audit_fees);
  if last.auditor_fkey then do;
    fee_shr=fee_sum/market_fee;
    output;
  end;
run;
  
Regular Contributor
Posts: 162

Re: Hash object

Hi,

Thank you for your mail. I attached the dataset in zip file in my previous post. My first 3 steps can be done without any problem, but not for the step 4. I understand the hash object can help much, but I find the codes in hash objects are difficult. I am seeking for any helps in coding, particularly in hash object or any programs that can reduce processing time.

Thank you.
Super User
Super User
Posts: 7,942

Re: Hash object

Sorry, am not downloading attachments.  Code converted for the first 3 steps is given, same can be applied to the last one, with your data.  I would look specifically at this part:

from sum3 a , sum3 b
group by a.fyear, a.sic2, a.cbsa
having not missing(calculated min_dist);

 

This is creating a table of all rows from sum3 * all rows from sum3.  This creates lots of data, and I don't know why it is programmed that way.  Break it out into separate steps if you have to, i.e. create a dataset with the records you want to see, (just do the merge).  How many records is that?  

Regular Contributor
Posts: 162

Re: Hash object

Hi again,

 

Thank you for your code. I can process the data using SQL for the first 3 steps, but not for the step 4. 

 

The data sum3 should includes the market share for each audit firm in the market. The audit market boundary is defined by sic2 (industrry codes), fyear (years) and CBSA. CBSA is Core Based Statistical Area (refer https://en.wikipedia.org/wiki/Core-based_statistical_area). 

 

I would like to identify the market share difference between an audit firm with its closest competitor in the market. Zip codes (zip1) is given the dataset. zipcitydistance function should be used (http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a003113156.htm) to be used for calculating the distance between 2 zipcodes in USA. 

 

It is one to many matching, but I only wish to output the obs with closest competitor. The output should contain the unique auditor_fkey (auditor identification code) with their closest competitor in the market.

 

Thank you.

 

Regards,

mspak

 

Regular Contributor
Posts: 162

Re: Hash object

When I perform the SQL on step 4, the following message recorded:

NOTE: The execution of this query involves performing one or more Cartesian product joins that
can not be optimized.

NOTE: The query requires remerging summary statistics back with the original data.

 

I am asking for helps in solving the problems of huge data as a result of Cartesian joins. 

 

Thank you. 

 

Regards,

MSPAK

Regular Contributor
Posts: 162

Re: Hash object

The following perhaps can work:

*step 1: calculate the total fee of each auditor in the market;

proc sql;
create table sum1 as
select *,sum(audit_fees) as fee_sum
  from cluster.forum
   group by auditor_fkey, fyear,sic2,cbsa;
quit;

*step 2: calculate the total fees charged in the market;
proc sql;
create table sum2 as
select *,sum(audit_fees) as market_fee
  from sum1
   group by fyear,sic2,cbsa;
quit;

*step 3: calculate the market share of each auditr in the market;

data sum3;
set sum2;
fee_shr = fee_sum/market_fee;
run;

*step 4: calculate the absolute market share difference between current auditor & closest competitors in the market
(market size is depend on CBSA, SIC2 and fyear);


data cluster.dist (drop=_Smiley Happy;
if _n_=1 then do;
set sum3( rename=(zip1=_zip1) obs=1);
dcl hash h(dataset: 'sum3 (rename=(zip1=_zip1 fee_shr = _fee_shr))', multidata: 'yes');
h.definekey('fyear','sic2', 'cbsa');
h.definedata(all: 'YES');
h.definedone();
end;

set sum3;
_ct=0;
_rc=h.find();
do while (_rc=0);
distance = zipcitydistance(_zip1,zip1);
shr_diff = abs(sum(fee_shr, - _fee_shr));
if zipcitydistance(_zip1,zip1) =min(zipcitydistance(_zip1,zip1))
and abs(sum(fee_shr, - _fee_shr)) = min (abs(sum(fee_shr, - _fee_shr)))
then _ct1+1;
h.has_next(result: _r);
if _r ne 0 then _rc=h.find_next();
else leave;
end;

run;

Ask a Question
Discussion stats
  • 6 replies
  • 429 views
  • 0 likes
  • 2 in conversation