Dear all,
I have file with the follwoign variables:
tic: company identification code
fyear: financial year
sic3: 3-digit SIC Industry (ie. indistry classification code with the first 3 digits)
zip: US 5-digit zipcode
I wish to calculate a variable, density of industrial firms, which is defined as log of one plus the number of firms in the same 3-digit SIC industry within 60-mile radius of the firms headquarters (tic).
NUM_INDUSTRY = Number of firms in the same 3-digit SIC industry within 60-mile radius of the firms headquarters
DENSITY_INDUSTRY = LOG (1 + NUM_INDUSTRY)
How should I can pairs of firms with the same 3-digit SIC code? I know the way to calculate distances and number of firms within 60-mile radius. However, before, I can calculate the distance, I must have a list of firms with the same 3-digit SIC code for each firm-year (ie. thsi does mean for each firm every year).
Thank you.
Note: the SAS datafile attached.
mspak
What identifies a headquarter?
From your narrative I would expect to get a variable which identifies the headquarter (so only populated for records which are headquarter), a clustering variable (eg. TIC) and then a firm identifier (eg. SIC).
The data you provide don't give me this headquarter identifier. On the other hand there are a lot of duplicates in it.
proc sql;
select count(*) as NObs, *
from mspak.comp
group by fyear,tic
having count(*)>1
;
quit;
NObs zip tic fyear SIC3
2 08543 0388B 2003 284
2 08543 0388B 2003 284
...
5 94105 3DOVRA 2003 153
5 94105 3DOVRA 2003 153
5 94105 3DOVRA 2003 153
5 94105 3DOVRA 2003 153
5 94105 3DOVRA 2003 153
.....
Are you sure your source data is as it should be? And if yes could you explain how to identify the headquarter and why there are duplicates?
Hi Patrick,
The datafile contains all firms with their hardquarters' zipcode in US. Let say, for example the following:
TIC FYEAR SIC ZIP
AA 2003 246 XXXXX
BB 2003 246 XXXXX
CC 2003 246 XXXXX
We assume that firms with SIC code of 246 only 3 firms. Then I have to calculate the distance between:
AA - BB
AA - CC
If the distance between AA-BB and AA-CC are within 60 mile-radius, then industry density for firm AA = log (2+1)
For firm BB,
I have to calculate distance between:
BB-AA
BB-CC
If only distance between BB-AA are within 60 mile-radius, the density = LOG (1+1)
AND SO FORTH.
Hi again,
The observations are unique for each firm-year. Each firm is possible to have same zipcode for every year. I couldn't remain only to have unique zip, tic and sic3 as firms might change their headquarters to another area which has different zipcodes, as such, you might see them repeated.
Thank you and I hope my explanation is clear.
mspak
Duplicates to be eliminated.
The uniqueness to be identified by
1. tic
2. fyear
Looks to me like a hash iter object approach could solve this in one data step. I'll post some code as soon as I have something working.
Thanks in advance for your helps.
Firm-year's zipcodes are origin zipcodes and the destination zipcode are other firms (other than origin) which are in the same SIC3.
Firm-years have chance to be origin and destination among themselves for the same SIC3.
Regards,
mspak
Can't provide you the final solution yet but need to stop for today. Thought I share with what I could come up so far. May be this helps someone else to come up with a solution.
I decided to go for a hash iter approach over a SQL approach because I believe it possibly will perform better.
First issue I had was an outdated sashelp.zipcode table. Function zipcitydistance() doesn't like it too much if it's fed with zipcodes which are not in this table.
A updated table can be obtained here: http://support.sas.com/rnd/datavisualization/mapsonline/html/misc.html
In order to continue working I excluded all zipcodes in your source data not in sashelp.zipcode (that's the where clause in the SQL which needs to be removed for a final version).
Where I'm currently stuck is how the iter object behaves. In my understanding there should be at least one match for every iteration of the data step (that's where a row matches with itself, so AA-AA). But in the 4th input row I get no match (variable count ends up to be zero - and therefore the log() falls over). Not sure yet why this happens and if it is because I'm missunderstanding how the iter object works - or if it is a bug (I was a bit lazy in applying patches so it is a possibility).
The code below should run without syntax errors and show the issue I'm currently having with the 4th obs from the input data set.
There is some comment missing in the code.... I haven't finished yet. Hope it's anyway of some use.
options ls=200;
libname mspak 'C:\Users\ssapam\Documents\My SAS Files\9.2';
proc sql;
create table comp_2 as
select distinct fyear,sic3,tic,zip
from mspak.comp
where zip in (select zip from sashelp.zipcode) /* zipcitydistance() only works if zipcode exists in sashelp.zipcode */
order by fyear,sic3
;
quit;
data want;
set comp_2(rename=(tic=ThisTic zip=ThisZip) obs=4 firstobs=4 )
comp_2(obs=0) /*needed for variable mapping for hash object */
;
by fyear sic3;
if _n_=1 then
do;
declare hash h1 (dataset:'work.comp_2',ordered:'a');
_rc=h1.defineKey('fyear','sic3');
_rc=h1.defineData('tic','zip');
_rc=h1.defineDone();
declare hiter hit1('h1');
end;
count=0;
ThisFyear=fyear;
ThisSic3=sic3;
_rc=hit1.first();
_rc=h1.find();
do while(_rc=0);
count+ (zipcitydistance(ThisZip, zip)<=60);
distance=zipcitydistance(ThisZip, zip);
/* for debugging. only uncomment together with obs=2 in set statement! */
put (fyear sic3 tic zip) (=) /*@60 (ThisFyear ThisSic3 ThisTic ThisZip count distance) (=)*/;
_rc=hit1.next();
if ThisFyear ne fyear or ThisSic3 ne sic3 then leave;
end;
mutual_density = log (count);
run;
Thanks Patrick,
I need time to digest your program suggested as I am new to SQL. I will test run the program, I am not sure will this take long time for my notebook to process it?
Good nite (11 march 2012, 11.09pm now)
Mspak.
Patrick's direction is definitely applaudable. I have tried SQL first, and stop it after 30-mins of running. Here is my hash approach, I think we may not need to use hiter, as both table are identical.
After making sure that all of the zipcodes are included in the sashelp.zipcodes,
407 data want (drop=_:);
408 if _n_=1 then do;
409 set comp( rename=(tic=_tic zip=_zip) obs=1);
410 dcl hash h(dataset: 'comp (rename=(tic=_tic zip=_zip))', multidata: 'yes');
411 h.definekey('fyear','sic3');
412 h.definedata('_tic','_zip');
413 h.definedone();
414 end;
415
416 set comp;
417 _ct=0;
418 _rc=h.find();
419 do while (_rc=0);
420 if .<zipcitydistance(_zip,zip) <=60 then _ct+1;
421 h.has_next(result: _r);
422 if _r ne 0 then _rc=h.find_next();
423 else leave;
424 end;
425 DENSITY_INDUSTRY=log(_ct);
426
427 run;
NOTE: There were 66452 observations read from the data set WORK.COMP.
NOTE: There were 1 observations read from the data set WORK.COMP.
NOTE: There were 66452 observations read from the data set WORK.COMP.
NOTE: The data set WORK.WANT has 66452 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 5:26.81
cpu time 5:17.99
Less than 6mins, Not too shady. BTW, the has_next() method is not a must-have here. You can just use:
data want (drop=_:);
if _n_=1 then do;
set comp( rename=(tic=_tic zip=_zip) obs=1);
dcl hash h(dataset: 'comp (rename=(tic=_tic zip=_zip))', multidata: 'yes');
h.definekey('fyear','sic3');
h.definedata('_tic','_zip');
h.definedone();
dcl hiter hi('h');
end;
set comp ;
_ct=0;
_rc=h.find();
do while (_rc=0);
if .<zipcitydistance(_zip,zip) <=60 then _ct+1;
_rc=h.find_next();
end;
DENSITY_INDUSTRY=log(_ct);
run;
I was hoping has_next will help the efficiency, well, it turned out no difference.
Regards,
Haikuo
Thanks Hai.kuo!
I wasn't aware of the possibilities given by 'multidata' together with find_next()
Learned something 🙂
@mspak
Go for Hai.kuo's second option. It looks pretty neat to me.
Thanks HaiKuo and Patrick,
Both hiter and hash objects are new to me. Thanks for sharing. I learned something new here.
The hash object provides a fast, easy way to perform lookups without sorting or indexing in data step. I think I have to take time to digest the program again by further reading on this method.
Running the program in data step within 6 minutes is great for me!!! I always took a few hours to run the program in SQL.
Will update in this discussion forum again in future if it proves successful.
Regards,
mspak
Hi Hai.Kuo,
I ran the program and it is efficient. As the definition of industry density is log(1+ number of firms located within 60-mile radius in the same industry, should I change the DENSITY_INDUSTRY as log (1+ct) instead of log(_ct)?
Thank you.
Regards,
mspak
Not by your definitions. _ct has already included the self-match, which by your definition is 1+count of match to others, so you don't need to use 1+_Ct to replace _ct in this context.
Haikuo
Thank you very much
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.