BookmarkSubscribeRSS Feed
mspak
Quartz | Level 8

Dear all,

I have file with the follwoign variables:

tic: company identification code

fyear: financial year

sic3: 3-digit SIC Industry (ie. indistry classification code with the first 3 digits)

zip: US 5-digit zipcode

I wish to calculate a variable, density of industrial firms, which is defined as log of one plus the number of firms in the same 3-digit SIC industry within 60-mile radius of the firms headquarters (tic).  

NUM_INDUSTRY = Number of firms in the same 3-digit SIC industry within 60-mile radius of the firms headquarters

DENSITY_INDUSTRY = LOG (1 + NUM_INDUSTRY)

How should I can pairs of firms with the same 3-digit SIC code? I know the way to calculate distances and number of firms within 60-mile radius. However, before, I can calculate the distance, I must have a list of firms with the same 3-digit SIC code for each firm-year (ie. thsi does mean for each firm every year).

Thank you.

Note: the SAS datafile attached.

mspak

19 REPLIES 19
Patrick
Opal | Level 21

What identifies a headquarter?

From your narrative I would expect to get a variable which identifies the headquarter (so only populated for records which are headquarter), a clustering variable (eg. TIC) and then a firm identifier (eg. SIC).

The data you provide don't give me this headquarter identifier. On the other hand there are a lot of duplicates in it.

proc sql;
  select count(*) as NObs, *
  from mspak.comp
  group by fyear,tic
  having count(*)>1
  ;
quit;

NObs  zip    tic      fyear SIC3
2     08543  0388B    2003  284
2     08543  0388B    2003  284
...
5     94105  3DOVRA   2003  153
5     94105  3DOVRA   2003  153
5     94105  3DOVRA   2003  153
5     94105  3DOVRA   2003  153
5     94105  3DOVRA   2003  153

.....

Are you sure your source data is as it should be? And if yes could you explain how to identify the headquarter and why there are duplicates?

mspak
Quartz | Level 8

Hi Patrick,

The datafile contains all firms with their hardquarters' zipcode in US. Let say, for example the following:

TIC FYEAR SIC ZIP

AA  2003   246  XXXXX

BB  2003   246  XXXXX

CC  2003   246  XXXXX

We assume that firms with SIC code of 246 only 3 firms. Then I have to calculate the distance between:

AA - BB

AA - CC

If the distance between AA-BB and AA-CC are within 60 mile-radius, then industry density for firm AA = log (2+1)

For firm BB,

I have to calculate distance between:

BB-AA

BB-CC

If only distance between BB-AA are within 60 mile-radius, the density = LOG (1+1)

AND SO FORTH.

mspak
Quartz | Level 8

Hi again,

The observations are unique for each firm-year. Each firm is possible to have same zipcode for every year. I couldn't remain only to have unique zip, tic and sic3 as firms might change their headquarters to another area which has different zipcodes, as such, you might see them repeated.

Thank you and I hope my explanation is clear.

mspak

mspak
Quartz | Level 8

Duplicates to be eliminated.

The uniqueness to be identified by

1. tic

2. fyear

Patrick
Opal | Level 21

Looks to me like a hash iter object approach could solve this in one data step. I'll post some code as soon as I have something working.

mspak
Quartz | Level 8

Thanks in advance for your helps.

Firm-year's zipcodes are origin zipcodes and the destination zipcode are other firms (other than origin) which are in the same SIC3.

Firm-years have chance to be origin and destination among themselves for the same SIC3.

Regards,

mspak

Patrick
Opal | Level 21

Can't provide you the final solution yet but need to stop for today. Thought I share with what I could come up so far. May be this helps someone else to come up with a solution.

I decided to go for a hash iter approach over a SQL approach because I believe it possibly will perform better.

First issue I had was an outdated sashelp.zipcode table. Function zipcitydistance() doesn't like it too much if it's fed with zipcodes which are not in this table.

A updated table can be obtained here: http://support.sas.com/rnd/datavisualization/mapsonline/html/misc.html

In order to continue working I excluded all zipcodes in your source data not in sashelp.zipcode (that's the where clause in the SQL which needs to be removed for a final version).

Where I'm currently stuck is how the iter object behaves. In my understanding there should be at least one match for every iteration of the data step (that's where a row matches with itself, so AA-AA). But in the 4th input row I get no match (variable count ends up to be zero - and therefore the log() falls over). Not sure yet why this happens and if it is because I'm missunderstanding how the iter object works - or if it is a bug (I was a bit lazy in applying patches so it is a possibility).

The code below should run without syntax errors and show the issue I'm currently having with the 4th obs from the input data set.

There is some comment missing in the code.... I haven't finished yet. Hope it's anyway of some use.


options ls=200;
libname mspak 'C:\Users\ssapam\Documents\My SAS Files\9.2';

proc sql;
  create table comp_2 as
    select distinct fyear,sic3,tic,zip
    from mspak.comp
    where zip in (select zip from sashelp.zipcode)  /* zipcitydistance() only works if zipcode exists in sashelp.zipcode */
    order by fyear,sic3
  ;
quit;

data want;
  set comp_2(rename=(tic=ThisTic zip=ThisZip) obs=4 firstobs=4 )
      comp_2(obs=0) /*needed for variable mapping for hash object */
    ;
  by fyear sic3;
  if _n_=1 then
  do;
    declare hash h1 (dataset:'work.comp_2',ordered:'a');
    _rc=h1.defineKey('fyear','sic3');
    _rc=h1.defineData('tic','zip');
    _rc=h1.defineDone();
    declare hiter hit1('h1');
  end;

  count=0;
  ThisFyear=fyear;
  ThisSic3=sic3;
  _rc=hit1.first();
  _rc=h1.find();

  do while(_rc=0);
    count+ (zipcitydistance(ThisZip, zip)<=60);
    distance=zipcitydistance(ThisZip, zip);
        /* for debugging. only uncomment together with obs=2 in set statement! */
        put (fyear sic3 tic zip) (=) /*@60 (ThisFyear ThisSic3 ThisTic ThisZip count distance) (=)*/;
    _rc=hit1.next();
    if ThisFyear ne fyear or ThisSic3 ne sic3 then leave;
  end;

  mutual_density = log (count);

run;

mspak
Quartz | Level 8

Thanks Patrick,

I need time to digest your program suggested as I am new to SQL. I will test run the program, I am not sure will this take long time for my notebook to process it?

Good nite (11 march 2012, 11.09pm now)

Mspak.

Haikuo
Onyx | Level 15

Patrick's direction is definitely applaudable. I have tried SQL first, and stop it after 30-mins of running. Here is my hash approach, I think we may not need to use hiter, as both table are identical.

After making sure that all of the zipcodes are included in the sashelp.zipcodes,

407  data want (drop=_:);

408     if _n_=1 then do;

409        set comp( rename=(tic=_tic zip=_zip) obs=1);

410        dcl hash h(dataset: 'comp (rename=(tic=_tic zip=_zip))', multidata: 'yes');

411        h.definekey('fyear','sic3');

412        h.definedata('_tic','_zip');

413        h.definedone();

414      end;

415

416        set comp;

417           _ct=0;

418           _rc=h.find();

419            do while (_rc=0);

420              if .<zipcitydistance(_zip,zip) <=60 then _ct+1;

421               h.has_next(result: _r);

422               if _r ne 0 then _rc=h.find_next();

423               else leave;

424            end;

425            DENSITY_INDUSTRY=log(_ct);

426

427  run;

NOTE: There were 66452 observations read from the data set WORK.COMP.

NOTE: There were 1 observations read from the data set WORK.COMP.

NOTE: There were 66452 observations read from the data set WORK.COMP.

NOTE: The data set WORK.WANT has 66452 observations and 5 variables.

NOTE: DATA statement used (Total process time):

      real time           5:26.81

      cpu time            5:17.99

Less than 6mins, Not too shady. BTW, the has_next() method is not a must-have here. You can just use:

data want (drop=_:);

   if _n_=1 then do;

      set comp( rename=(tic=_tic zip=_zip) obs=1);

      dcl hash h(dataset: 'comp (rename=(tic=_tic zip=_zip))', multidata: 'yes');

      h.definekey('fyear','sic3');

      h.definedata('_tic','_zip');

      h.definedone();

      dcl hiter hi('h');

    end;

      set comp ;

         _ct=0;

         _rc=h.find();

          do while (_rc=0);

            if .<zipcitydistance(_zip,zip) <=60 then _ct+1;

             _rc=h.find_next();

          end;

          DENSITY_INDUSTRY=log(_ct);

run;

I was hoping has_next will help the efficiency, well, it turned out no difference.

Regards,

Haikuo

Patrick
Opal | Level 21

Thanks Hai.kuo!

I wasn't aware of the possibilities given by 'multidata' together with find_next()

Learned something 🙂

@mspak

Go for Hai.kuo's second option. It looks pretty neat to me.

mspak
Quartz | Level 8

Thanks HaiKuo and Patrick,

Both hiter and hash objects are new to me. Thanks for sharing. I learned something new here.

The hash object provides a fast, easy way to perform lookups without sorting or indexing in data step. I think I have to take time to digest the program again by further reading on this method.

Running the program in data step within 6 minutes is great for me!!! I always took a few hours to run the program in SQL.

Will update in this discussion forum again in future if it proves successful.

Regards,

mspak

mspak
Quartz | Level 8

Hi Hai.Kuo,

I ran the program and it is efficient. As the definition of industry density is log(1+ number of firms located within 60-mile radius in the same industry, should I change the DENSITY_INDUSTRY as log (1+ct) instead of log(_ct)?

Thank you.

Regards,

mspak

Haikuo
Onyx | Level 15

Not by your definitions. _ct has already included the self-match, which by your definition is 1+count of match to others, so you don't need to use 1+_Ct to replace _ct in this context.

Haikuo

mspak
Quartz | Level 8

Thank you very much Smiley Happy

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 19 replies
  • 1566 views
  • 7 likes
  • 4 in conversation