BookmarkSubscribeRSS Feed
FriedEgg
SAS Employee

I reworked an alternate solution using you provided data on a personal machine.

/* remove a few duplicates that existed using your statement that records should be unique by tic and fyear */

proc sort data=in.comp out=comp nodupkey; by tic fyear; run;

/* make sure you update your zipcode dataset using the information provided by Patrick to reduce the number of rejected records */

proc sql;

create table foobar as

select a.zip,

        a.tic,

                    a.fyear,

                    a.sic3,

        sum(zipcitydistance(a.zip,c.zip)<60) as num_industry,

                    log(1 + calculated num_industry) as density_industry

   from comp a,

        sashelp.zipcode b,

                    comp c

  where a.zip=b.zip and c.zip=b.zip and a.tic ne c.tic and a.fyear=c.fyear and a.sic3=c.sic3

  group by a.zip,a.tic,a.fyear,a.sic3

  order by a.tic,a.fyear;

quit;

/* bring back the rejected records, leave the num_industry and density_industry as null or set to a desired other value */

data comp;

merge comp(in=b) foobar(in=a);

by tic fyear;

if a or (b and not a) then output;

run;

Full program runtime 2.87 seconds on my system.

mspak
Quartz | Level 8

Thanks FriedEgg,

I am thinking what is the purpose to have the source tables of "comp" 2 times (ie. comp a and comp c)? Is it for the matching purpose. I can see you set the where clause - a.tic ne c.tic & a.sic3=c.sic3. Does the "a.tic ne c.tic" indirectly assume the "comp a" as the list of firms that need matching and "com c" as the firms to be matched.  Then, matching criteria can be indirectly identified by the where clause "a.sic3=c.sic3". 

Processing time can be saved if we can set the criteria in the SQL wisely. SAS Hash object is also a high technology in SAS program.

Thanks for sharing. I learnt something from you too Smiley Happy

Regards,

MSPAK

FriedEgg
SAS Employee

The comp dataset is referenced twice to perform a self join, as you assumed.  'comp a' as you say will be the set of reference here and 'comp c' will be applied in similar fashion to a left outer join on year and sic3.  The reason for the tic condition was to avoid matching a record to itself.  In this situation SQL is much faster than the proposed hash solution due to the order of processing and the removal of the invalid zips before the density calculation.

mspak
Quartz | Level 8

Thanks FriedEgg for confirmation.

I am new to SAS and have used the program for merely 2 - 3 months. I have another question on the WHERE clause:

Do you mean that if the WHERE clause of "a.tic ne c.tic" is not available, then the firms (identified by tic) will be self-matched? If I never indicated them in the WHERE clause as "a.tic=c.tic" would this also generate the same result?

As per the following WHERE clause:

a.zip=b.zip and c.zip=b.zip and you never indicate that a.zip=c.zip, does it indirectly indicate a.zip ne c.zip?

In short, my question is that any difference if the "ne" indicated in WHERE clause compared to have no such a clause?

Once again, thank you very much for helps.

Regards,

mspak

FriedEgg
SAS Employee

a.tic ne c.tic -- match all values of tic other than itself

a.tic eq c.tic -- match only itself

not mentioning a.tic or c.tic -- match all values of tic to all values of tic

a.zip=b.zip -- this is used to validate the zipcodes in 'comp a' if a.zip is not present in 'sashelp.zipcode b' then the record is removed

b.zip=c.zip -- this is used to validate the zipcodes in 'comp c' if a.zip is not present in 'sashelp.zipcode b' then the record is removed

a.zip=c.zip -- this would only match firms in the same zip to each other for density calculation, which would be wrong.  It does not indirectly imply a.zip ne c.zip which would match all firms except those in the same zip code to eachother, also wrong.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 19 replies
  • 1562 views
  • 7 likes
  • 4 in conversation