BookmarkSubscribeRSS Feed
mspak
Quartz | Level 8

Good days to all,

I recently used SQL to generate possible pairs of geograhic zipcodes between one location to another. I have about 13,000 firms per year for 8 years. Therefore, number of firm-years is estimated to be 104,000. Then, I have to calculate the distance between every firm-year to all of the mutual fund companies (which is about 17,000 firms for each year). By the end, I have to calculate mutual fund density, which is defined as log (1+mutual fund companies with distance within 60 miles radius from the firm).    

The submitted program was run for 7 hours but I failed to generate output due to insufficient spaces. My notebook has free space about 365 GB but yet insufficient for storing the output. I am looking forward any response which can solve the program.

My concerns:

1. How to submit the program in batches?

2. How can I estimate the processing time. I couldn't find any indicator on the estimated time and storage needed for running programs.

3. Any recommendations and suggestions...etc.

The following is my program (with libname "geog"):

proc sql;

create table geog.fundcom as

select a.fyear, tic, crsp_fundn, zipcitydistance(zipcode, zip)as mutualdist

from geog.comp a , geog.Mutualcom b

where a.fyear=b.year

group by a.fyear,tic;

having not missing(calculated mutualdist);

quit;

proc sql;

create table geog.mutualnum as

select fyear, tic, sum(0<=mutualdist<=60) as mutualnum

from geog.fundcom

group by tic, fyear;

quit; run;

data geog.mutualdensity; set geog.mutualnum;

mutual_density = log (1+mutualnum);

run;

Thank you.

Mspak

7 REPLIES 7
art297
Opal | Level 21

Your code shouldn't have worked as posted  The first proc sql had a semicolon after the group by clause.  Plus, why create a big file only to eliminate the unnecessary records?

Doe the following accomplish the initial step you wanted?:

proc sql;

  create table geog.fundcom as

    select a.fyear,

           tic,

           crsp_fundn,

           zipcitydistance(zipcode, zip)as mutualdist,

           count(*) as mutualnum

      from geog.comp a , geog.Mutualcom b

        where a.fyear=b.year

          group by a.fyear,tic

            having calculated mutualdist between 0 and 60

  ;

quit;

mspak
Quartz | Level 8

Thanks art297 for your kind helps especially to me who is a new SAS user.

I ran the program and it proves successful. There are many obervations with different pairs of possible origin and destination. In fact I wish to have mutualnum for each tic-fyear. Therefore, the crsp_fundn is not important to be included. I have modified it to generate my desired output format.

I came across a data step approach which can significantly reduce the processing time, namely DATA Step Hash Object. I am not sure whether this method can be used for this example and can this method further reduces the data storage and time?

Have a nice day Smiley Happy

Regards,

mspak

Haikuo
Onyx | Level 15

Hi Mspak,

My solution to your another question could be applied to this senario without many changes:

data want (drop=_:);

  if _n_=1 then do;

     set geog.mutualcom (obs=1 rename=(year=fyear zip=_zip));

dcl hash h(dataset:'geog.mutualcom(rename=(year=fyear zip=_zip))', multidata:'yes', hashexp:16);

h.definekey('fyear');

h.definedata('_zip');

h.definedone();

end;

set geog.comp;

   _ct=0;

   _rc=find();

   do while (_rc=0);

      if .< zipcitydistance(zipcode, _zip)<=60 then _ct+1;

_rc=find_next();

end;

    mutual_density=log(1+_ct);

run;

of couse not tested, so my code is subjected to typo and other errors.

Regards,

Haikuo

Edit: Before running the code, I would clean up the tables to make sure all of the zip codes are legit per zipcitydistance().

mspak
Quartz | Level 8

Thanks HaiKuo for helps,

I will test run the program too. Learning SAS Programming with helps from all the experts here is fun Smiley Happy. Hope that I can master this skill one day!!!

Goodnite (12 Marc 2012, 11.57pm)

Regards,

mspak

mspak
Quartz | Level 8

Hi art297 and Hai.kuo,

I tested program suggested by Hai.kuo, some error messages on log:

16   data geog.want (drop=_:);

17

18     if _n_=1 then do;

19

20        set geog.mutualcom (obs=1 rename=(year=fyear zip=_zip));

21

22   dcl hash h(dataset:'geog.mutualcom(rename=(year=fyear zip=_zip))', multidata:'yes',

22 ! hashexp:16);

23

24   h.definekey('fyear');

25

26   h.definedata('_zip');

27

28   h.definedone();

29

30   end;

31

32

33

34   set geog.comp;

35

36      _ct=0;

37

38      _rc=find();

            ----

            71

ERROR 71-185: The FIND function call does not have enough arguments.

39

40      do while (_rc=0);

41

42         if .< zipcitydistance(zipcode, _zip)<=60 then _ct+1;

43

44   _rc=find_next();

         ---------

         68

ERROR 68-185: The function FIND_NEXT is unknown, or cannot be accessed.

45

46   end;

47

48       mutual_density=log(1+_ct);

49

50   run;

NOTE: The SAS System stopped processing this step because of errors.

WARNING: The data set GEOG.WANT may be incomplete.  When this step was stopped there were 0

         observations and 7 variables.

NOTE: DATA statement used (Total process time):

      real time           1.06 seconds

      cpu time            0.03 seconds

Thanks for help. I carefully study the data generated by submitting the program suggested by art297, the output is strange as firms in the same year have the same nutualnum and mutual_density. I came across the same problem too. The count function for firms within 60-mile radius is tricky, I re-calculated mutualnum using sum function instead as per follows:

proc sql;

create table mutualnum as

select fyear, tic, sum(0<=mutualdist<=60) as mutualnum1

from geog.fundcom

group by tic, fyear;

quit; run;

proc sql;

create table geog.mutualden as

select unique fyear, tic, mutualnum1, log(1+ mutualnum1) as mutual_density

from mutualnum

having not missing (calculated mutual_density);

quit;

quit; run;

The correct result can be achieved by using sum function.

I think I must carefully study why and in what circumstances that count function can be tricky?

Thank you.

mspak

Haikuo
Onyx | Level 15

oops. There were some obvious errors, I did not reference the hash name. Try this:

data want (drop=_:);

  if _n_=1 then do;

     set geog.mutualcom (obs=1 rename=(year=fyear zip=_zip));

dcl hash h(dataset:'geog.mutualcom(rename=(year=fyear zip=_zip))', multidata:'yes', hashexp:16);

h.definekey('fyear');

h.definedata('_zip');

h.definedone();

end;

set geog.comp;

   _ct=0;

   _rc=h.find();

   do while (_rc=0);

      if .< zipcitydistance(zipcode, _zip)<=60 then _ct+1;

_rc=h.find_next();

end;

    mutual_density=log(1+_ct);

run;

mspak
Quartz | Level 8

Hi Hai.Kuo,

Thanks and I will try again. I just download a few articles relating to thsi topics so that I can make use of this efficient function in future.

Have a nice day!!!

Time now at my place: 7.19pm

Regards,

MEI SEN

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1171 views
  • 6 likes
  • 3 in conversation