PROC SQL PROCESSING

mspak · Posted 03-11-2012 05:09 AM

Good days to all,

I recently used SQL to generate possible pairs of geograhic zipcodes between one location to another. I have about 13,000 firms per year for 8 years. Therefore, number of firm-years is estimated to be 104,000. Then, I have to calculate the distance between every firm-year to all of the mutual fund companies (which is about 17,000 firms for each year). By the end, I have to calculate mutual fund density, which is defined as log (1+mutual fund companies with distance within 60 miles radius from the firm).

The submitted program was run for 7 hours but I failed to generate output due to insufficient spaces. My notebook has free space about 365 GB but yet insufficient for storing the output. I am looking forward any response which can solve the program.

My concerns:

1. How to submit the program in batches?

2. How can I estimate the processing time. I couldn't find any indicator on the estimated time and storage needed for running programs.

3. Any recommendations and suggestions...etc.

The following is my program (with libname "geog"):

proc sql;

create table geog.fundcom as

select a.fyear, tic, crsp_fundn, zipcitydistance(zipcode, zip)as mutualdist

from geog.comp a , geog.Mutualcom b

where a.fyear=b.year

group by a.fyear,tic;

having not missing(calculated mutualdist);

quit;

proc sql;

create table geog.mutualnum as

select fyear, tic, sum(0<=mutualdist<=60) as mutualnum

from geog.fundcom

group by tic, fyear;

quit; run;

data geog.mutualdensity; set geog.mutualnum;

mutual_density = log (1+mutualnum);

run;

Thank you.

Mspak

art297 · Posted 03-11-2012 11:27 AM

Your code shouldn't have worked as posted The first proc sql had a semicolon after the group by clause. Plus, why create a big file only to eliminate the unnecessary records?

Doe the following accomplish the initial step you wanted?:

proc sql;

create table geog.fundcom as

select a.fyear,

tic,

crsp_fundn,

zipcitydistance(zipcode, zip)as mutualdist,

count(*) as mutualnum

from geog.comp a , geog.Mutualcom b

where a.fyear=b.year

group by a.fyear,tic

having calculated mutualdist between 0 and 60

;

quit;

mspak · Posted 03-12-2012 10:11 AM

Thanks art297 for your kind helps especially to me who is a new SAS user.

I ran the program and it proves successful. There are many obervations with different pairs of possible origin and destination. In fact I wish to have mutualnum for each tic-fyear. Therefore, the crsp_fundn is not important to be included. I have modified it to generate my desired output format.

I came across a data step approach which can significantly reduce the processing time, namely DATA Step Hash Object. I am not sure whether this method can be used for this example and can this method further reduces the data storage and time?

Have a nice day

Regards,

mspak

Haikuo · Posted 03-12-2012 10:41 AM

Hi Mspak,

My solution to your another question could be applied to this senario without many changes:

data want (drop=_:);

if _n_=1 then do;

set geog.mutualcom (obs=1 rename=(year=fyear zip=_zip));

dcl hash h(dataset:'geog.mutualcom(rename=(year=fyear zip=_zip))', multidata:'yes', hashexp:16);

h.definekey('fyear');

h.definedata('_zip');

h.definedone();

end;

set geog.comp;

_ct=0;

_rc=find();

do while (_rc=0);

if .< zipcitydistance(zipcode, _zip)<=60 then _ct+1;

_rc=find_next();

end;

mutual_density=log(1+_ct);

run;

of couse not tested, so my code is subjected to typo and other errors.

Regards,

Haikuo

Edit: Before running the code, I would clean up the tables to make sure all of the zip codes are legit per zipcitydistance().

mspak · Posted 03-12-2012 11:56 AM

Thanks HaiKuo for helps,

I will test run the program too. Learning SAS Programming with helps from all the experts here is fun . Hope that I can master this skill one day!!!

Goodnite (12 Marc 2012, 11.57pm)

Regards,

mspak

mspak · Posted 03-12-2012 11:26 PM

Hi art297 and Hai.kuo,

I tested program suggested by Hai.kuo, some error messages on log:

16 data geog.want (drop=_:);

17

18 if _n_=1 then do;

19

20 set geog.mutualcom (obs=1 rename=(year=fyear zip=_zip));

21

22 dcl hash h(dataset:'geog.mutualcom(rename=(year=fyear zip=_zip))', multidata:'yes',

22 ! hashexp:16);

23

24 h.definekey('fyear');

25

26 h.definedata('_zip');

27

28 h.definedone();

29

30 end;

31

32

33

34 set geog.comp;

35

36 _ct=0;

37

38 _rc=find();

----

71

ERROR 71-185: The FIND function call does not have enough arguments.

39

40 do while (_rc=0);

41

42 if .< zipcitydistance(zipcode, _zip)<=60 then _ct+1;

43

44 _rc=find_next();

---------

68

ERROR 68-185: The function FIND_NEXT is unknown, or cannot be accessed.

45

46 end;

47

48 mutual_density=log(1+_ct);

49

50 run;

NOTE: The SAS System stopped processing this step because of errors.

WARNING: The data set GEOG.WANT may be incomplete. When this step was stopped there were 0

observations and 7 variables.

NOTE: DATA statement used (Total process time):

real time 1.06 seconds

cpu time 0.03 seconds

Thanks for help. I carefully study the data generated by submitting the program suggested by art297, the output is strange as firms in the same year have the same nutualnum and mutual_density. I came across the same problem too. The count function for firms within 60-mile radius is tricky, I re-calculated mutualnum using sum function instead as per follows:

proc sql;

create table mutualnum as

select fyear, tic, sum(0<=mutualdist<=60) as mutualnum1

from geog.fundcom

group by tic, fyear;

quit; run;

proc sql;

create table geog.mutualden as

select unique fyear, tic, mutualnum1, log(1+ mutualnum1) as mutual_density

from mutualnum

having not missing (calculated mutual_density);

quit;

quit; run;

The correct result can be achieved by using sum function.

I think I must carefully study why and in what circumstances that count function can be tricky?

Thank you.

mspak

Haikuo · Posted 03-13-2012 07:06 AM

oops. There were some obvious errors, I did not reference the hash name. Try this:

data want (drop=_:);

if _n_=1 then do;

set geog.mutualcom (obs=1 rename=(year=fyear zip=_zip));

dcl hash h(dataset:'geog.mutualcom(rename=(year=fyear zip=_zip))', multidata:'yes', hashexp:16);

h.definekey('fyear');

h.definedata('_zip');

h.definedone();

end;

set geog.comp;

_ct=0;

_rc=h.find();

do while (_rc=0);

if .< zipcitydistance(zipcode, _zip)<=60 then _ct+1;

_rc=h.find_next();

end;

mutual_density=log(1+_ct);

run;

mspak · Posted 03-13-2012 07:19 AM

Hi Hai.Kuo,

Thanks and I will try again. I just download a few articles relating to thsi topics so that I can make use of this efficient function in future.

Have a nice day!!!

Time now at my place: 7.19pm

Regards,

MEI SEN