Good days to all,
I recently used SQL to generate possible pairs of geograhic zipcodes between one location to another. I have about 13,000 firms per year for 8 years. Therefore, number of firm-years is estimated to be 104,000. Then, I have to calculate the distance between every firm-year to all of the mutual fund companies (which is about 17,000 firms for each year). By the end, I have to calculate mutual fund density, which is defined as log (1+mutual fund companies with distance within 60 miles radius from the firm).
The submitted program was run for 7 hours but I failed to generate output due to insufficient spaces. My notebook has free space about 365 GB but yet insufficient for storing the output. I am looking forward any response which can solve the program.
My concerns:
1. How to submit the program in batches?
2. How can I estimate the processing time. I couldn't find any indicator on the estimated time and storage needed for running programs.
3. Any recommendations and suggestions...etc.
The following is my program (with libname "geog"):
proc sql;
create table geog.fundcom as
select a.fyear, tic, crsp_fundn, zipcitydistance(zipcode, zip)as mutualdist
from geog.comp a , geog.Mutualcom b
where a.fyear=b.year
group by a.fyear,tic;
having not missing(calculated mutualdist);
quit;
proc sql;
create table geog.mutualnum as
select fyear, tic, sum(0<=mutualdist<=60) as mutualnum
from geog.fundcom
group by tic, fyear;
quit; run;
data geog.mutualdensity; set geog.mutualnum;
mutual_density = log (1+mutualnum);
run;
Thank you.
Mspak
Your code shouldn't have worked as posted The first proc sql had a semicolon after the group by clause. Plus, why create a big file only to eliminate the unnecessary records?
Doe the following accomplish the initial step you wanted?:
proc sql;
create table geog.fundcom as
select a.fyear,
tic,
crsp_fundn,
zipcitydistance(zipcode, zip)as mutualdist,
count(*) as mutualnum
from geog.comp a , geog.Mutualcom b
where a.fyear=b.year
group by a.fyear,tic
having calculated mutualdist between 0 and 60
;
quit;
Thanks art297 for your kind helps especially to me who is a new SAS user.
I ran the program and it proves successful. There are many obervations with different pairs of possible origin and destination. In fact I wish to have mutualnum for each tic-fyear. Therefore, the crsp_fundn is not important to be included. I have modified it to generate my desired output format.
I came across a data step approach which can significantly reduce the processing time, namely DATA Step Hash Object. I am not sure whether this method can be used for this example and can this method further reduces the data storage and time?
Have a nice day
Regards,
mspak
Hi Mspak,
My solution to your another question could be applied to this senario without many changes:
data want (drop=_:);
if _n_=1 then do;
set geog.mutualcom (obs=1 rename=(year=fyear zip=_zip));
dcl hash h(dataset:'geog.mutualcom(rename=(year=fyear zip=_zip))', multidata:'yes', hashexp:16);
h.definekey('fyear');
h.definedata('_zip');
h.definedone();
end;
set geog.comp;
_ct=0;
_rc=find();
do while (_rc=0);
if .< zipcitydistance(zipcode, _zip)<=60 then _ct+1;
_rc=find_next();
end;
mutual_density=log(1+_ct);
run;
of couse not tested, so my code is subjected to typo and other errors.
Regards,
Haikuo
Edit: Before running the code, I would clean up the tables to make sure all of the zip codes are legit per zipcitydistance().
Thanks HaiKuo for helps,
I will test run the program too. Learning SAS Programming with helps from all the experts here is fun . Hope that I can master this skill one day!!!
Goodnite (12 Marc 2012, 11.57pm)
Regards,
mspak
Hi art297 and Hai.kuo,
I tested program suggested by Hai.kuo, some error messages on log:
16 data geog.want (drop=_:);
17
18 if _n_=1 then do;
19
20 set geog.mutualcom (obs=1 rename=(year=fyear zip=_zip));
21
22 dcl hash h(dataset:'geog.mutualcom(rename=(year=fyear zip=_zip))', multidata:'yes',
22 ! hashexp:16);
23
24 h.definekey('fyear');
25
26 h.definedata('_zip');
27
28 h.definedone();
29
30 end;
31
32
33
34 set geog.comp;
35
36 _ct=0;
37
38 _rc=find();
----
71
ERROR 71-185: The FIND function call does not have enough arguments.
39
40 do while (_rc=0);
41
42 if .< zipcitydistance(zipcode, _zip)<=60 then _ct+1;
43
44 _rc=find_next();
---------
68
ERROR 68-185: The function FIND_NEXT is unknown, or cannot be accessed.
45
46 end;
47
48 mutual_density=log(1+_ct);
49
50 run;
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set GEOG.WANT may be incomplete. When this step was stopped there were 0
observations and 7 variables.
NOTE: DATA statement used (Total process time):
real time 1.06 seconds
cpu time 0.03 seconds
Thanks for help. I carefully study the data generated by submitting the program suggested by art297, the output is strange as firms in the same year have the same nutualnum and mutual_density. I came across the same problem too. The count function for firms within 60-mile radius is tricky, I re-calculated mutualnum using sum function instead as per follows:
proc sql;
create table mutualnum as
select fyear, tic, sum(0<=mutualdist<=60) as mutualnum1
from geog.fundcom
group by tic, fyear;
quit; run;
proc sql;
create table geog.mutualden as
select unique fyear, tic, mutualnum1, log(1+ mutualnum1) as mutual_density
from mutualnum
having not missing (calculated mutual_density);
quit;
quit; run;
The correct result can be achieved by using sum function.
I think I must carefully study why and in what circumstances that count function can be tricky?
Thank you.
mspak
oops. There were some obvious errors, I did not reference the hash name. Try this:
data want (drop=_:);
if _n_=1 then do;
set geog.mutualcom (obs=1 rename=(year=fyear zip=_zip));
dcl hash h(dataset:'geog.mutualcom(rename=(year=fyear zip=_zip))', multidata:'yes', hashexp:16);
h.definekey('fyear');
h.definedata('_zip');
h.definedone();
end;
set geog.comp;
_ct=0;
_rc=h.find();
do while (_rc=0);
if .< zipcitydistance(zipcode, _zip)<=60 then _ct+1;
_rc=h.find_next();
end;
mutual_density=log(1+_ct);
run;
Hi Hai.Kuo,
Thanks and I will try again. I just download a few articles relating to thsi topics so that I can make use of this efficient function in future.
Have a nice day!!!
Time now at my place: 7.19pm
Regards,
MEI SEN
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.