BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
LanMin
Fluorite | Level 6

Good morning,

******************************

Below is a display of my data (just showing a few observations for a large data set).

gvkey  year acode

1000 1990 10

1000 1990 20

1000 1991 10

1000 1991 11

1000 1991 12

1001 1990 10

1001 1990 15

1001 1991 11

1001 1991 13

1001 1991 14

Gvkey is firm identifier, acode is the analyst identifier, for firm 1000, it is followed by 2 financial analysts each is identified by its unique acode.

I want to compute how many firms each analyst cover in each year, expected outcome : for year 1990, acode=10, coverage=2 (it covers 1000, 1001 firms), for year 1990, acode=20, coverage=1. Etc.

I tried

proc freq data=a;

tables acode*year /out=counts;

run;

this will not work given that my data has 4 million records, computer froze while running.

can you suggest another way?

thanks,

Lan

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

You can use NOPRINT option to prevent the output that might be over whelming your computer.

You can use PROC SUMMARY to count.

You can SORT first to eliminate the need to store in memory all possible combinations.


proc summary data=have nway ;

  by year acode ;

  output out=want (rename=(_freq_=count)) ;

run;

View solution in original post

12 REPLIES 12
naveen_srini
Quartz | Level 8

data have;

input gvkey  year acode;

datalines;

1000 1990 10

1000 1990 20

1000 1991 10

1000 1991 11

1000 1991 12

1001 1990 10

1001 1990 15

1001 1991 11

1001 1991 13

1001 1991 14

;

proc sort data=have out=h;

by year acode;

run;

data want;

set h;

by year acode;

if first.year then call missing(coverage);

if first.acode then coverage=0;

coverage+1;

if last.acode then output;

run;

HTH,

Naveen Srinivasan

L&T Infotech

Tom
Super User Tom
Super User

proc freq;

tables year*acode / output=want ;

run;

LanMin
Fluorite | Level 6

Hi Tom,

Your code is the same as my original post. Like I said when I ran it, with 4 million observations, computer froze, I wonder if some changes needed .

thanks,

Lan

naveen_srini
Quartz | Level 8

Well, Tom uses the year variable as row and acode as column which is the opposite of your code to make a nice cross tab , which I believe is what you are after. Make that change as per tom's and run again. 4 million obs aint that big at all.

Tom
Super User Tom
Super User

You can use NOPRINT option to prevent the output that might be over whelming your computer.

You can use PROC SUMMARY to count.

You can SORT first to eliminate the need to store in memory all possible combinations.


proc summary data=have nway ;

  by year acode ;

  output out=want (rename=(_freq_=count)) ;

run;

PGStats
Opal | Level 21

Try

proc sql;

create table want as

select acode, year, count(gvkey) as coverage

from have

group by acode, year;

quit;

Any performance improvement will require indexes being defined on your data table.

PG

PG
art297
Opal | Level 21

I disagree! With 4 million records, the number of combinations could easily require proc freq to need more memory than you might have.  I'd use Naveen's suggested data step method. You should just change the sort and by variables to match the order you want in your resultant file. The following eliminates one unnecessary step in his suggested code. e.g.:

proc sort data=have out=h;

  by acode year;

run;

data want;

  set h;

  by acode year;

  if first.year then coverage=1;

  else coverage+1;

  if last.year then output;

run;

LanMin
Fluorite | Level 6

I want to thank naveen, Arthur, PGstats, and Tom. All of your suggestions work, in my mind, they are all correct.

When I use proc freq as Tom suggested (without using noprint option). the computer was overwhelmed.

The last post by Tom (using proc summary ) is the fastest way for my purpose. I compared the datasets using both Arthur's code and Tom's , they produce the same results.

now I have a data set with

cusip  year acode count

they are firm identifier, year, analyst identifier, and  the number of firms covered by each analyst.

I want to compute the median value of the number of firms covered by each analyst by each firm.

e.g.

cusipyearacodecount
100119901000012
100119901001013
100119901002014

then I expect the median =13,

is this correct way to do it ?

proc means data=have noprint;

   class cusip year;

   output out=stats;

   output out=median median=;

   run;

Thanks,

Lan

art297
Opal | Level 21

Lan: In the future you really should ask such questions by starting a new discussion.

You've changed the names of your variables, as well as the level of data you are trying to analyze.  According to your latest example, you are trying to analyze COUNTS (are they what were previously labeled COVERAGE?), for each CUSIP (are they what were previously called GVKEY?), for each year and ACODE.

That isn't what your first question produced.

Proc means and proc summary use the exact same algorithm but, since proc summary doesn't produce any listing, NOPRINT wouldn't be needed with it. Two critical things you are missing are the NWAY option and the identification of an analysis variable.

If your data (I'll refer to the file as WANT as that is what the first step created) are already sorted, you would be better off using a BY statement rather than a CLASS statement. Both will accomplish the same thing but, if your data are already in the right order, the BY statement will cause the procedure to run faster.

proc summary data=want nway;

  by cusip year;

  var count;

  output out=median (drop=_:) median=;

run;

LanMin
Fluorite | Level 6

Thank you very much, Arthur.

I apologize for not practicing the best policy in terms of posting. In the future, I will separate my questions into new posts. I will also be consistent with variable naming.

Your code works great.

Sincerely,

Lan

naveen_srini
Quartz | Level 8

Hi Lan, It would make it convenient for community members to search through the forum if you break your questions to new discussions that warrants a new subject or a new solution.

Anyway, no worries, Continuing on  Art's code from the above, do another sort and typical transpose :

proc sort data=have out=h;

  by acode year;

run;

data want;

  set h;

  by acode year;

  if first.year then coverage=1;

  else coverage+1;

  if last.year then output;

run;

proc sort data=want out=w;

by year;

run;

data w;

array a(25) _temporary_ ;

call missing(of a{*});

do _n_= 1 by 1 until(last.year);

set w;

by year;

a(_n_)=coverage;

med_value=median(of a{*});

end;

run;

Sorry, i am too lazy to look through your "renamed additional requirement", so just took the names as it was in your original question.

LanMin
Fluorite | Level 6

Thanks so much to you Naveen as well !!!   I am not familiar with array, will try to learn more.

Sincerely,

Lan

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 27606 views
  • 7 likes
  • 5 in conversation