BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Theo_Gh
Obsidian | Level 7

Hello everyone,

 

I'm trying to merge compustat global annual data with compustat global security data by gvkey. The codes I have used so far don't work. 

Your assistance would be very much appreciated.

 

Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

Your "want" dataset has 7.8M records, but  the security dataset had 8.6M.  That means there are .8M security records for which there were no matching annual records.  I presume that is unsurprising, right?

 

What does your other annual data look like? Does it have gvkey? You mention fyear (fiscal year).  Are you saying that your other annual data does not have datadate?

 

Why do you think that sorting by fyear is neccessary? If your current data is sorted by gvkey / datadate, is it not likely to also already be sorted by gvkey/fiscal year?  You can test this by this simple step:

 

  data _null_;

    set want ;

    by gvkey fyear;

  run;

 

If SAS produces no "out of order" message on the log for this program, you have demonstrated that the data is already sorted.

 

Also maybe you should have merged all the annual records first (smaller N), then merge with the daily data.

 

PS: For latecomers to this topic, here's the link to the code that answered the topic question:

https://communities.sas.com/t5/General-SAS-Programming/merging-compustat-global-annual-data-and-comp...

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

24 REPLIES 24
Theo_Gh
Obsidian | Level 7

As I indicated, I used the wrong identifier so I did check my data and did the right thing thanks to you guys.

 

On this issue, I have actually succeeded in merging them but I'm not sure if it is correct. I used these codes:

 

proc sort data=data1;

    by gvkey;

run;

proc sort data=data2;

    by gvkey;

run;

data data3;

    merge data1(in=a) data2(in=b);

    by gvkey;

    if b;

run;

 

Thanks

Kurt_Bremser
Super User

First of all, take a look at the log. There should be no WARNINGs or ERRORs. The NOTEs should only be of the nature "xxx records read from ..." and "the dataset .... has xxxx records and yyy variables".

Crucial for a datastep merge is the relationship between the datasets. It should be 1:1 or 1:n, m:n are probably handled better with SQL.

So if you have any doubts, post the log.

Shmuel
Garnet | Level 18

If both data1 and data2 have same gvkey more than once ( relation n:m where n>1 and m>1)

you may have issues with the merge.

 

If in any of the two datas each gvkey occures once only it will work perfectly.

 

In Case of N:M relation better do:

proc sql;
     create table want as select *
     from data1 as a
     join data2 as b
     on a.gvkey = b.gvkey;
quit;

In case not all keys in data1 exsist in data2 or vice-versa then you may need

use inner join or full join.

See next PDF documentation:

https://www.google.co.il/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0ahUKEwiMkqTum4rSAhWIWxQKHWsdB...

Theo_Gh
Obsidian | Level 7

Thank you very much.

 

I'm posting the log here; i got no error per se. Please ignore the data set definitions. 

 

My next challenge is to calculate some variables at the end of the year fiscal year. My question is how to set the date right.

Thank you.

 

 

13   proc sort data=theo.to_be_merged;

14       by gvkey;

15   run;

 

NOTE: Input data set is already sorted, no sorting done.

NOTE: PROCEDURE SORT used (Total process time):

      real time           0.00 seconds

      cpu time            0.00 seconds

 

 

16   proc sort data=theo.security_daily;

17       by gvkey;

18   run;

 

NOTE: Input data set is already sorted, no sorting done.

NOTE: PROCEDURE SORT used (Total process time):

      real time           0.00 seconds

      cpu time            0.01 seconds

 

 

19   data data3;

20       merge theo.to_be_merged(in=a) theo.security_daily(in=b);

21       by gvkey;

22       if b;

23   run;

 

NOTE: MERGE statement has more than one data set with repeats of BY values.

NOTE: There were 2357018 observations read from the data set THEO.TO_BE_MERGED.

NOTE: There were 86856448 observations read from the data set THEO.SECURITY_DAILY.

NOTE: The data set WORK.DATA3 has 86857588 observations and 38 variables.

NOTE: DATA statement used (Total process time):

      real time           7:48.17

      cpu time            53.68 seconds

mkeintz
PROC Star

You are merging annual with daily data. Given that it's Compustat, both dataset probably have identifiers gvkey and datadate, where datadate is the last date of the fiscal year for annual data,, and day of trading for the security data.

 

Question:  do you want to merge the security data with the immediately PRECEDING annual data, or the FOLLOWING annual data?

 

And also, are you going to allow for the fact that accounting data is not usually reported until about 45 days after the fiscal year?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Reeza
Super User

I don't know if it's correct but I would hesitate to say it was.  

 

Look at your last step and notice how your final results have more than either of your original tables. You need to figure out why this occurs, where and if it makes sense given your data. 

mkeintz
PROC Star

Why is there only about 43 daily records per annual record?  There are typiically 200 trading days per year.  Do you have instance of annual accounting data for GVKEY's that do no appear in the daily security data file? If so, what do you want to do about it?

 

Your program,merging by gvkey, will not produce what you want, unless (for those gvkey's in both datasets) you have precisely the same series of datadates in both datasets.

 

Here is a program that outputs every daily record (from dataset SEC) merged with the closest following annual record (dataset ANN).  Both files are sorted by gvkey datadate.

 

data want;
  do until (ina1 or last.gvkey);
    set sec ann (in=ina1);
    by gvkey datadate;
  end;

  do until (ina2 or last.gvkey);
    set sec (in=ins) ann (in=ina2 keep=gvkey datadate);
    by gvkey datadate;
    if ins=1 and ina1=1 then output;
  end;
run;

Notes:

 

  1. The first "do until" reads all the records until an annual record.This gets the annual record variables into the program data vector(pdv).
  2. The second "do until" rereads all the same records.
    1. Note there is a "keep=gvkey datadate" for the ANN dataset.  Otherwise all the other annual vars read in by the first do group would be reset to missing with every incoming record, whether from SEC or ANN.  And since SEC doesn't have those variables they would stay missing. The "keep=" parameter tells SAS not to revise ANN vars not in the keep list.
    2. Use a new dummy var INA2, because dummy INA1 is still being used in this do group.
  3. Only incoming records coming from the SEC dataset (INS=1), and for which there was a qualifying ANN record found in the first "do until" (INA1=1) are written out.  This eliminates all cases of gvkey in SEC but not not having corresponding ANN records.
  4. If you have, say, two years of SEC records but only one ANN record at the end of the two years, and if you want only the 2nd year of SEC data, then  this modification is needed, that uses a date_cutoff of 1 year prior to the annual datadate:

 

 

data want;
  do until (ina1 or last.gvkey);
    set sec ann (in=ina1);
    by gvkey datadate;
  end;

  if ina1 then date_cutoff=intnx('year',datadate,-1,'same')+1;

  do until (ina2 or last.gvkey);
    set sec (in=ins) ann (in=ina2 keep=gvkey datadate);
    by gvkey datadate;
    if ins=1 and ina1=1 and datadate>=date_cutoff then output;
  end;
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Theo_Gh
Obsidian | Level 7

Thank you very much.I did run your codes and got this:
NOTE: There were 86856448 observations read from the data set THEO.COPY_SECURITY.
NOTE: There were 373633 observations read from the data set THEO.COPY_COMPUSTAT.
NOTE: There were 86856448 observations read from the data set THEO.COPY_SECURITY.
NOTE: There were 373633 observations read from the data set THEO.COPY_COMPUSTAT.
NOTE: The data set WORK.WANT has 78137460 observations and 45 variables.

 

I noticed that the annual data has been matched against the daily data so that's good. But I have to merge the 'want' data set with other annual data sets. Will a simple proc sort by fyear help?

 

Thank you once again for your time and help !

 

mkeintz
PROC Star

Your "want" dataset has 7.8M records, but  the security dataset had 8.6M.  That means there are .8M security records for which there were no matching annual records.  I presume that is unsurprising, right?

 

What does your other annual data look like? Does it have gvkey? You mention fyear (fiscal year).  Are you saying that your other annual data does not have datadate?

 

Why do you think that sorting by fyear is neccessary? If your current data is sorted by gvkey / datadate, is it not likely to also already be sorted by gvkey/fiscal year?  You can test this by this simple step:

 

  data _null_;

    set want ;

    by gvkey fyear;

  run;

 

If SAS produces no "out of order" message on the log for this program, you have demonstrated that the data is already sorted.

 

Also maybe you should have merged all the annual records first (smaller N), then merge with the daily data.

 

PS: For latecomers to this topic, here's the link to the code that answered the topic question:

https://communities.sas.com/t5/General-SAS-Programming/merging-compustat-global-annual-data-and-comp...

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Theo_Gh
Obsidian | Level 7

Thank you very much for your time and patience. I don't know what I would done without this platform.

 

No, it's not surprising that the daily data is more than the annual data. From the information I have not all the firms are actively trading on the stock exchange.

Yes, I can merge the annual data before merging with the daily data. My question still is: take for instance that I want the market value as of the end of a particular year. If I still simply multiply share price by number of shares outstanding, I'm going to have different values since the daily stock prices have been matched against the annual shares outstanding. In this particular eg i want the share price at the end of the fiscal year times the closest annual number of shares outstanding (which should give me one value). I guess my question is : How do I need to do so that after matching annual and daily data sets I would get one value for an answer ( which should be the end of fiscal year value) and not multiple values, which should be the same nonetheless.

 

I count on your usual corporation

mkeintz
PROC Star

@Theo_Gh wrote:

Thank you very much for your time and patience. I don't know what I would done without this platform.

 

No, it's not surprising that the daily data is more than the annual data. From the information I have not all the firms are actively trading on the stock exchange.

 

I was asking about the LOSS of .8M SECURITY records, which (I presume) represents daily prices on public stock exchanges. They are ignored by the program because you have no annual data to match them. That's the opposite of your explanation (inactive stock trading). Since the annual data presumably comes from SEC filings extracted by Compustat, either (1) SEC  does not have that company's filings (which I believe is not possible for publicly traded firms), or (2) Compustat did not retrieve those filings, or (3) your lost SEC records are recent, and the corresponding SEC filing has yet to be filed or processed by Compustat.  My question was whether that indicated a possible research problem in the data your are using.

 

The annual Compustat data has the number of outstanding shares at the end of the fiscal year (CSHO - i.e. on the DATADATE of the annual record.  Of course the problem is that there is no guarantee that shares outstanding were constant throughout the fiscal year.

 

However, the standard Compustat securities daily file has not only closing price variable (PRCCD - "Price Closing - Daily"), but also apparently daily common shares outstanding (CSHOC - "Common Shares Outstanding - Current").  Does your daily file have these variables? That will solve the problem of calculating marekt value.

 

BTW, please realize there can be more than one security stored under the same GVKEY (i.e. a company may have multiple classes of stocks on the markets).  These are distinguished by the variable IID (Issue ID).  You should make sure your securities file only has the IID of the standard common stock of the company.

 

Apologies to forum participants for getting into Compustat-specific issues so deeply, but the problems here have structural issues that can be generally informative.

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Theo_Gh
Obsidian | Level 7
Thank you.
No, current file does not have them but I'm downloading them now (will take a while; internet is slow). I will do as you have advised and get back. Thank you.
Theo_Gh
Obsidian | Level 7

I have downloaded the daily security data with IID, PRCCD,CSHOC so I can calculate the daily market value of equity. However, the annual compustat data does not have iid. I'm interested in the market value of equity at the end of the fiscal year or the nearest. If I'm able to get these to match the annual compustat data, it should not present any serious research problem. No? 

 

Any way, I tried using the datadata and SEDOL( The Stock Exchange Daily Official List(SEDOL) Code identifies issues) and got similar results as with gvkey.

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 24 replies
  • 7747 views
  • 2 likes
  • 5 in conversation