Re: Extreme data and transform from daily to yearly data

Phil_NZ · Posted 07-12-2020 08:10 PM

Hi SAS community,
Have an awesome week!

I face an issue when dealing with extreme data as described below:

I have the data as below:

Type date p_us up_us nosh trading_vol amihud
134495	01JAN1987	.	.	.	0	.
134495	02JAN1987	.	.	.	.	.
134495	05JAN1987	.	.	.	.	.
.
.
134495	27DEC2019	1.02	1.02	14632	.	.
134495	30DEC2019	0.99	0.99	14632	.	.
134495	31DEC2019	0.99	0.99	14632	.	.
.
.
992438	01JAN1987	.	.	.	0	.
992438	02JAN1987	.	.	.	.	.
992438	05JAN1987	.	.	.	.	.
.
.
992438	27DEC2019	0.01	0.01	9000	.	.
992438	30DEC2019	0.01	0.01	9000	.	.
992438	31DEC2019	0.01	0.01	9000	.	.

While Type is the name of stocks (character variable- around 8,000 firms), date is ddmmyyyy9. (daily data from 1/1/1987 to 31/12/2019), and others are numeric variables.

It is my code when I calculate the average of "amihud" variable based on firm-year level:

*I got year by using the code year= year(date);
 proc sort data=amihud_;
by Type year;
run;

proc means data=amihud_ mean noprint;
var amihud;
by Type year;
output out=amihud_f n=obs mean=amihud;
run;

And I have results as below:

Type year _TYPE_ _FREQ_ obs amihud
134495	1987	0	261	0	.
134495	1988	0	261	0	.
134495	1989	0	260	0	.
134495	1990	0	261	0	.
.
.
134495	2016	0	261	15	0.0002618879
134495	2017	0	260	20	0.0002209701
134495	2018	0	261	4	0.0002583137
134495	2019	0	261	0	.
.
.
992438	1987	0	261	0	.
992438	1988	0	261	0	.
992438	1989	0	260	0	.
992438	1990	0	261	0	.
992438	1991	0	261	0	.
992438	1992	0	262	0	.
992438	1993	0	261	72	0.0000721906
.
.
992438	2017	0	260	0	.
992438	2018	0	261	0	.
992438	2019	0	261	0	.

But now, I need to add more conditions to my calculation including: (1) removing stock-year observations if this stock has p_us at the end of the
previous year in the top or bottom 1% of the cross section; (2) drop stock-year observations if this stock does not have data on variable "nosh" at the end of previous year; and (3) excluding any year with fewer than 10 stocks (column Type).

Many thanks and best regards.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

Reeza · Posted 07-12-2020 08:47 PM

So if that's your intput, what would you want as the output? Does your shown data have all the use cases you've described below?

1) removing stock-year observations if this stock has p_us at the end of the
previous year in the top or bottom 1% of the cross section; (what defines a 'cross section')?
(2) drop stock-year observations if this stock does not have data on variable "nosh" at the end of previous year; and (3) excluding any year with fewer than 10 stocks (column Type).

Phil_NZ · Posted 07-12-2020 10:48 PM

Hi @Reeza ,

Thank you for your comment, I try to make it more clear as below:

1. "Cross section" means that " we examine the values of different subjects (distinct Types in my case) at the same time (year in my required output)". For example, in my case, in a specific year, for example, regarding year 1988, I will sort the p_us on 31 Dec of the year 1987 of all stocks. And then, in the year 1988, I will exclude the stocks that have the p_us in year 1987 in top or bottom 1% of p_us values on 31dec of year 1987 of the whole sample. And doing similar progress to all other years (1987-2019).

And the data above does not contain the use cases I described above, I just simply describe my dataset structure.

I hope it is a little bit clearer now.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

mkeintz · Posted 07-12-2020 08:59 PM

@Phil_NZ wrote:

But now, I need to add more conditions to my calculation including: (1) removing stock-year observations if this stock has p_us at the end of the
previous year in the top or bottom 1% of the cross section; (2) drop stock-year observations if this stock does not have data on variable "nosh" at the end of previous year; and (3) excluding any year with fewer than 10 stocks (column Type).

So you want (is this right?):

Exclude in year t all stocks that had P_US on 31dec of year t-1 in top percentile of P_US values on 31dec of year t-1
Exclude in year t all stocks that have missing NOSH on 31dec of year t-1

Notice I am interpreting your phrase "end of previous year" as 31dec of year t-1, even if there is no stock trading on 31dec.

And by "excluding any year with fewer than 10 stocks", do you mean

Any year with fewer the 10 distinct values of TYPE, use no data from any stock for that year (and how does such a thing happen if you are tracking 8,000 stocks?).
or
exclude any stock from year t if that particular stock has fewer than 10 dates observed in year t
or
Something else

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Phil_NZ · Posted 07-12-2020 09:36 PM

Hi @mkeintz and @Reeza .
Sorry for getting your wires crossed.

So, my aim here is to have the results of the average of "amihud" as calculated by the code above from the given dataset, but with considering these three more conditions apart from my average calculation above. I mean, I want to have the results similar to the results above, but taking considering to the three given further conditions.

Regarding condition (1) and (2), it is similar to what @mkeintz mentioned, but I need to adjust a little bit about the order of the variables:

1. Exclude the stock-year observation if, in this year, the stock had P_US on 31dec of year t-1 in the top or bottom 1% of P_US values on 31dec of year t-1 of all stocks, or "amihud" of this stock in this year is in top 1% of "amihud" of all stock in this year.
2. Exclude the stock-year observation if, in this year, the stocks have missing NOSH on 31dec of year t-1
3. And regarding the condition (3): exclude any year if, in this year, there are fewer than 10 stocks (10 distinct values of TYPE, as @mkeintz mentioned, thank you for your clarification again) satisfying these two above conditions.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

Phil_NZ · Posted 07-13-2020 05:40 AM

I also intend to import my whole data co0ntain all the cases as @Reeza mentioned but our forum just allows a maximum of 5 MB of attachments, so I am not able to do so. But the structure of the data is quite similar like I describe above. I am wondering if we can post a google drive link or the like to our forum.

Many thanks and best regards.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

mkeintz · Posted 07-13-2020 02:56 PM

@Phil_NZ

We don't need (and wouldn't want) the complete data. You can easily satisfy @Reeza 's request by providing a subset of, say 30 TYPES for 2 or 3 years. It's certainly a lot easier to help someone with coding when there is data to use for testing our responses.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Phil_NZ · Posted 07-13-2020 04:48 PM

Thank you very much @mkeintz and @Reeza for your suggestions and explanation,

After receiving your explanation, I included a sample of 80 distinct firms in my dataset with the daily data from 1/1/2016 to 31/12/2019 (the reason to choose such period but not the earlier period is because there are more observations during the later period of time that we can have a more visible look on the dataset).

I attached as the datatype sas7bdat, I am not sure whether it suits you, please let me know if any adjustments required.

Hi @Reeza , thank you for your concern, and I can say this dataset is not from WDRS, and it is okay to be shared. By the way, it is just a very small part of my dataset.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

Reeza · Posted 07-13-2020 03:02 PM

If this is WRDS data you're also not allowed to share it like that...

Phil_NZ · Posted 07-13-2020 06:54 PM

Offtrack a little bit, based on the data given above, do you know how can I adjust the code to achieve three other columns in the output called p_us,up_us, and nosh where p_us, up_us, and nosh are the value of p_us, up_us, and nosh of the last day of the year associated with amihud.

To be more clear, from my data input (file data_almihud_) attached below, when running the code as below:

*I got year by using the code year= year(date);
 proc sort data=amihud_;
by Type year;
run;

proc means data=amihud_ mean noprint;
var amihud;
by Type year;
output out=amihud_f n=obs mean=amihud;
run;

I have the result as below:

Type year _TYPE_ _FREQ_ obs amihud
134495	2016	0	261	15	0.0002618879
134495	2017	0	260	20	0.0002209701
134495	2018	0	261	4	0.0002583137
134495	2019	0	261	0	.
.
.
992438	2017	0	260	0	.
992438	2018	0	261	0	.
992438	2019	0	261	0	.

I am wondering what should I add into this code to have three other columns in the output called p_us,up_us, and nosh where p_us, up_us, and nosh are the values of p_us, up_us, and nosh of the last day of every year from the input data(associated with variable amihud)

I mean, I want the output as below

Type    year   _TYPE_ _FREQ_  obs       amihud         p_us   up_us   nosh
134495	2016	0	261	      15	   0.0002618879
134495	2017	0	260	      20	   0.0002209701
134495	2018	0	261	      4	       0.0002583137
134495	2019	0	261	      0	       .
. 
.
992438	2017	0	260	      0	      .
992438	2018	0	261	      0	      .
992438	2019	0	261	      0	      .

I suggest this way of doing because I think that after we have such results containing variables "amihud", "p_us", "up_us", and "nosh", it is easier and clearer to generate the code satisfying the three conditions I mentioned above.

Many thanks and crossing my fingers.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

mkeintz · Posted 07-13-2020 09:26 PM

There is too much mission creep here.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Phil_NZ · Posted 07-13-2020 09:30 PM

Hi everyone, under your encouragement and else, I found out the way out to make my dataset more clear and obtaining "clear" data as below.

So, my dataset "have" is as attached below (containing 2794 distinct companies (column TYPE)).

A quick description is as below:

Type year obs amihud up_us nosh vo p_us 
13016K 1987 0 . . . . .
13016K 1988 0 . . . . .
13016K 1989 0 . . . . .
13016K 1990 0 . . . . .
13016K 1991 0 . . . . .
13016K 1992 0 . . . . .
13016K 1993 0 . . . . .
13016K 1994 0 . . . . .
13016K 1995 0 . . . . .
13016K 1996 0 . . . . .
13016K 1997 0 . . . . .
13016K 1998 0 . . . . .
13016K 1999 0 . . . . .
13016K 2000 0 . 0.2945 40120 0.3 42.9638
13016K 2001 0 . 0.1945 40120 . 28.3783
13016K 2002 0 . 0.1802 45558 0.7 26.2883
13016K 2003 0 . 0.2712 45908 . 39.5715
13016K 2004 0 . 0.1568 45908 . 22.8729
13016K 2005 0 . 0.0337 45908 . 4.9228
13016K 2006 0 . 0.0394 45908 1.4 5.7496
13016K 2007 0 . 0.0483 97085 . 7.4443
13016K 2008 0 . 0.0349 100233 . 5.3736
13016K 2009 0 . 0.0351 120065 . 5.5743
13016K 2010 0 . 0.0205 152711 . 3.2582
13016K 2011 0 . 0.0195 308196 0.5 3.8958
13016K 2012 0 . 0.0093 308196 . 1.8687
13016K 2013 0 . 0.0018 308196 . 0.3579
13016K 2014 0 . 0.0368 315410 8.9 0.3683
13016K 2015 0 . 0.0255 1077004 . 0.2546
13016K 2016 4 0.0005137639 0.2498 159721 20.3 0.2498
13016K 2017 9 0.0000546129 0.1799 161321 198.4 0.1799
13016K 2018 2 0.0002143573 0.1619 161321 50 0.1619
13016K 2019 2 0 0.1687 177464 . 0.1687
130347 1987 0 . . . . .
130347 1988 0 . . . . .
130347 1989 0 . . . . .
130347 1990 0 . . . . .
130347 1991 0 . . . . .
130347 1992 0 . . . . .
130347 1993 0 . . . . .
130347 1994 47 0.0000738433 0.7 43242 22.5 0.64
130347 1995 35 0.0000662663 0.65 43242 . 0.59
130347 1996 25 0.0000750143 1.08 43242 566.1 0.99
130347 1997 44 0.0001311861 1.11 66105 . 1.11

So, now, the questions (under the bits of help in clarification of @mkeintz and @Reeza ) become:

From the dataset "have" added, we:

Exclude in year t the stocks (column TYPE) that had P_US of year t-1 in the top 1% or bottom 1% of P_US values of all stocks in year t-1.
Exclude in year t the stocks that had missing NOSH in year t-1
Exclude any year if, in this year, there are fewer than 10 stocks (10 distinct values of TYPE) satisfying these two above conditions

Thanks in advance and best regards.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away