Creating dummy variables

Accepted Solution Solved
Reply
Regular Contributor
Posts: 183
Accepted Solution

Creating dummy variables

Hi,

I need to create dummy variables for each of the query from the following dataset.

1. If an insiderID trades in the same months for three consecutive years (tradedate), then that insiderID will be regular trader and dummy will be 1 for that insiderID, but if the insiderID trades both in the same months for three consecutive years and in other months in other years as well or if that insiderID does not have a certain pattern of trade (i.e. not same months every year) then that insiderID will be irregular trader and hence dummy will be 0. It means, for regular traders dummy will be 1, and for irregular traders dummy will be 0.

2. If an insiderID trades in the same month in three consecutive years, all trades that he makes in the same month are regular trades and dummy will be 1. But trades made in different months will be irregular trades and dummy will be 0. It means for regular trades dummy will be 1, and for irregular trades dummy will be 0. This second query says that a regular trader (insiderID) may have both regular and irregular trades, and an irregular trader (insiderID) may have both regular and irregular trades.

So, there will be two dummy variables for these two queries: one dummy variable will be for insiderID (i.e. regular and irregular traders) and another dummy variable will be for tradedates (i.e. regular and irregular trades) of each insiderID. How can I write the codes for this task? Your helps are always appreciated.

insiderID year CompanyISIN    tradedate

1016442004US775043102207/12/2004
1016442004US775043102208/12/2004
1016442005US775043102229/11/2005
1016442006US775043102207/12/2006
1016442008US775043102218/11/2008
1016442010US775043102202/12/2010
1016442011US775043102202/12/2011
1016442012US775043102228/02/2012
1016442012US775043102229/02/2012
1016442012US775043102206/12/2012
1018902004US580135101727/07/2004
1018902005US580135101702/05/2005
1018902005US580135101712/09/2005
1018902006US580135101709/02/2006
1018902006US580135101717/03/2006
1018902006US580135101713/09/2006
1018902006US580135101712/12/2006
1018902007US580135101723/10/2007
1018902007US580135101730/11/2007
1018902008US580135101723/10/2008
1018902009US580135101723/03/2009
1018902009US580135101712/05/2009
1018902010US580135101712/02/2010
1018902010US580135101722/10/2010
1018902011US580135101711/02/2011
1018902011US580135101725/04/2011
1018902012US580135101710/02/2012
1023462004US580135101722/10/2004
1028412006US45170X106331/10/2006
1028522004US039483102013/09/2004
1028522004US039483102014/09/2004
1028522004US039483102018/11/2004
1028522005US039483102030/12/2004
1028522005US039483102008/08/2005
1028522005US039483102019/08/2005
1028522006US039483102022/02/2006
1028522006US039483102010/05/2006
1028522006US039483102013/10/2006
1028522007US039483102023/03/2007
1028522007US039483102017/08/2007
1028522008US039483102008/08/2008
1028522009US039483102010/02/2009
1028522009US039483102010/08/2009
1028522010US039483102026/04/2010
1028522010US039483102003/08/2010
1028522011US039483102024/02/2011
1028522011US039483102025/02/2011
1028522011US039483102008/08/2011
1028522011US039483102001/11/2011
1425722008CH002575132928/04/2008
1425722011US052769106923/05/2011
1425722011US052769106907/07/2011
1425722011US052769106920/12/2011
1425722012US052769106906/03/2012
1425722012US052769106927/03/2012
1425722012US052769106904/04/2012
1425722012US052769106919/06/2012
1450352010CH002575132929/06/2010
1450352011CH002575132926/05/2011
1450352011CH002575132927/05/2011
1450352011CH002575132929/06/2011
1481752004US167250109501/07/2003
1481752004US167250109512/02/2004
1481752004US167250109520/04/2004
1481752004US167250109501/07/2004
1481752004US167250109520/08/2004
1481752005US167250109509/03/2005
1481752005US167250109502/06/2005
1481752005US167250109501/07/2005
1481752006US167250109523/06/2006
1481752007US167250109521/02/2007
1481752007US167250109528/02/2007
1481752007US167250109529/11/2007
1481752008US167250109521/02/2008
1481752008US167250109522/02/2008
1481752008US167250109526/02/2008
1481752009US167250109520/02/2009
1481752010US167250109514/01/2010
1481752010US167250109519/02/2010
1481752010US167250109522/02/2010
1481752010US167250109516/03/2010
1481752010US167250109517/03/2010
1481752010US167250109518/03/2010
1481752010US167250109521/04/2010
1481752011US167250109517/02/2011
1481752011US167250109518/02/2011
1481752011US167250109522/02/2011
1481752011US167250109522/06/2011
1481752011US167250109501/09/2011
1481752011US167250109507/09/2011
1481752011US167250109501/12/2011
1481752012US167250109516/02/2012
1481752012US167250109517/02/2012
1481752012US167250109520/02/2012
1481752012US167250109522/02/2012
1481752012US167250109501/03/2012
1481752012US167250109501/06/2012
1484412008US167250109521/02/2008
1484412008US167250109522/02/2008
1484412008US167250109526/02/2008
1484412009US167250109520/02/2009
1484412010US167250109519/02/2010
1484412010US167250109522/02/2010
1484412010US167250109525/10/2010
1484412010US167250109505/11/2010
1484412010US167250109509/11/2010
1484412011US167250109517/02/2011
1484412011US167250109518/02/2011
1484412011US167250109522/02/2011
1484412011US167250109505/04/2011
1484412012US167250109516/02/2012
1484412012US167250109517/02/2012
1484412012US167250109520/02/2012
1484412012US167250109522/02/2012

Accepted Solutions
Solution
‎06-27-2014 10:04 AM
Super Contributor
Posts: 644

Re: Creating dummy variables

Posted in reply to RichardinOz

Here is the cleaned up code, commented and re-tested against all the example trades you provided.

Note that I have renamed a lot of the temporary variables for code clarity

I have also removed the Sort step as it seems duplicates are not a problem in the final step.

I will put it back in if you report the problem again.

Richard

View solution in original post

Attachment

All Replies
Super Contributor
Posts: 644

Re: Creating dummy variables

Posted in reply to AbuChowdhury

Data insider_data ;

     infile datalines delimiter='09'x ;

     informat insiderID $6.

                 year 4.

                 CompanyISIN  $12.

                 tradedate ddmmyy10.

                 ;

     format tradedate ddmmyy10. ;

     input insiderID year companyISIN tradedate ;

datalines ;

101644 2004 US7750431022 07/12/2004

101644 2004 US7750431022 08/12/2004

101644 2005 US7750431022 29/11/2005

run ;

Proc SQL ;

  Create table trade_months as

       Select Distinct

                 insiderID

            ,    month(tradedate) as trademonth

            ,    year

        From insider_data

       Order by 1, 2, 3

       ;

Quit ;

Data check_months (keep = insiderID trademonth)

       check_traders (keep = insiderID)

       ;

  Set Trade_months ;

  By insiderID trademonth year ;

  prev_year = lag(year) ;

  retain testmonth testID ;

  if first.insiderID then testID = 0 ;

  If first.trademonth then

       do ;

       prev_year = 0 ;

       testmonth = 0 ;

       end ;

  else do ;

       if year = sum(prev_year, 1) then testmonth + 1 ;

       else testmonth = 0 ;

       end ;

  if testmonth = 2 then

       do ;

       testID = 1 ;

       output check_months ;

       end ;

  if last.insiderID and testID then output check_traders ;

run ;

Proc SQL ;

  Create table want as

       select   INS.*

            ,      missing(RID.insiderID) as irregularID

            ,      missing(MTH.trademonth) as irregularTrade

       From insider_data INS

            left join check_traders RID

                 on RID.insiderID = INS.insiderID

            left join check_months MTH

                 on MTH.trademonth = month(INS.tradedate)

                 and MTH.insiderID  = INS.insiderID

            Order by 1, 2, 4

            ;

Quit ;

[tested code]

Richard

Regular Contributor
Posts: 183

Re: Creating dummy variables

Posted in reply to RichardinOz

Hi Richard,

Thanks a lot. I have the following query:

I have SAS dataset saved in my computer. So should I write these following codes?

Data insider_data ;

     infile datalines delimiter='09'x ;

     informat insiderID $6.

                 year 4.

                 CompanyISIN  $12.

                 tradedate ddmmyy10.

                 ;

     format tradedate ddmmyy10. ;

     input insiderID year companyISIN tradedate ;

datalines ;

101644 2004 US7750431022 07/12/2004

101644 2004 US7750431022 08/12/2004

101644 2005 US7750431022 29/11/2005

run ;

In the want table it shows more observations than the original dataset? Why?

Super Contributor
Posts: 644

Re: Creating dummy variables

Posted in reply to AbuChowdhury

Abu

Of course if you already have the data loaded then there is no need for the first datastep.  Be aware that I have treated year as a numeric variable - if it is text in your data you might see some messages about automatic conversions. 

The error message on the month function is more of a puzzle.  I would only expect that message if some of your tradedates were missing, which is unlikely.  The other possibility is that tradedate is a text (character) variable in your data.  To fix that you would have to use

     month(input(tradedate, ddmmyy10.))

Richard

Regular Contributor
Posts: 183

Re: Creating dummy variables

Posted in reply to RichardinOz

Hi Richard,

In my dataset, year variable is also numeric.

Now there is no error message. But I see more observations in the want table than in the original dataset.

Super Contributor
Posts: 644

Re: Creating dummy variables

Posted in reply to AbuChowdhury

I can only suggest that with your very large dataset there are inadvertent duplicates in the intermediate datasets.  Try adding the following line

Data check_months (keep = insiderID trademonth)

       check_traders (keep = insiderID)

       ;

  Set Trade_months ;

  By insiderID trademonth year ;

  If first.year ;

  prev_year = lag(year) ;

...

Regular Contributor
Posts: 183

Re: Creating dummy variables

Posted in reply to RichardinOz

I have added that line. Still it shows more observations in the "want" table. There are 960 more observations in this "want" table.

Super Contributor
Posts: 644

Re: Creating dummy variables

Posted in reply to AbuChowdhury

I have rerun my code and for the 114 sample rows you provided there are no duplications, so I do not have much to go on.  Can you detect anything about the 960 extra rows.  Are they exact duplicates of other rows in the output (leading to a work around to do a

     proc sort nodup ;

     by insiderID year tradedate CompanyISIN ;

)?

Are they belonging to insiderID where there is more than one "regular" month?

Can't help much more without access to the whole dataset, not really a feasible option.

Maybe someone else can come up with a solution.

Richard

Regular Contributor
Posts: 183

Re: Creating dummy variables

Posted in reply to RichardinOz

After submitting the following codes 960 duplicate rows are deleted.

proc sort data = want nodup;

by insiderID year tradedate CompanyISIN ;

run;


And it shows the following message:

NOTE: 960 duplicate observations were deleted.



But I think there is another problem. The following is the snapshot of the "Want" table. I just copied it for the first insiderID (101644).


insiderID       year      CompanyISIN            tradedate        irregular     irregular

                                                                                       ID             Trade

1016442004US775043102207/12/200400
1016442004US775043102208/12/200400
1016442005US775043102229/11/200501
1016442006US775043102207/12/200600
1016442008US775043102218/11/200801
1016442010US775043102202/12/201000
1016442011US775043102202/12/201100
1016442012US775043102228/02/201201
1016442012US775043102229/02/201201
1016442012US775043102206/12/201200

For "irregularTrade" variable, if the trades are irregular then dummy is 1 and if those are regular then dummy is 0, right? For these tradedates (02/12/2010, 02/12/2011 and 06/12/2012) the values of irregularTrade are 0 since those are in the same month (december) for three consecutive years (2010, 2011 and 2012). But this table also shows 0 for irregularTrade for 07/12/2004, 08/12/2004 and 07/12/2006. These should be irregular trades and value should be 1 since these trades are not for three consecutive years (only in 2004 and 2006) although those were in the same month (december).

Please correct me if I am wrong. If I am right then what should be done?

Super Contributor
Posts: 644

Re: Creating dummy variables

Posted in reply to AbuChowdhury

Abu

You are right.  I misinterpreted your original request.  I'm not able to work on a revision right now.  A solution would require capturing the first and last years of a sequence as well as the month number.

Richard

Regular Contributor
Posts: 183

Re: Creating dummy variables

Posted in reply to RichardinOz

Hi Richard, although it's urgent for me but I will be waiting for your revision.

Super Contributor
Posts: 644

Re: Creating dummy variables

Posted in reply to AbuChowdhury

Here is the revised code.  Note that I have changed the 'order by' order in the first SQL, and restored the desired order in a subsequent sort to try and ensure all duplicated are eliminated. 

Proc SQL ;

  Create table trade_months as

  Select Distinct

  insiderID

  , year

  , month(tradedate) as trademonth

  From insider_data

  Order by 1, 2, 3

  ;

Quit ;

Proc Sort nodup

  data = trade_months ;

  By InsiderID

  trademonth

  year

  ;

Run ;

Data check_months (keep = insiderID trademonth teststart testend)

  check_traders (keep = insiderID)

  ;

  Set Trade_months ;

  By insiderID trademonth year ;

  prev_year = lag(year) ;

  retain testmonth testID teststart testend ;

  if first.insiderID then testID = 0 ;

  If first.trademonth then

  do ;

  prev_year = 0 ;

  testmonth = 0 ;

  teststart = prev_year ;

  testend = 0 ;

  end ;

  else do ;

  if testmonth = 0 then teststart = prev_year ;

  if year = sum(prev_year, 1) then testmonth + 1 ;

  else testmonth = 0 ;

  end ;

  if  last.trademonth then

  do ;

  if testmonth >= 2 then

  do ;

  testID = 1 ;

  testend = year ;

  output check_months ;

  end ;

  else teststart = 0 ;

  end ;

  if last.insiderID and testID then output check_traders ;

run ;

Proc SQL ;

  Create table want as

  select INS.*

  , missing(RID.insiderID) as irregularID

  , missing(MTH.trademonth) as irregularTrade

  From insider_data INS

  left join check_traders RID

  on RID.insiderID = INS.insiderID

  left join check_months MTH

  on MTH.trademonth = month(INS.tradedate)

  and MTH.insiderID  = INS.insiderID

  and year between MTH.teststart and MTH.testend

  Order by 1, 2, 4

  ;

Quit ;

Richard

Regular Contributor
Posts: 183

Re: Creating dummy variables

Posted in reply to RichardinOz

Hi, your revised code has solved the issue for trades such as first insiderID (101644), but it has created another problem. You have one value for each "teststart" and "testend" for a month, but there may be multiple values for each "teststart" and "testend" for that month. For example, look at the "want" table for this insiderID (396157) from your revised code.

This trader has trades on 08/07/2010,  09/07/2010,  12/07/2010,  08/07/2011,  11/07/2011,  09/07/2012,  10/07/2012,  11/07/2012. Since he has trades in the same month (july) for at least three consecutive years (2010, 2011 and 2012), irregularTrade variables are 0 for each of these dates. But if you look at the dates of  18/07/2005,  07/07/2006,  09/07/2007,  07/07/2008, these are also in the same month (july) for at least three consecutive years (2005, 2006, 2007 and 2008), so irregularTrade variables should be 0 for these dates as well. But it shows 1 (I have made those dates bold). It means for this month (july), there should be two values for each "teststart" and "testend":

insiderID     testmonth     teststart     testend

396157          7               2005          2008

396157          7               2010          2012

There was not this problem when I run your previous code. I include the results from your previous code further below.
It means your previous code shows wrong results in some cases (such as first insiderID: 101644). Your revised code solves that problem but creates another problem (such as insiderID 396157). It's really very complicated.

Results from your revised code:

insiderID       year      CompanyISIN            tradedate        irregular     irregular

                                                                                       ID             Trade

3961572004US500643200030/06/200400
3961572004US500643200013/12/200401
3961572004US500643200014/12/200401
3961572005US500643200030/06/200500
3961572005US500643200018/07/200501
3961572006US500643200030/06/200600
3961572006US500643200007/07/200601
3961572007US500643200008/01/200701
3961572007US500643200027/06/200700
3961572007US500643200029/06/200700
3961572007US500643200009/07/200701
3961572008US500643200027/06/200800
3961572008US500643200007/07/200801
3961572009US500643200029/06/200900
3961572010US500643200009/07/200800
3961572010US500643200028/06/201000
3961572010US500643200008/07/201000
3961572010US500643200009/07/201000
3961572010US500643200012/07/201000
3961572011US500643200004/04/201101
3961572011US500643200005/04/201101
3961572011US500643200017/06/201100
3961572011US500643200008/07/201100
3961572011US500643200011/07/201100
3961572012US500643200018/06/201200
3961572012US500643200009/07/201200
3961572012US500643200010/07/201200
3961572012US500643200011/07/201200

Following was the result from your previous code. For these dates 18/07/2005,  07/07/2006,  09/07/2007,  07/07/2008, it shows 0.

insiderID       year      CompanyISIN            tradedate        irregular     irregular

                                                                                       ID             Trade

3961572004US500643200030/06/200400
3961572004US500643200013/12/200401
3961572004US500643200014/12/200401
3961572005US500643200030/06/200500
3961572005US500643200018/07/200500
3961572006US500643200030/06/200600
3961572006US500643200007/07/200600
3961572007US500643200008/01/200701
3961572007US500643200027/06/200700
3961572007US500643200029/06/200700
3961572007US500643200009/07/200700
3961572008US500643200027/06/200800
3961572008US500643200007/07/200800
3961572009US500643200029/06/200900
3961572010US500643200009/07/200800
3961572010US500643200028/06/201000
3961572010US500643200008/07/201000
3961572010US500643200009/07/201000
3961572010US500643200012/07/201000
3961572011US500643200004/04/201101
3961572011US500643200005/04/201101
3961572011US500643200017/06/201100
3961572011US500643200008/07/201100
3961572011US500643200011/07/201100
3961572012US500643200018/06/201200
3961572012US500643200009/07/201200
3961572012US500643200010/07/201200
3961572012US500643200011/07/201200

Please test your code for this trader as well. Your old code shows all 0 for December of 2004, 2005, 2006, 2010, 2011 but revised code shows 1 for all these years. But 0 should be only for 2004, 2005 and 2006. And 1 should only be for 2010 and 2011.

insiderID       year      CompanyISIN            tradedate

2839902004US150934883501/07/2004
2839902004US150934883510/12/2004
2839902005US150934883503/01/2005
2839902005US150934883505/04/2005
2839902005US150934883531/05/2005
2839902005US150934883501/07/2005
2839902005US150934883512/12/2005
2839902006US150934883503/01/2006
2839902006US150934883508/05/2006
2839902006US150934883518/09/2006
2839902006US150934883529/09/2006
2839902006US150934883514/12/2006
2839902007US150934883526/03/2007
2839902007US150934883527/03/2007
2839902008US150934883518/03/2008
2839902008US150934883519/03/2008
2839902009US150934883531/07/2009
2839902009US150934883511/09/2009
2839902009US150934883525/09/2009
2839902009US150934883522/10/2009
2839902009US150934883510/11/2009
2839902010US150934883525/03/2010
2839902010US150934883529/06/2010
2839902010US150934883530/06/2010
2839902010US150934883509/08/2010
2839902010US150934883510/08/2010
2839902010US150934883529/09/2010
2839902010US150934883530/09/2010
2839902010US150934883505/11/2010
2839902010US150934883510/12/2010
2839902010US150934883515/12/2010
2839902010US150934883516/12/2010
2839902011US150934883530/12/2010
2839902011US150934883531/12/2010
2839902011US150934883526/01/2011
2839902011US150934883527/01/2011
2839902011US150934883523/02/2011
2839902011US150934883525/03/2011
2839902011US150934883513/05/2011
2839902011US150934883529/07/2011
2839902011US150934883526/08/2011
2839902011US150934883516/09/2011
2839902011US150934883521/09/2011
2839902011US150934883530/09/2011
2839902011US150934883528/10/2011
2839902011US150934883530/11/2011
2839902011US150934883529/12/2011
2839902012US150934883527/01/2012
2839902012US150934883524/02/2012
2839902012US150934883521/03/2012
2839902012US150934883530/03/2012
2839902012US150934883527/04/2012
2839902012US150934883528/06/2012
2839902012US150934883529/06/2012
2839902012US150934883530/07/2012
2839902012US150934883510/08/2012
2839902012US150934883522/08/2012
2839902012US150934883531/08/2012
2839902012US150934883520/09/2012
2839902012US150934883526/09/2012
Super Contributor
Posts: 644

Re: Creating dummy variables

Posted in reply to AbuChowdhury

Abu

The condition

  if year = sum(prev_year, 1) then testmonth + 1 ;

  else testmonth = 0 ;

is too simple, it is causing the earlier results to be lost with the reset to zero.

I am not able to verify right now but I think the code at that point should be something like this:

  if year = sum(prev_year, 1) then testmonth + 1 ;

  else if testmonth >= 2 then

       do ;

       testID = 1 ;

       testend = year ;

       output check_months ;

       testmonth = 0 ;

       end ;

  else testmonth = 0 ;


Richard


Regular Contributor
Posts: 183

Re: Creating dummy variables

Posted in reply to RichardinOz

Hi, thanks a lot for your continuous support. This code solves the problem for insiderID 396157. But it still did not solve the case for insiderID 283990. IrregularTrade should be 1 for 2010 and 2011. But for 2010, the value shows 0. I see in the "check_months" table that it takes "year" as 2010 and "prev_year" as 2006, so "teststart" as 2004 and testend as "2010". But "year" should be 2006 and so "testend" should be 2006. Please note that there are trades in december of 2004, 2005, 2006, 2010 and 2011.


insiderID       year     trade_month     prev_year     test_month         testID     teststart     testend

283990          2012          3                    2011               2                    1          2010          2012

283990          2012          8                    2011               2                    1          2010          2012

283990          2012          9                    2011               3                    1          2009          2012

283990          2011         11                   2010               2                    1          2009          2011

283990          2010         12                   2006               2                    1          2004          2010




I see the following in "check_months" table for insiderID 396157.

insiderID       year     trade_month     prev_year     test_month         testID     teststart     testend

396157         2012            6                 2011                8                    1          2004          2012

396157         2010            7                 2008                3                    1          2005          2010

396157         2012            7                 2011                2                    1          2010          2012

So this problem is because of identifying the "prev_year".

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 26 replies
  • 739 views
  • 1 like
  • 2 in conversation