Dear community,
after getting enormous help on two issues of mine, I beg for your help once again.
My problem is the following:
I have a data set, with month-end-values for 20 years (e.g. 20 observations for 31/08/1971, 37 observations for 30/09/1971, ...)
For every "date point", I would like to have a variable for identification purposes (:= date_id).
If my first date-point is 31/08/1971, every observation in this "group" shall get the value "1" for date_id;
Every observation in 30/09/1971 shall get a "2" and so on.
I attached an excerpt of the way of output I strive for (the variable date_id is the variable I would like to create).
It would be great, if you could consider my problem.
Yours sincerely,
Sinistrum
Does below give you what you're after?
data have;
infile datalines dlm=' ' truncover;
input Date:date9. name $ wage ;
format date date9.;
datalines;
31AUG1971 Hans 1000
31AUG1971 Peter 2000
31AUG1971 Lisa 500
30JUL1971 Tina 1700
30JUL1971 Peter 250
;
run;
proc sort data=have;
by date;
run;
data want;
length date_id 8;
set have;
by date;
if first.date then date_id+1;
run;
How will you use this new id?
Why can't you use date directly?
If the data is sorted by date, use a data step, BY date.
Then increase your id +1 whenever first.date is true.
Once again, I received exactly the help I needed (thus, yes, Patrick, above gives me what I am after). Thank you very much, both of you!
@LinusH wrote:How will you use this new id?
Why can't you use date directly?
If the data is sorted by date, use a data step, BY date.
Then increase your id +1 whenever first.date is true.
I want to use this new ID for a "expanding" regression I do have to run.
My sample is to be divided into two parts - an "in-sample" part, where I get parameter-estimates and an "out-of-sample" part, where I do apply the parameter-estimates based on the prior information in such a kind, that I use them to obtain an estimate for the next data point (e.g. from January 1970 until December 1998, I run the regression each month, of ret_realzd on estor_ret, with "returns_realized" := average realized returns and "estor_ret" := average estimator for returns; then, in January 1999, I multiply my estimated slope with the then-prevailing estor_ret, add the intercept and look how well I approximate the realized return, by considering the “error”);
Then, for the next set of regressions, the period of estimation prolongs by one month, which is to say, that I run the regression from January 1970 up to January 1999; then, I investigate, with the same procedure, how well I can approximate the realized return as of February 1999.
I am considering to implement this procedure with a DO-Loop; From t=70 (e.g., end of first insample-period) to T=130 (total sample-end);
with every increment, the data-set used for regression "grows" by one step.
So, for me, as someone without any SAS-experience, it is very convenient to have a date_id; I consider to use this in correspondence with my planned DO-loop.
So to ask a similar question like @LinusH:
If you're using monthly data then couldn't you just use dates aligned to month end? Wouldn't that group your data in exactly the same way than creating such a "date_id"?
If it's "the same" then that would be the preferred way as it will work with any data and you will also never have to implement some "complicated" logic when combining data from multiple sources (like figuring out the starting point for incrementing "date_id").
Dear Patrick,
I considered your remark for ~ three days now and still am too bovine to think of an appropriate answer.
Sorry for that.
For as I do not want to leave your question unanswered (especially faced with the fact I received such great and quick help), I would like to try to show how I implemented the "dte_id".
No doubt that my "program" and whole approach might be horrible and is poles apart from a solution a versed programmer would be capable of coming up with, but still, it seemingly is working for me.
*******************************************************************************
Run 'expanding' regression, copying each paramter-estimates to next observation;
*'*'*'*'*'*'*'*'*'*'*'
MA_m&u._i&t. :=
Moving average of predictor variable at end of month
[Note: &u. and &t. denote different Modell specifications, "m" for model,
"i" for ICC (my 'predictor')]
slnrpk1 :=
realized return at end of NEXT month (therefore "1")
vw_ts :=
Complete value weighted time series;
'Appending' Regression;
%macro insample;
%DO i=354 %TO 557;
%LET rgno = &i.;
data vw_ts_m&u._i&t._&rgno.;
set vw_ts_m&u._i&t.;
if dte_id LE &i.;
run;
%ODSoff;
proc model
data=vw_ts_m&u._i&t._&rgno.;
endo slnrpk1;
exog MA_m&u._i&t.;
instruments _exog_;
parms a b;
slnrpk1 = a+b*MA_m&u._i&t.;
fit slnrpk1 / gmm kernel=(bart,1,0) vardef=n;
ods output ParameterEstimates=Parameters;
run; quit;
%ODSon;
* Isolate b0;
data intercept; set Parameters; if Parameter="a"; run;
data intercept; set intercept; keep Estimate;
data intercept; set intercept; rename Estimate=a; run;
* Isolate b1;
data steigung; set Parameters; if Parameter="b"; run;
data steigung; set steigung; keep Estimate;
data steigung; set steigung; rename Estimate=b; run;
** Merge;
data slopes; merge intercept steigung; run;
data slopes; set slopes; dte_id = &i.+1; run;
*** Attach paramters to "next" data point, in order to be capable of getting forecast and error and such;
data Vw_ts_m&u._i&t.; merge Vw_ts_m&u._i&t. slopes; by dte_id; run;
*** house cleaning;
proc datasets library=work nolist; delete vw_ts_m&u._i&t._&rgno.; run; quit;
%END;
%mend insample;
%insample;
So for that "DO" Loop, I needed my "dte_id", because, I thought, I were to need an interger to implement the loop.
O.K., I believe I understand now what you're trying to do. What if you would calculate the number of months since - let's say 1Jan1960 - based on your date. This would also give you such a continuous number BUT it would strictly depend on the actual dates and you would always get the same number even if using different data sets.
So what I'm thinking about is something like:
data want;
length dte_id 8;
set have;
dte_id=intck('month',"01jan1960"d,date);
run;
So now in your macro you could then do the same - and you could also calculate the start and end point of your do loop based on a date as done in below code:
%macro insample(start_date, stop_date);
%let start_dte_id =%sysfunc(intck(month,"01jan1960"d,"&start_date"d));
%let stop_dte_id =%sysfunc(intck(month,"01jan1960"d,"&stop_date"d));
%put &=start_dte_id;
%put &=stop_dte_id;
%DO i=&start_dte_id %TO &stop_dte_id;
%LET rgno = &i.;
data vw_ts_m&u._i&t._&rgno.;
set vw_ts_m&u._i&t.;
if dte_id LE &i.;
run;
%ODSoff;
proc model
data=vw_ts_m&u._i&t._&rgno.;
endo slnrpk1;
exog MA_m&u._i&t.;
instruments _exog_;
parms a b;
slnrpk1 = a+b*MA_m&u._i&t.;
fit slnrpk1 / gmm kernel=(bart,1,0) vardef=n;
ods output ParameterEstimates=Parameters;
run;
quit;
%ODSon;
** Merge;
data slopes;
merge
intercept (keep=Parameter Estimate where=(Parameter="a") rename=(Estimate=a))
steigung (keep=Parameter Estimate where=(Parameter="b") rename=(Estimate=b))
;
drop Parameter;
dte_id = &i.+1;
run;
*** Attach paramters to "next" data point, in order to be capable of getting forecast and error and such;
data Vw_ts_m&u._i&t.;
merge Vw_ts_m&u._i&t. slopes;
by dte_id;
run;
*** house cleaning;
proc datasets library=work nolist;
delete vw_ts_m&u._i&t._&rgno.;
run;
quit;
%END;
%mend insample;
%insample(01jan2015, 01aug2015);
I've also reduced the number of data steps in your code as you're creating these data steps in every single loop of your macro - so when looping 200 times taking out 6 data steps could decrease run times and resource consumption quite a bit (depending on your data volumes of course).
There would have been some more room to even further reduce passes through data - but it would have complicated the code and I felt it's only worth it if you would have performance issues and you haven't raised anything like that.
So, you leave me really stunned.
This is enormous - thank you so much, it is really awesome. All this time you have taken.
I indeed do have performance issues, such that I planned running the regressions during night.
First and foremost, I have to look up all the statements used by you for I partially never have come across them.
Again, thank you very much indeed.
The help received here thus for is truly great.
You're welcome. I couldn't test the code so hope there are no syntax errors in it.
Give it a go and should you still have serious performance issues then I'd suggest you open a new discussion and just reference this one.
If further performance improvements would be required then you would need to attach a log so we can see what volumes you're dealing with and where the time gets spent. If posting a log please run your code with options "option fullstimer mprint;" set.
Hello,
First: Sorry once again for my delayed answer.
@Patrick wrote:You're welcome. I couldn't test the code so hope there are no syntax errors in it.
One thing appreas to be missing (as I see it):
** Merge;
data slopes;
merge
intercept (keep=Parameter Estimate where=(Parameter="a") rename=(Estimate=a))
steigung (keep=Parameter Estimate where=(Parameter="b") rename=(Estimate=b))
;
drop Parameter;
dte_id = &i.+1;
run;
"Intercept" and "steigung" have never been referenced in your code - did you just miss them or can I even merge data sets not having been created (so to say: create data sets in the very data step which entails merging)?
So, I changed it to this:
** Merge;
data slopes;
merge
Parameters (keep=Parameter Estimate where=(Parameter="a") rename=(Estimate=a))
Parameters (keep=Parameter Estimate where=(Parameter="b") rename=(Estimate=b))
;
drop Parameter;
dte_id = &i.+1;
run;
(which is, de facto, merging data sets not even having been created - I do not need the data sets "intercept" and "steigung" at all, as I see it now - six steps less).
Otherwise: Once again, a big thank you. I am honestly still stunned.
Especially, this "(keep rename ...)" procedure during data steps helps me a lot.
If I consider, how many useless data steps are in my "programs". My oh my...
@Patrick wrote:[...]
Give it a go and should you still have serious performance issues then I'd suggest you open a new discussion and just reference this one.
If further performance improvements would be required then you would need to attach a log so we can see what volumes you're dealing with and where the time gets spent. If posting a log please run your code with options "option fullstimer mprint;" set.
Well, this would make me feel even more like a leech.
It would feel like "well, here, look you all, for I have written a program to achieve this and that, go tidy it up, as I do not have time for that."
I already have a bad conscious because of all the extensive answers I received thus far - and still have multiple questions.
Hi @Sinistrum
Cool! Your fix of my code shows me that you've actually used your own brain to understand what I've tried to do (instead like others just reply "it's not working"). That's really all I'm asking for when posting an answer.
"Well, this would make me feel even more like a leech.
It would feel like "well, here, look you all, for I have written a program to achieve this and that, go tidy it up...."
This forum has quite a few very capable people who just like to help. The expectation is that you're not only consuming but take your time to formulate the question, provide sample data and expected result (if possible), give feedback and answer questions for clarification and mark the correct or chosen answer as "answered" (to close the track).
Best is to only ask one question per thread and start new threads with follow up questions (referencing the original thread). In doing so this forum also becomes a knowledge base with simple question/answer constructs; and there will also be more people looking into new and unanswered threads.
So: Just keep doing what you're doing and don't feel bad about it. It's everybody's own choice to spend the time to answer and you can assume people take this time because they also get something out of it. I for myself learn constantly new things from other people's answers and stay "sharp" with coding in times where I'm in my professional live mainly busy writing designs and the like.
Does below give you what you're after?
data have;
infile datalines dlm=' ' truncover;
input Date:date9. name $ wage ;
format date date9.;
datalines;
31AUG1971 Hans 1000
31AUG1971 Peter 2000
31AUG1971 Lisa 500
30JUL1971 Tina 1700
30JUL1971 Peter 250
;
run;
proc sort data=have;
by date;
run;
data want;
length date_id 8;
set have;
by date;
if first.date then date_id+1;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.