Hi
I have a variable "date" that can be filled out a max of 175 times for each ID, but I get anywhere from 10 to 50 maybe one or two Ids would have 175 dates. I need to be able to run frequencies to find out - for each id- what was the last date filled out.
Is there a data step where I can create a variable "X" that would allow me to then run frequencies for the last date to know which date was the last filled out for each ID?
I was thinking of the if first then but use if last instead, would that work ?
if last.date then X= date;
I am thinking this would single out the last date for each ID in one general command. Am I thinking correctly ?
I also need to do a similar thing but in this next case I need to sum all of the other variables, but only those that were coded as 1.
I will give the example of variable "Y"
Variable Y could have been filled out 175 times but I can get anything from 1 to 175 for each of my Ids.
I already did an array to be able to single out the ones that were filled out as 1 but now I need to be able to sum them and they will vary per ID.
So one Id will have 10 - and another would have 5 etc..It is a stacked file.
I need to have a way to sum only the variables that were filled out (=1) and leave the missing out of the sum. To do this for each ID would take too long, as it varies. I would need to know which ones to keep and which ones to leave out of the sum.
Is there a general command that would allow me to create a variable that would be the sum of that variable Y ? but only for the ones that were filled out =1? Leaving the missing out?
TIA
For each ID, do you have up to 175 observations?
Or each each ID, do you have 175 variables, some of which may contain a missing value?
Either way, the programming is straightforward, but the answer depends on what the data looks like.
I have 175 possible dates of which some might be missing and I need to single out the last date for each ID.
For my other variables - not the date variables but the variables I created an array for - I have about 200 variables. Variable Y was one example of these 200 variables.
These 200 variables file is a stacked file, so I will have one ID with 5 entries for variable Y
I will have another ID with 100 entries for variable Y and another with 50 entries for variable Y and I wanted to be able to have the total without having to sum the up manually.
I have about 230 Ids so it would be a lot to do the sum Id by Id and for 200 different variables !
Thanks.
If you set up arrays in a DATA step, computations might be relatively simple. For example:
data want;
set have;
array dates {175} date1-date175;
array yvals {175} y1-y175;
total_y = sum(of yvals{*});
lastdate = max(of dates{*});
run;
The logic for LASTDATE might depend on whether the dates are always in order, and whether any missing values might appear in the middle of the set of values. This sample code captures the latest date, regardless of which variable contains that value.
Hi @Mscarboncopy Can you plz post a sample of your data and your required output?
It looks something like this but I have 175 dates and 200 + ids. I need to be able to have the last date filled out for each ID - thus I thought of creating a variable X that would be the last date for each, as you can see the last date can be anywhere.
ID | date1 | date2 | date3 | date4 | date5 | date6 | date7 | date8 | date9 | date10 | date11 |
1 | 1/1/200 | 2/2/2006 | |||||||||
2 | 1/2/2003 | 2/2/2005 | 3/3/2008 | 10/10/2009 | |||||||
3 | 2/6/2007 | 3/3/2008 | 10/10/2008 | 11/11/2008 | 12/19/2010 | 12/20/2011 | |||||
4 | 3/7/2008 | 4/4/2009 | 5/5/2010 | ||||||||
5 | 5/5/2010 | 6/6/2011 | 10/10/2008 | 11/11/2008 | 12/19/2010 | 12/20/2011 | 1/1/2012 | 3/3/2012 | 5/5/2013 | ||
6 | 6/6/2011 | 1/1/2012 | 3/5/2013 | ||||||||
7 | 1/1/2005 | 2/2/2005 | 3/3/2008 | 10/10/2009 | 9/9/2010 | ||||||
8 | 10/3/2000 | 11/13/2001 | 6/16/2005 | 7/7/2008 | 12/19/2010 | 12/20/2011 | 1/1/2012 | 3/3/2012 | 5/5/2013 | 10/3/2014 | 11/3/2014 |
9 | 11/11/2001 | 12/19/2011 | 6/7/2012 | 7/14/2012 | |||||||
10 | 2/2/2006 |
ok, lead us from here
data have;
infile cards truncover;
input ID (date1 date2 date3 date4 date5 date6 date7 date8 date9 date10 date11) (:mmddyy10.);
format date: mmddyy10.;
cards;
1 1/1/2002 2/2/2006
2 1/2/2003 2/2/2005 3/3/2008 10/10/2009
;
data want;
set have;
array t(*) date:;
last_date=t(n(of t(*)));
format last_date mmddyy10.;
run;
is my understading correct?
data want;
set have;
array t(*) date:;
last_date=t(n(of t(*)));
format last_date mmddyy10.;
run;
I don't know. I don't understand the array, shouldn't the array be all of the date variables ?
Someone else suggested this and I can understand it a little better this way:
array dates {175} date1-date175;
lastdate = max(of dates{*});
Would this be the same idea of what you were thinking? I have not been able to try it yet but will do asap.
Thanks
@Mscarboncopy wrote:
data want; set have; array t(*) date:; last_date=t(n(of t(*))); format last_date mmddyy10.; run;
I don't know. I don't understand the array, shouldn't the array be all of the date variables ?
Someone else suggested this and I can understand it a little better this way:
array dates {175} date1-date175;
lastdate = max(of dates{*});
Would this be the same idea of what you were thinking? I have not been able to try it yet but will do asap.
Thanks
The array definition of: array t(*) date:;
uses another of the SAS shorthand lists the colon following the variable name in DATE: part says "use all variables whose names start with DATE.
Thank you for the clarification.
@Mscarboncopy wrote:
It looks something like this but I have 175 dates and 200 + ids. I need to be able to have the last date filled out for each ID - thus I thought of creating a variable X that would be the last date for each, as you can see the last date can be anywhere.
ID date1 date2 date3 date4 date5 date6 date7 date8 date9 date10 date11 1 1/1/200 2/2/2006 2 1/2/2003 2/2/2005 3/3/2008 10/10/2009 3 2/6/2007 3/3/2008 10/10/2008 11/11/2008 12/19/2010 12/20/2011 4 3/7/2008 4/4/2009 5/5/2010 5 5/5/2010 6/6/2011 10/10/2008 11/11/2008 12/19/2010 12/20/2011 1/1/2012 3/3/2012 5/5/2013 6 6/6/2011 1/1/2012 3/5/2013 7 1/1/2005 2/2/2005 3/3/2008 10/10/2009 9/9/2010 8 10/3/2000 11/13/2001 6/16/2005 7/7/2008 12/19/2010 12/20/2011 1/1/2012 3/3/2012 5/5/2013 10/3/2014 11/3/2014 9 11/11/2001 12/19/2011 6/7/2012 7/14/2012 10 2/2/2006
Do you ever have gaps in the variables such as Date5 is missing but Date6 is not?
Do you have cases where Date5 > Date6? Would you want Date5 as the result in such a case when Date6 was the last value recorded? Or would Date5>Date6 indicate an error condition (one if not both of Date5 and Date6 were entered incorrectly) and something else should be done?
No gaps and the dates should be in chronological order but yes there could be errors (I was not thinking about that) where one date might be out of chronological order. But those should not be the norm,
So a few cases could have an instance where the last date is NOT the actual last date because it could have been coded out of chronological order.
Thanks
@Mscarboncopy wrote:
No gaps and the dates should be in chronological order but yes there could be errors (I was not thinking about that) where one date might be out of chronological order. But those should not be the norm,
So a few cases could have an instance where the last date is NOT the actual last date because it could have been coded out of chronological order.
Thanks
So what should be done when out of order values are encountered?
I asked the question because of the potential of the MAX value, or a not-tested "last variable", might be problem case. And requires different code evaluate and consider.
The n(of t(*)) approach suggested by @novinosrin gets the right most element when there are no gaps (why I asked) but does not consider actual values.
Something like
if max(of t(*)) ne t[ n(of t(*))] then do <whatever you want when the largest value is not the last value> end;
may be wanted for the potential out of order for some data entry errors.
Good point, however, I want the max value not necessarily the last entry in the case of an error - so I thought that the max value was a good idea, since I need to know what was the last date (as in the most recent, i.e., max). The last entry should have been the most recent date (max) but might not have been (in case of an error).
Thanks again,
There is no need for ARRAY if you just want the MAX value. Just pass the variable list to the MAX() function instead of using it to define the array.
last_date = max(of date1-date175);
total_Y = sum(of y1-y175);
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.