- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
my dates in this cirque du soleil data come in rather jumbled up from the internet.
I'd like to figure out longest standing shows, which means I need start date & end date.
I can always extract the last word & if "present" use today() as enddate.
dates with a single value are one year events so those also get ruled out.
but how do I extract start & end dates for the remaining values where the date are stored in all kinds of text strings.
scratched my head with the scan function, & not much time left to complete my ppt so any tips greatly appreciated.
thank you
Date |
30 June 2018–present |
13 July 2013 – 30 July 2017 |
24 June 2009 – 1 September 2013 |
38549 |
13–28 December 2014 |
38179 |
17–19 October 2008 |
15 July 2015–present |
34851 |
42195 |
May–September 2010 |
6 February 2016 – 23 December 2020 |
12 June 1997 – 28 May 2000 |
13 May–30 August 2015 |
16 June–9 September 2017 |
27 November 2019 – 18 December 2022 |
14 August 2018 – 19 August 2019 |
2–10 January 2016 |
2 May 2022–present |
23 September 2018—present |
June–September 2008 |
20 March–29 November 1992 |
39117 |
40965 |
37339 |
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Charu,
I typically try to stay away from using ANYDTDTE, but for a mess like this, it may be helpful.
I took a brute force approach, I'm sure there are much better ways. I ignore any runs that don't have a valid start date and end date. I treat 'June–September 2008' as an invalid start date, because since there is no start year, it must be less than one year, so we don't need to worry about it.
But this is an ugly hack...
data have ;
input date $50. ;
startc=scan(date,1,'–—') ;
endc=scan(date,2,'–—') ;
if endc='present' then endc=put(today(),date9.) ;
if length(startc)>5 then startn=input(startc,?? anydtdte60.) ;
if length(endc)>5 then endn=input(endc,?? anydtdte60.) ;
if n(startn,endn)=2 then duration=endn-startn ;
format startn endn date9. ;
cards ;
30 June 2018–present
13 July 2013 – 30 July 2017
24 June 2009 – 1 September 2013
38549
13–28 December 2014
38179
17–19 October 2008
15 July 2015–present
34851
42195
May–September 2010
6 February 2016 – 23 December 2020
12 June 1997 – 28 May 2000
13 May–30 August 2015
16 June–9 September 2017
27 November 2019 – 18 December 2022
14 August 2018 – 19 August 2019
2–10 January 2016
2 May 2022–present
23 September 2018—present
June–September 2008
20 March–29 November 1992
39117
40965
37339
;
run ;
Looking forward to seeing you at SAS Explore!
Next up: SAS Trivia Quiz hosted by SAS on Wednesday May 21.
Register now at https://www.basug.org/events.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Define the cases.
Provide a solution for each case.
If no -, (note: check your data for what character is actually the separator. The 'dash' I get copy and pasting to my editor is not a dash but a high order ASCII character that depends on font to display. That might be a result of pasting into the forum or what ever was used to create that gridlike appearance.) is present it should be a single date and a single INPUT function call should work on the value.
Attempt to convert the part before /after the - for the start/end date with appropriate informat. If the date fails it needs intervention.
Cases with a numeric value only for the first should be the same month and year of the enddate so you get mdy(month(enddate),firstday,year(enddate))
When you identify a part that has no day of the month you get to impute some value for the day of the month to set a date.
Here is a couple of cases. Similar for missing value and parsing the bits of the string to but into different cases.
data have; infile datalines truncover; input string $50.; startdate= input(scan(string,1,'–'),?? anydtdte32. ); enddate= input(scan(string,2,'–'),?? anydtdte32. ); if missing(startdate) then do; /* check if the first bit is numeric*/ firstday =input(scan(string,1,'–'),?? f4. ); if firstday>0 then startdate= mdy(month(enddate),firstday,year(enddate)); end; format startdate enddate date9.; datalines; 6 February 2016 – 23 December 2020 13–28 December 2014 ;
Note use of the ANYDTDTE function can be locale dependent when you get things like 12/10/12 as which gets "guessed" to be month, day and year but that does not appear to be the case with your example data.
The ?? in the INPUT function calls is to suppress the known invalid data messages so the log is somewhat manageable.
A couple of case it may be easier to create the word date value and use the INPUT function.
The stupid almost certainly spreadsheet numeric values of 34851 and such I suggest a step before "fixing". Use spreadsheet to set the column to date type, even if most of the values aren't, set a know display layout for the dates. Export the data as CSV and read the file with a datastep. That should get the single dates to appear as such and not need the post hoc intervention of adjusting from 1Jan1900 that is a moderately well know problem
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Charu,
I typically try to stay away from using ANYDTDTE, but for a mess like this, it may be helpful.
I took a brute force approach, I'm sure there are much better ways. I ignore any runs that don't have a valid start date and end date. I treat 'June–September 2008' as an invalid start date, because since there is no start year, it must be less than one year, so we don't need to worry about it.
But this is an ugly hack...
data have ;
input date $50. ;
startc=scan(date,1,'–—') ;
endc=scan(date,2,'–—') ;
if endc='present' then endc=put(today(),date9.) ;
if length(startc)>5 then startn=input(startc,?? anydtdte60.) ;
if length(endc)>5 then endn=input(endc,?? anydtdte60.) ;
if n(startn,endn)=2 then duration=endn-startn ;
format startn endn date9. ;
cards ;
30 June 2018–present
13 July 2013 – 30 July 2017
24 June 2009 – 1 September 2013
38549
13–28 December 2014
38179
17–19 October 2008
15 July 2015–present
34851
42195
May–September 2010
6 February 2016 – 23 December 2020
12 June 1997 – 28 May 2000
13 May–30 August 2015
16 June–9 September 2017
27 November 2019 – 18 December 2022
14 August 2018 – 19 August 2019
2–10 January 2016
2 May 2022–present
23 September 2018—present
June–September 2008
20 March–29 November 1992
39117
40965
37339
;
run ;
Looking forward to seeing you at SAS Explore!
Next up: SAS Trivia Quiz hosted by SAS on Wednesday May 21.
Register now at https://www.basug.org/events.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Numbers like 34851 are possibly "Excel dates", which can be converted to SAS dates by subtracting 21916. Or maybe they mean something else?
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content