BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Kc2
Quartz | Level 8 Kc2
Quartz | Level 8

Hello,

 

I have a dataset with 2 variables, id and reason. the dataset has 240 rows. The issue I have is that the variable reason has date values in it but those values are not consistent in format:

e.g  HAVE

ID   REASON

1) Multiple calls 14May2018, 15May2018, 16May2018, 17May2018, 18May2018.

2) No contact made, participant had surgery on 20 Oct 2017

3) No response

4) no answer  18-19Jan18

5) no info not contactable 16/11/17.

 

I need to create a third variable and extract the date from the variable "reason" when available.

WANT

1)14May2018

1)15May2018

1)16May2018

1)17May2018

1)18May2018

2)20Oct17

3)

4)18Jan2018

4)19Jan2018

5)16Nov2017.

 

if there a way to copy  the content of the 240 rows of the variable reason in a subsequent data step in order to extract the date and assign he dates into a new variable date. The issue I am having sis that the dates are not in an identical format sometimes it is 16/11/17 , sometimes 12 dec, other times 18-19 jan

1 ACCEPTED SOLUTION

Accepted Solutions
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Manually may be the only option, consider the case:

5) no info not contactable 08/06/17

Is this 08Jun or 06Aug?  There are a fair few other examples out there, different formats methods of putting dates etc.

 

You could do some things in code:

start=substr(reason,findc(reason,"","d"));

For instance would find the first numeric in the string, if they are all like your example, that would trim it down a bit.  Even so, by the time you have programmed each check, it would be quicker to do it manually.  This is why free text fields are frowned upon on stats and databasing, next to impossible to get anything useful out of it.

View solution in original post

4 REPLIES 4
pau13rown
Lapis Lazuli | Level 10

i don't understand your final sentence ie your Q. You just have to write code eg substr and index functions to extract the date data and then derive the new date variable

Reeza
Super User

There isn’t an automated method to do this. For 240 records I’d almost consider doing it manually, probably about an hour worth of work. 

 

To do it programming - which is better in the long run - first isolate all your different type of entry. 

Handle the simplest first and work through them. 

 

Use SCAN to isolate each word in the statement and then test if it’s a date using INPUT() with the ANYDTE informat. That’ll probably take care of most of your cases. 

 

PROC FREQ, and text functions will be useful. 

Find, index, translate, compress, substr, scan, countw are some that you’ll need. 

 


@Kc2 wrote:

Hello,

 

I have a dataset with 2 variables, id and reason. the dataset has 240 rows. The issue I have is that the variable reason has date values in it but those values are not consistent in format:

e.g  HAVE

ID   REASON

1) Multiple calls 14May2018, 15May2018, 16May2018, 17May2018, 18May2018.

2) No contact made, participant had surgery on 20 Oct 2017

3) No response

4) no answer  18-19Jan18

5) no info not contactable 16/11/17.

 

I need to create a third variable and extract the date from the variable "reason" when available.

WANT

1)14May2018

1)15May2018

1)16May2018

1)17May2018

1)18May2018

2)20Oct17

3)

4)18Jan2018

4)19Jan2018

5)16Nov2017.

 

if there a way to copy  the content of the 240 rows of the variable reason in a subsequent data step in order to extract the date and assign he dates into a new variable date. The issue I am having sis that the dates are not in an identical format sometimes it is 16/11/17 , sometimes 12 dec, other times 18-19 jan


 

 

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Manually may be the only option, consider the case:

5) no info not contactable 08/06/17

Is this 08Jun or 06Aug?  There are a fair few other examples out there, different formats methods of putting dates etc.

 

You could do some things in code:

start=substr(reason,findc(reason,"","d"));

For instance would find the first numeric in the string, if they are all like your example, that would trim it down a bit.  Even so, by the time you have programmed each check, it would be quicker to do it manually.  This is why free text fields are frowned upon on stats and databasing, next to impossible to get anything useful out of it.

Ksharp
Super User

It is very difficulty ,you need consider  all the pattern of DATE.

 

data have;
infile cards truncover;
input id $ reason $80.;
cards;
1)  Multiple calls 14May2018, 15May2018, 16May2018, 17May2018, 18May2018.
2)  No contact made, participant had surgery on 20 Oct 2017
3)  No response
4)  no answer  18-19Jan18
5)  no info not contactable 16/11/17
;
run;

data want;
 set have;
 pid=prxparse('/\d+\s*[a-z]{3}\s*\d{2,}|\d+\-\d+[a-z]{3}\d+|\d+\/\d+\/\d+/i');
 s=1;
 e=length(reason);
 call prxnext(pid,s,e,reason,p,l);
 do while(p>0);
   want=substr(reason,p,l);output;
   call prxnext(pid,s,e,reason,p,l);
 end;
 drop s e pid p l;
run;

proc print noobs;run;

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 640 views
  • 1 like
  • 5 in conversation