BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
chimukah
Obsidian | Level 7

Hi Everyone,

 

Please i have a partial date problem for both starting of and ending of adverse event.

the aestdtc = 18-11-202

      aeendtc= U-U-2020

How do i impute this partial dates knowing fully well that other dates in the column have length as

     aestdtc = 7-5-2020 where 7 corresponds to Day and 5 corresponds to Month.

 

Warm regards

 

1 ACCEPTED SOLUTION

Accepted Solutions
jimbarbour
Meteorite | Level 14

@chimukah wrote:

Please i have a partial date problem for both starting of and ending of adverse event.

the aestdtc = 18-11-202

      aeendtc= U-U-2020

How do i impute this partial dates knowing fully well that other dates in the column have length as

     aestdtc = 7-5-2020 where 7 corresponds to Day and 5 corresponds to Month.


Well, this is a bit difficult.  You may not be able to correctly determine some dates given partial dates like these.  However, there are some things that you can do:

Parse the data.  First, you would want to break the date into its Day, Month, and Year components using the SCAN function with '-' as your delimiter.

 

Correct the data.  Second, you would create correction rules such as:

  1. Start dates that are invalid will default to January 1st of the year indicated by the date.
  2. End dates that are invalid will default to December 31st of the year indicated in the date.
  3. Years equal to 202 will be set to 2020.

You could make more sophisticated rules, like setting a range on what years are valid or imputing the start year from the end year if the end year is valid, but the above are some ideas to start with.

 

Check the data.  Third, you would want to identify illogical or invalid conditions that remain after your parsing and correcting.  The End should not be before the Start, for example.  Perhaps you would want to delete records with illogical conditions from the final data or at least set the illogical values to missing.

 

Below is a sample program that encodes these steps (parse, correct, check).  The program probably won't do everything you need, but you can use it as a starting point.

 

Jim

 

DATA	Corrected_Dates	(KEEP=Final_aestdtc Final_aeendtc);
	FORMAT	Final_aestdtc	DATE9.;
	FORMAT	Final_aeendtc	DATE9.;

	INFILE	Datalines		DSD	DLM='09'X	MISSOVER;

	INPUT
		aestdtc	$	
		aeendtc	$
		;

	Work_Date				=	aestdtc;
	Type					=	'START';
	LINK	Parse_Date;
	Final_aestdtc			=	SAS_Date;

	Work_Date				=	aeendtc;
	Type					=	'END';
	LINK	Parse_Date;
	Final_aeendtc			=	SAS_Date;

	IF	MISSING(Final_aestdtc)					THEN
		DO;
			PUTLOG	"WARNING:  Invalid start date on record "  _N_=  COMMA17.;
		END;

	IF	MISSING(Final_aeendtc)					THEN
		DO;
			PUTLOG	"WARNING:  Invalid end date on record "  _N_=  COMMA17.;
		END;

	IF	Final_aeendtc		<	Final_aestdtc	THEN
		DO;
			PUTLOG	"WARNING:  End before start on record "  _N_=  COMMA17.;
			CALL	MISSING(Final_aestdtc, Final_aeendtc);
		END;

	******;
	RETURN;
	******;

	**********;
	Parse_Date:
	**********;
		Day					=	SCAN(Work_Date, 1, '-');
		Month				=	SCAN(Work_Date, 2, '-');
		Year				=	SCAN(Work_Date, 3, '-');

		IF	Type			=	'START'			THEN
			DO;
				IF	NOTDIGIT(STRIP(Day))		THEN
					Day		=	'1';
				IF	NOTDIGIT(STRIP(Month))		THEN
					MONTH	=	'1';
			END;

		IF	Type			=	'END'			THEN
			DO;
				IF	NOTDIGIT(STRIP(Day))		THEN
					Day		=	'31';

				IF	NOTDIGIT(STRIP(Month))		THEN
					MONTH	=	'12';
			END;

		IF	Year			=	'202'			THEN
			Year			=	'2020';

		SAS_Date			=	MDY(INPUT(STRIP(Month), 2.), INPUT(STRIP(Day), 2.), INPUT(STRIP(Year), 4.));
	******;
	RETURN;
	******;

DATALINES;
18-11-202	U-U-2020
;
RUN;

View solution in original post

7 REPLIES 7
jimbarbour
Meteorite | Level 14

@chimukah wrote:

Please i have a partial date problem for both starting of and ending of adverse event.

the aestdtc = 18-11-202

      aeendtc= U-U-2020

How do i impute this partial dates knowing fully well that other dates in the column have length as

     aestdtc = 7-5-2020 where 7 corresponds to Day and 5 corresponds to Month.


Well, this is a bit difficult.  You may not be able to correctly determine some dates given partial dates like these.  However, there are some things that you can do:

Parse the data.  First, you would want to break the date into its Day, Month, and Year components using the SCAN function with '-' as your delimiter.

 

Correct the data.  Second, you would create correction rules such as:

  1. Start dates that are invalid will default to January 1st of the year indicated by the date.
  2. End dates that are invalid will default to December 31st of the year indicated in the date.
  3. Years equal to 202 will be set to 2020.

You could make more sophisticated rules, like setting a range on what years are valid or imputing the start year from the end year if the end year is valid, but the above are some ideas to start with.

 

Check the data.  Third, you would want to identify illogical or invalid conditions that remain after your parsing and correcting.  The End should not be before the Start, for example.  Perhaps you would want to delete records with illogical conditions from the final data or at least set the illogical values to missing.

 

Below is a sample program that encodes these steps (parse, correct, check).  The program probably won't do everything you need, but you can use it as a starting point.

 

Jim

 

DATA	Corrected_Dates	(KEEP=Final_aestdtc Final_aeendtc);
	FORMAT	Final_aestdtc	DATE9.;
	FORMAT	Final_aeendtc	DATE9.;

	INFILE	Datalines		DSD	DLM='09'X	MISSOVER;

	INPUT
		aestdtc	$	
		aeendtc	$
		;

	Work_Date				=	aestdtc;
	Type					=	'START';
	LINK	Parse_Date;
	Final_aestdtc			=	SAS_Date;

	Work_Date				=	aeendtc;
	Type					=	'END';
	LINK	Parse_Date;
	Final_aeendtc			=	SAS_Date;

	IF	MISSING(Final_aestdtc)					THEN
		DO;
			PUTLOG	"WARNING:  Invalid start date on record "  _N_=  COMMA17.;
		END;

	IF	MISSING(Final_aeendtc)					THEN
		DO;
			PUTLOG	"WARNING:  Invalid end date on record "  _N_=  COMMA17.;
		END;

	IF	Final_aeendtc		<	Final_aestdtc	THEN
		DO;
			PUTLOG	"WARNING:  End before start on record "  _N_=  COMMA17.;
			CALL	MISSING(Final_aestdtc, Final_aeendtc);
		END;

	******;
	RETURN;
	******;

	**********;
	Parse_Date:
	**********;
		Day					=	SCAN(Work_Date, 1, '-');
		Month				=	SCAN(Work_Date, 2, '-');
		Year				=	SCAN(Work_Date, 3, '-');

		IF	Type			=	'START'			THEN
			DO;
				IF	NOTDIGIT(STRIP(Day))		THEN
					Day		=	'1';
				IF	NOTDIGIT(STRIP(Month))		THEN
					MONTH	=	'1';
			END;

		IF	Type			=	'END'			THEN
			DO;
				IF	NOTDIGIT(STRIP(Day))		THEN
					Day		=	'31';

				IF	NOTDIGIT(STRIP(Month))		THEN
					MONTH	=	'12';
			END;

		IF	Year			=	'202'			THEN
			Year			=	'2020';

		SAS_Date			=	MDY(INPUT(STRIP(Month), 2.), INPUT(STRIP(Day), 2.), INPUT(STRIP(Year), 4.));
	******;
	RETURN;
	******;

DATALINES;
18-11-202	U-U-2020
;
RUN;
chimukah
Obsidian | Level 7
Thank you Jimbarbour for your kind explanation and assistance in resolving this puzzle. I will implement same and see the outcome.
Best regards,
Chimuka
chimukah
Obsidian | Level 7
Hi Jimbarbour,
i was able to resolve that by applying the following code below for the AESTDTC. However, since the AEENDTC is U, i guess its unknown and it has to be missing.

*Y/M/D for partial date;
yearst = scan(aestdtc,3,'-');
monthst = scan(aestdtc,2,'-');
dayst = scan(aestdtc,1,'-');
if length(yearst) =3 then yearst=compress(yearst||'0');
AESTDT=mdy(monthst,dayst,yearst);

Regards
jimbarbour
Meteorite | Level 14

@chimukah wrote:
since the AEENDTC is U, i guess its unknown and it has to be missing.

Yes, that's probably best.  As @Patrick points out, it's typically better to set a value to missing than to give a value that will produce invalid or misleading results.  If you're not sure, the safest course is to set to missing.

 

The little program I showed you can be used as a sort of pattern or template for the parse-clean-check process, but you must determine what cleaning rules are appropriate for your situation and your data.  My program is meant to illustrate a technique but by no means should you accept my sample rules unless they really fit your data.

 

Jim

Patrick
Opal | Level 21

First of all you need to define the rules how to interpret incomplete dates. 

Do you know/can you say what dates your incomplete dates actually stand for (for all cases).

 

IF you can define exhaustive rules then you can write the code for it. What @jimbarbour posted is code to implement such rules - but are the rules used really sufficient for your actual data?

 

i.e. should aestdtc = 18-01-202 become 18Nov2020 or 18Nov2021? What other information do you have to define a rule which would provide the correct answer? Can there be a rule at all? Could you always know by looking at the data? 

 

...and of course ideally: Instead of trying to fix bad data go back to the the source/data provider and request data of better quality.

chimukah
Obsidian | Level 7
Thank you Patrick.
However when the database is locked and the source data cannot be changed then only data programming can correct such anomalies. In this case it should be 18NOV2020.
Thank you.
Patrick
Opal | Level 21

Then you need to list and analyze all the string you can't convert to a SAS date value and define the rules for conversion. I'd consider insufficient rules converting incomplete strings to the wrong SAS date values as even worse than having to deal with missing dates. Creating "inappropriate" dates will mask DQ issues and can lead to wrong conclusions.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 3197 views
  • 6 likes
  • 3 in conversation