Re: Good data validating process?

elwayfan446 · Posted 10-01-2018 05:20 PM

Hello everyone,

I am trying to develop a good data validation process for data in an existing database. What I would like to do is be able to scan the data in each variable to make sure the format is the same and values are existent and correct. The purpose is to find data that is not formatted correctly so I can reformat and make sure the data in each observation is formatted correctly.

I have searched on Google and can't seem to find exactly what I am looking for. I am hoping you can help.

Thanks!

kiranv_ · Posted 10-01-2018 05:28 PM

instead of an very broad question. can you give examples, what you have and what you are looking for. I am very sure, someone will able to help you or direct you to appropriate resources.

elwayfan446 · Posted 10-01-2018 05:33 PM

Sure. Here is an example. I would like to validate that data in each of these variables are in the correct format listed. If there are any that don't match, I want to know so I can update/format them correctly.

Reeza · Posted 10-01-2018 05:58 PM

How do you know if something is formatted incorrectly? Is that a check in your head or do you want to check the data somehow?H
Have you looked at the characterize data task in EG or do you have access to SAS DataFlux?

elwayfan446 · Posted 10-01-2018 06:02 PM

Hey Reeza,

Right now, that data check is in my head but it is in line with what you see are the variable formats in the screenshot above. I would like to build something to:

1. Check all current data in the dataset to make sure it is all formatted the same and correct if necessary

2. Be able to ensure that new data being inserted is formatted correctly and spit out some sort of report that lets me know which data is not formatted how it should be

I use EG but have not looked at the characterize data task. I don't have access to SAS DataFlux.

ballardw · Posted 10-01-2018 06:00 PM

@elwayfan446 wrote:

Sure. Here is an example. I would like to validate that data in each of these variables are in the correct format listed. If there are any that don't match, I want to know so I can update/format them correctly.

I would say from that example that variables #7, #30, #2, #25, #29 and #4 all show a potential problem. The informat is less than the length of the variable, so depending on when/how the informat was used and the actual data you may have some truncation of data.

And if that is the case you would have to refer to a text file that was read to determine if the value in your data set is correct.

Comparison with external files may be needed for any of your informats since they affect how data is read.

Example:

data example;
   informat txt $5. ;
   length word $ 20;
   input word $5. txt ;
datalines;
Somelongishtext short
;
run;

The best time to validate some data is immediately after reading it. Waiting may make things harder, especially if external files the data is read from get replaced, removed or overwritten.

elwayfan446 · Posted 10-01-2018 06:05 PM

@ballardw, you are correct. I am trying to clean up someone elses mess I guess you could say. As I told Reeza, once I get this cleaned up I want to build a process that ensures the data is correct before being added to the master dataset.

ballardw · Posted 10-01-2018 07:02 PM

Validation rules generally start with a description of what the data should look like.

Some common things:

Which variables should never have missing values? Relatively easy to check: Proc freq Nlevels will tell if there are missing values for variables.

Data ranges: Variable X should have values of <some list goes here> examples: survey question category answers, product types or departments for categorical data. Numeric may have "natural range" such as Sales should be >= 0 (though I have seen some idiots use - values for sales as a return); data collection instruments may a theoretically unlimited range but practical limits: air temperature values of < -100F or > 150F are highly suspect anywhere and geography likely provides better limits.

Expected membership in a category is easily checked with custom informats and using the _error_ option. The following will only allow values of F and M to be accepted for the variable. Note that it complains about lowercase. So this is an exact comparison.

Proc format library=work;
invalue $gender
'F','M' = _same_
other = _error_
;
run;

data example;
   input gender $gender.;
datalines;
F
M
a
f
.
m
;
run;

numeric ranges can be tested similarly when read if you know a range. If missing is acceptable you would provide a line in the informat code for that so the Other doesn't catch it.

If you have values that you expect in a range but sometimes go out, do not make the values from the informat as errors but you might provide data checking code to report low or high values.

The next set of relatively easy checks are referential test based on two or more variables in the data. Can a persons Date of Birth be later than the date of a hospital visit for instance. It is easy to code things such as: If varb > vara then put "WARNING: Measure a exceeds b for " <id variables>;

Proper informat choices for date, time and datetime values can provide a lot of help when the data source gets fubarred. I have some dates from project that looked like: 12020121 If that date was supposed to be YYYYMMDD then the year was 1202, if the data was supposed to be MMDDYYYY then the year was 121. In either case highly improbable for an STD test.

Other forms of referential data depend on your knowledge of the subject and you data. Is the sale amount for this month much larger or smaller than seems "right" for that store/company/department/product. Snow shovel sales in the US tend to peak with the first regional major snowstorm. If a store sells a bunch of then in July (or December for the folks in the southern hemisphere) that might be an indicator.

Another referential might be you expect at least one record per time interval. So you have to examine the actual intervals. Or only one record per time interval, more than one would be an issue.

If there is a data layout or agreement document you may have some of the expected values, and possibly even edits performed before you see the data. If you have agreement or content documents that is the first place to look to Know Thy Data.

kiranv_ · Posted 10-01-2018 06:07 PM

one approach, this was done when comparing to excel file, this might be useful to you

/* first build a reference table/standard of what you really want
and the comparisions with your file*/
proc sql;
	create table ref_paylist
 (IdNum char(4), Gender char(1), Jobcode char(3), Salary num, Birth num, Hired 
		num informat=date7. format=date7.);

	/* secondly check the column names and capture non matching name immediately
 after you have read external data */
proc sql noprint;
	select case when name is missing then "ALL COLUMN NAMES MATCHED" else name end 
		into :COLUMN_NAME_MATCHORNOMATCH SEPARATED by ',' from
 (select coalesce(a.name, b.name) as name, count(*) from (Select name from 
		Dictionary.columns where upcase(libname)=upcase('work') and 
		upcase(memname)=upcase('ref_paylist'))a full join
 (Select name from Dictionary.columns
		/* this is your file*/
		where upcase(libname)=upcase('work') and 
		upcase(memname)=upcase('test_paylist1'))b on upcase(a.name)=upcase(b.name) 
		where a.name is missing or b.name is missing);

	/* compare the columns length and other things
	you can change to whatever you want*/
proc sql noprint;
	select case when name is missing then 
		"ALL COLUMN Type position length MATCHED" else name end 
		into :COLUMN_Values_MATCHorNOMATCH SEPARATED by ',' from
 (select coalesce(a.name, b.name) as name, count(*) from (Select name, varnum, 
		type, length from Dictionary.columns where upcase(libname)=upcase('work') and 
		upcase(memname)=upcase('ref_paylist'))a inner join
 (Select name, varnum, type, length from Dictionary.columns where 
		upcase(libname)=upcase('work') and upcase(memname)=upcase('test_paylist1'))b 
		on upcase(a.name)=upcase(b.name) where a.varnum ne b.varnum or a.type ne 
		b.type or a.length ne b.length);

	/* capture the above data in table*/
Data Temp;
	Length column $500.;

	If index(strip("&COLUMN_NAME_MATCHORNOMATCH"), "ALL COLUMN")>0 then
		column="&COLUMN_NAME_MATCHORNOMATCH";
	else
		column=catx(' ', "Unmatched columns is/are", "&COLUMN_NAME_MATCHORNOMATCH");
	output;

	If index(strip("&COLUMN_Values_MATCHorNOMATCH "), "ALL COLUMN")>0 then
		column="&COLUMN_Values_MATCHorNOMATCH ";
	else
		column=catx(' ', "Unmatched columns for type or length or position
 is/are", "&COLUMN_Values_MATCHorNOMATCH ");
	output;
run;

/* do reporting */
Title "Final info about metadata for test_paylist1 table";

Proc report data=temp nowd;
	column column;
	define column/display "Column";
	compute column;

		if index(column, "ALL COLUMN")=0 then
			call define
 (_col_, "style", "style={background = red}");
		else
			call define
 (_col_, "style", "style={background = green}");
	endcomp;
run;

elwayfan446 · Posted 10-04-2018 02:35 PM

Thank you for the answers everyone. I will give these a try and report back. I appreciate it.

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!