<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Cleaning randomly messy data in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Cleaning-randomly-messy-data/m-p/498193#M132330</link>
    <description>&lt;P&gt;So let's say I have the following fields:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE class="  language-sas"&gt;DATA HAVE;&lt;BR /&gt;INFILE DATALINES ;&lt;BR /&gt; input ID NAME $ STREET $ CITY $ STATE $ POSTCODE $ RELATIONSHIP $ STATUS $ PURPOSE $ DONATION $;&lt;BR /&gt; DATALINES;&lt;BR /&gt;201 AAA Market Philadelphia PA 4109 Parent Open Counselling 10000&lt;BR /&gt;201 ABC Chestnut Arlington TX 1093 None Open General 1500&lt;BR /&gt;201 BCD Walnut Walnut Sidney NY 3201 None Open General &lt;BR /&gt;1999 201 EFG Cross Kansas TX 1091 Parent Close Sports &lt;BR /&gt;1491 202 EFG Cluedo Street Phoenix AZ 2012 Close General 1900&lt;BR /&gt;;&lt;BR /&gt;RUN;&lt;/PRE&gt;&lt;P&gt;Which gives me the following output:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="13.JPG" style="width: 600px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/23470iA7CAC62B69EEF9E3/image-size/large?v=v2&amp;amp;px=999" role="button" title="13.JPG" alt="13.JPG" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;You can see there are three problems here:&lt;/P&gt;&lt;P&gt;1) The street "Walnut" has been imported twice, shifting the values in the later columns one position to the right.&lt;/P&gt;&lt;P&gt;2) The street "Cluedo Street" has been split across two lines instead of one, causing a similar shift.&lt;/P&gt;&lt;P&gt;3) The "Relationship" value is omitted in the final row. Where it should read "None", it is missing altogether, so even in the absence of the first two errors, the "Relationship" column here would read "Close" instead of "None".&lt;/P&gt;&lt;P&gt;Suppose there are thousands of issues similar, but not identical, to the ones above in a data set with millions of observations. They will be similar in that each usually involves a random or repeated omission of some fields, duplicated values, or values spanning more columns than they should.&lt;/P&gt;&lt;P&gt;Assuming the exact same column names as above, is it plausible to create criteria, or a program, that could reasonably correct most of these issues? Or do I simply need to request a cleaner data set?&lt;/P&gt;</description>
    <pubDate>Sun, 23 Sep 2018 09:20:18 GMT</pubDate>
    <dc:creator>UniversitySas</dc:creator>
    <dc:date>2018-09-23T09:20:18Z</dc:date>
    <item>
      <title>Cleaning randomly messy data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Cleaning-randomly-messy-data/m-p/498193#M132330</link>
      <description>&lt;P&gt;So let's say I have the following fields:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE class="  language-sas"&gt;DATA HAVE;&lt;BR /&gt;INFILE DATALINES ;&lt;BR /&gt; input ID NAME $ STREET $ CITY $ STATE $ POSTCODE $ RELATIONSHIP $ STATUS $ PURPOSE $ DONATION $;&lt;BR /&gt; DATALINES;&lt;BR /&gt;201 AAA Market Philadelphia PA 4109 Parent Open Counselling 10000&lt;BR /&gt;201 ABC Chestnut Arlington TX 1093 None Open General 1500&lt;BR /&gt;201 BCD Walnut Walnut Sidney NY 3201 None Open General &lt;BR /&gt;1999 201 EFG Cross Kansas TX 1091 Parent Close Sports &lt;BR /&gt;1491 202 EFG Cluedo Street Phoenix AZ 2012 Close General 1900&lt;BR /&gt;;&lt;BR /&gt;RUN;&lt;/PRE&gt;&lt;P&gt;Which gives me the following output:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="13.JPG" style="width: 600px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/23470iA7CAC62B69EEF9E3/image-size/large?v=v2&amp;amp;px=999" role="button" title="13.JPG" alt="13.JPG" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;You can see there are three problems here:&lt;/P&gt;&lt;P&gt;1) The street "Walnut" has been imported twice, shifting the values in the later columns one position to the right.&lt;/P&gt;&lt;P&gt;2) The street "Cluedo Street" has been split across two lines instead of one, causing a similar shift.&lt;/P&gt;&lt;P&gt;3) The "Relationship" value is omitted in the final row. Where it should read "None", it is missing altogether, so even in the absence of the first two errors, the "Relationship" column here would read "Close" instead of "None".&lt;/P&gt;&lt;P&gt;Suppose there are thousands of issues similar, but not identical, to the ones above in a data set with millions of observations. They will be similar in that each usually involves a random or repeated omission of some fields, duplicated values, or values spanning more columns than they should.&lt;/P&gt;&lt;P&gt;Assuming the exact same column names as above, is it plausible to create criteria, or a program, that could reasonably correct most of these issues? Or do I simply need to request a cleaner data set?&lt;/P&gt;</description>
      <pubDate>Sun, 23 Sep 2018 09:20:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Cleaning-randomly-messy-data/m-p/498193#M132330</guid>
      <dc:creator>UniversitySas</dc:creator>
      <dc:date>2018-09-23T09:20:18Z</dc:date>
    </item>
    <item>
      <title>Re: Cleaning randomly messy data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Cleaning-randomly-messy-data/m-p/498198#M132334</link>
      <description>&lt;P&gt;If your data items can contain blanks, you need to use a delimiter other than blank, or have the data items enclosed in quotes, so that the DSD option can be used.&lt;/P&gt;</description>
      <pubDate>Sun, 23 Sep 2018 09:38:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Cleaning-randomly-messy-data/m-p/498198#M132334</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2018-09-23T09:38:09Z</dc:date>
    </item>
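A minimal sketch of the delimiter/DSD approach the reply describes, assuming the extract can be regenerated as a comma-delimited file with quoted fields; the LENGTH values and sample lines below are hypothetical. With DSD, two consecutive delimiters yield a missing value instead of shifting later columns, and a quoted value such as "Cluedo Street" stays in one field even though it contains a blank:

```sas
/* Sketch only: re-reading the extract as comma-delimited with DSD. */
data have;
  infile datalines dsd dlm=',' truncover;
  length id 8 name $8 street $20 city $20 state $2
         postcode $8 relationship $10 status $8 purpose $15 donation $8;
  input id name street city state postcode relationship status purpose donation;
  datalines;
201,BCD,"Walnut Walnut",Sidney,NY,3201,None,Open,General,1999
202,EFG,"Cluedo Street",Phoenix,AZ,2012,,Close,General,1900
;
run;
/* In the second line, the empty field between commas becomes a
   missing RELATIONSHIP rather than pulling STATUS into its place. */
```

This does not repair an already mis-parsed data set; it avoids the mis-parsing at read time, which is why re-requesting the extract in a quoted, delimited format is usually the more reliable fix.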
  </channel>
</rss>

