Advice on data cleaning

Regular Contributor
Posts: 249

Hi everyone.


Could I please ask your advice on how to write a SAS program to clean my survey data so that it can be rerun as a routine data check, because data collection is still ongoing? That is, anything that has previously been checked and clarified as a logical, plausible, or truly unavailable answer should not show up again the next time I run the data cleaning program.


Thank you very much.

Super User
Posts: 9,402

Re: Advice on data cleaning

This isn't really a Q&A scenario.  If I were asked to do this, I would probably look at something like this:

Say you have data:

SUBJ   Q1   Q2   Q3   Q4...

Now you need to keep a record for each observation, and the above structure isn't good for that.  So step one is to build a normalised dataset, with one row per subject per question:


...         1            ...

...         2            ...
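
A minimal sketch of that normalisation step, assuming the wide data has one row per subject and numeric answers in Q1 to Q4. The data set names (survey_wide, survey_long), the column names (question, result) and the sample values are made up for illustration:

```sas
/* Hypothetical wide survey data: one row per subject, one column per question */
data survey_wide;
    input subj q1 q2 q3 q4;
    datalines;
101 1 . 3 2
102 2 4 . 1
;
run;

/* PROC TRANSPOSE with a BY statement needs the data sorted by subject */
proc sort data=survey_wide;
    by subj;
run;

/* Turn each question column into its own row: subj, question, result */
proc transpose data=survey_wide
               out=survey_long(rename=(col1=result))
               name=question;
    by subj;
    var q1-q4;
run;
```

Missing answers are kept as rows with a missing result, so they can still be flagged and tracked later.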



Why does this change matter so much?  Because this way you can simply add additional data to each observation.  Say you want a flag for locked, a date for last checked, and a coded item for any outstanding query:


...         1            ...            N             12DEC2015    Result_Missing

...         2            ...            Y              14JAN2016



The main thing will be knowing when to update things.  Say you have cleaned a data item and consider it locked: what happens if the next data transfer comes in and that item has changed?


Personally, I would run your suite of checks on the whole dataset at each timepoint and just compare the results to a list of outstanding items.  Pretty simple, but manual.
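
One way this could look in SAS is sketched below. It is only an illustration under some assumptions: the data is already in long format, there is a single check (missing result), and the comparison list is a hypothetical data set called cleared that holds items already reviewed and accepted, together with the value seen at review time. Storing that value is one possible answer to the "what if it changes" problem: a cleared item comes back only if its value differs from what was reviewed.

```sas
/* Hypothetical long-format data from the current transfer */
data survey_long;
    input subj question $ result;
    datalines;
101 q1 1
101 q2 .
102 q1 2
102 q3 .
;
run;

/* Hypothetical log of items already reviewed and accepted,
   with the value that was seen when the item was cleared */
data cleared;
    input subj question $ cleared_result;
    datalines;
101 q2 .
;
run;

/* Step 1: run the checks on all of the data (one example check only) */
data flagged;
    set survey_long;
    length query_code $20;
    if missing(result) then do;
        query_code = 'Result_Missing';
        output;
    end;
run;

/* Step 2: keep flagged items that were never cleared,
   or whose value has changed since they were cleared */
proc sql;
    create table outstanding as
    select f.subj, f.question, f.result, f.query_code
    from flagged as f
    left join cleared as c
      on f.subj = c.subj and f.question = c.question
    where c.subj is missing
       or f.result ne c.cleared_result
    order by f.subj, f.question;
quit;
```

With the sample rows above, the missing Q2 for subject 101 was already cleared and is suppressed, so only subject 102, Q3 is left in outstanding.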


Super User
Posts: 13,295

Re: Advice on data cleaning

If I thought keeping track was absolutely necessary I would ensure that my original data has a unique identifier for each record.

Then after I had checked/cleaned data I would have a data set of the identifiers checked.

The "next time" I cleaned the data I would subset the data to those records whose identifiers were not in the data set of those already checked. Then update the identifier set with those checked. Repeat as needed.
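
A rough sketch of that cycle, assuming a unique identifier column called rec_id. The data set names (survey, checked_ids, to_check) and the sample values are placeholders, not anything from the original data:

```sas
/* Hypothetical survey data with a unique record identifier */
data survey;
    input rec_id q1 q2;
    datalines;
1 3 4
2 . 2
3 5 .
;
run;

/* Hypothetical data set of identifiers checked in earlier rounds */
data checked_ids;
    input rec_id;
    datalines;
1
2
;
run;

/* Subset to the records whose identifiers have not been checked yet */
proc sql;
    create table to_check as
    select *
    from survey
    where rec_id not in (select rec_id from checked_ids);
quit;

/* ... review and clean the records in to_check here ... */

/* Update the identifier set with the records just checked */
proc append base=checked_ids data=to_check(keep=rec_id);
run;
```

In practice checked_ids would live in a permanent library so that it survives between runs. Note that this approach tracks whole records, so a record that changes after it has been checked would not come up for review again.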


But there are a number of other issues involved I don't go into without getting paid...
