DATA Step, Macro, Functions and more

Data cleaning after double-entry with missing

Reply
Occasional Contributor
Posts: 5

Data cleaning after double-entry with missing

I am trying to correct the master dataset using the Correct dataset. The correct dataset was produced after the double-entry. The second column is the Correct dataset indicates the name of the variable that needs to be corrected. The problem about the code below is the missing. Some of the entries that need to be corrected is the missing. If missing is the correct entry then the master dataset should be updated as such. However, when I transpose the Correct dataset, additional empty cells are created, which SAS can't distinguish the different between the correct missing vs missing entries resulted from the transpose. I need a code that only updates the variables listed in the Correct dataset. Thank you in advance.

data master;

  input ID $ var1 $ var2 $;

  datalines;

  123 Yes Red

  124 No Blue

  125 Yes Yellow;


  data correct ;

  input ID $ variable $ correct $;

  datalines;

  123 var1 No

  123 var2 Orange

  124 var2 Red

  125 var1 99;

proc sort data=Correct;by ID;run;

proc transpose data=correct out=cwide (drop=_name_);

  by ID;

  var correct;

  id variable;

run;

proc contents data =cwide;run ;


Data Final;

  update master cwide;

  by ID;

run;

proc print data=final; title "Corrected";run;

Super Contributor
Super Contributor
Posts: 3,174

Re: Data cleaning after double-entry with missing

Did you execute the SAS code piece and analyze/compare the outcome?  The resulting WORK.FINAL dataset is correct, with the replaced column/variable values from the WORK.CORRECT file as would be expected.  With an UPDATE request, SAS does not manipulate any columns for which there is a blank / missing value identified in the UPDATE statement (second argument / file).

Scott Barry

SBBWorks, Inc.

Occasional Contributor
Posts: 5

Re: Data cleaning after double-entry with missing

Right. This is a  "correct missing vs missing entries" question. The var1 for 125 should be empty in the Final dataset but it is not because after transposing Correct dataset, SAS creates the missing field for  Var1 for 124 and Var2 for 125. When I update the Master with the transposed dataset, SAS can't differentiate the 'correct' missing (=Var1 for 125) and other missing entries created by SAS (Var1 for 124 and Var2 for 125).

The Final dataset should show that Var1 for 125 is empty. I hope this clear up some confusion. Sorry about that.

Valued Guide
Posts: 860

Re: Data cleaning after double-entry with missing

I'd add a step to identify missing.  Not sure that I would have come up with the same solution from the start but after looking at the transpose that makes sense to keep.  Although I enjoyed the smiley faces that didn't really answer my previous question.  Below I am loading the file as CSV, so the blank comes through and is clearly missing due to the comma.

This will work, not sure if it suits your purposes.

data master;

infile datalines dsd;

  input ID $ var1 $ var2 $;

  datalines;

123,Yes,Red

124,No,Blue

125,Yes,Yellow

;

data correct;

infile datalines dsd;

  input ID $ variable $ correct $;

  datalines;

123,var1,No

123,var2,Orange

124,var2,Red

125,var1,

;

data correct_II;

set correct;

if missing(correct) then correct = 'missing';

run;

proc sort data=Correct;by ID;run;

proc transpose data=correct_II out=cwide (drop=_name_);

  by ID;

  var correct;

  id variable;

run;

data final;

update master cwide;

by id;

if var1 = 'missing' then call missing(var1);

if var2 = 'missing' then call missing(var2);

run;

Super User
Posts: 10,041

Re: Data cleaning after double-entry with missing

What do you mean by "correct missing vs missing entries" ?

Do you mean if any one of var1 and var2 in CORRECT is missing , the updated should be stopped ?

Code: Program

data master;
  input ID $ var1 $ var2 $;
  datalines;
  123 Yes Red
  124 No Blue
  125 Yes Yellow
  ;

  data correct ;
  input ID $ variable $ correct $;
  datalines;
  123 var1 No
  123 var2 Orange
  124 var2 Red
  125 var1 99
  ;

proc sort data=Correct;by ID;run;
proc transpose data=correct out=cwide (drop=_name_);
  by ID;
  var correct;
  id variable;
run;

Data Final;
  merge master cwide(rename=(var1-var2=_var1-_var2));
  by ID;
  if cmiss(of _var1-_var2)=0 then do;var1=_var1;var2=_var2;end;
  drop _:;
run;

Xia Keshan

Occasional Contributor
Posts: 5

Re: Data cleaning after double-entry with missing

"correct missing vs missing entries". Sorry for the confusion.

Valued Guide
Posts: 860

Re: Data cleaning after double-entry with missing

If something comes through in the correct dataset as missing how is it shown,  How do you know something should be 'correctly' missing?

Occasional Contributor
Posts: 5

Re: Data cleaning after double-entry with missing

Posted in reply to Steelers_In_DC

Right. When you transpose the Correct dataset it will get:

**I modified the original Correct dataset -- For 125, the var1 should be empty.

The problem is that when I transpose the Correct dataset, transposed dataset will automatically create the missing field in var1 for 124 and var 2 for 125 (marked with :smileycrySmiley Happy -- var1 (marked with :smileygrin: ) should be empty for 125 as it is "correctly' missing. However when I update using this transposed Correct dataset, SAS won't update the Var1 to missing as SAS can't tell the difference between :smileygrin: and :smileycry:.

ID          Var1      Var2

123     No         Orange

124     :smileycry:         Red

125      :smileygrin:              :smileycry:

I hope this helps to clear up things.


Occasional Contributor
Posts: 5

Re: Data cleaning after double-entry with missing

*I modified the original Correct dataset -- For 125, the var1 should be empty.

The problem is that when I transpose the Correct dataset, transposed dataset will automatically create the missing field in var1 for 124 and var 2 for 125 (marked with :smileycrySmiley Happy -- var1 (marked with :smileygrin: ) should be empty for 125. However I am only interested to update ones from the Correct dataset --meaning the Final dataset should show that var1 for 125 is empty.

ID          Var1      Var2

123     No         Orange

124     :smileycry:         Red

125      :smileygrin:              :smileycry:

data master;

  input ID $ var1 $ var2 $;

  datalines;

  123 Yes Red

  124 No Blue

  125 Yes Yellow;


  data correct ;

  input ID $ variable $ correct $;

  datalines;

  123 var1 No

  123 var2 Orange

  124 var2 Red

  125 var1  .  ; ** This should be empty. (Character variable --  I didn't know what to put other than .)

proc sort data=Correct;by ID;run;

proc transpose data=correct out=cwide (drop=_name_);

  by ID;

  var correct;

  id variable;

run;

proc contents data =cwide;run ;


Data Final;

  update master cwide;

  by ID;

run;

Super Contributor
Super Contributor
Posts: 3,174

Re: Data cleaning after double-entry with missing

Clearly UPDATE processing doesn't work with such desired behavior, so you will want to consider assigning some "placeholder" unique-string data-value in WORK.CORRECT that would represent "empty" (which would not appear in your data normally), then in the final DATA step to create WORK.FINAL, adjust the data-value(s) back to blank/missing after the BY statement.

Scott Barry

SBBWorks, Inc.

Super User
Super User
Posts: 7,060

Re: Data cleaning after double-entry with missing

If you want your transactions to be able to convert the value to missing then you need to set the value to ._ in the transaction dataset.

data master;

  input ID $ var1 $ var2 $;

datalines;

  123 Yes Red

  124 No Blue

  125 Yes Yellow

;;;;

data trans ;

  infile cards truncover ;

  if 0 then set master;

  input id= var1= var2= ;

datalines;

id=123 var1= No

id=123 var2= Orange

id=124 var2= Red

id=125 var1= _

;;;;

data Final;

  update master trans;

  by ID;

run;

proc print;

run;

Ask a Question
Discussion stats
  • 10 replies
  • 322 views
  • 0 likes
  • 5 in conversation