DATA Step, Macro, Functions and more

Efficient data cleaning and recoding

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 78
Accepted Solution

Efficient data cleaning and recoding

I have the following 7 variables in my dataset:

type_1 - type_7

 

The observations for each one of these is supposed to be either "checked" or "unchecked"in the datset. The first thing I want to do is check that there is only 1 observation for the 7 variables that is "checked" and the rest are "unchecked." What is the most efficient way to do this?

 

The second thing I want to do is recode the 7 variables into 1 variable where the value is the label for variable that is "checked."  What is the most efficient way to do this?


Accepted Solutions
Solution
‎05-22-2017 07:02 PM
PROC Star
Posts: 7,474

Re: Efficient data cleaning and recoding

BTW: Interestinglly, the exact same code works if the variables are numeric 1s and 0s, rather than character "1"s and "0"s:

data have;
  label type_1='type of surgery is aaa';
  label type_2='type of surgery is bbb';
  label type_3='type of surgery is ccc';
  label type_4='type of surgery is ddd';
  label type_5='type of surgery is eee';
  label type_6='type of surgery is fff';
  label type_7='type of surgery is ggg';
  input type_1 - type_7;
  cards;
1 0 0 0 0 0 0
0 0 0 0 0 1 1
0 0 1 0 0 0 0
0 0 0 0 0 0 0
;

data want multiples missing;
  set have;
  array vars(*) type_1-type_7;
  recnum=_n_;
  if sum(of vars(*)) eq 0 then output missing;
  else if sum(of vars(*)) gt 1 then output multiples;
  else do;
    type=substr(vlabel(vars(whichn(1,of vars(*)))),20);
    output want;
  end;
run;

Art, CEO, AnalystFinder.com

 

View solution in original post


All Replies
Trusted Advisor
Posts: 1,022

Re: Efficient data cleaning and recoding

Please help us help you.  How about showing what the data looks like before your process.  And what you want it to look like afterwards.

PROC Star
Posts: 7,474

Re: Efficient data cleaning and recoding

Are the current values "checked" and 'unchecked" or, if not, are the variables character or numeric, and what values represent checked and unchecked. Also, what do you want to do if more than one is checked?

 

Art, CEO, AnalystFinder.com

Frequent Contributor
Posts: 78

Re: Efficient data cleaning and recoding

[ Edited ]

Yes sorry! Here is a look at the data:

 

value type_1 0='Unchecked' 1='Checked';

label type_1 'Type of surgery is xxx';

 

     same for type_2 - type_7 but different surgeries xxx.

 

type_1            type_2             type_3     ....      type_7

unchecked      checked          unchecked         unchecked

unchecked      unchecked      unchecked         checked

...

 

 Would like it to look like:

type

xxx (from type_2)

xxx (from type_7)

 

 

if more than 1 is checked, I just want to be able to identify that and check the discrepancy with the investigator.

 

Thanks for your help!

PROC Star
Posts: 7,474

Re: Efficient data cleaning and recoding

[ Edited ]
data have;
  informat type_1 - type_7 $9.;
  label type_1='type of surgery is aaa';
  label type_2='type of surgery is bbb';
  label type_3='type of surgery is ccc';
  label type_4='type of surgery is ddd';
  label type_5='type of surgery is eee';
  label type_6='type of surgery is fff';
  label type_7='type of surgery is ggg';
  input type_1 - type_7;
  cards;
checked  unchecked  unchecked  unchecked  unchecked  unchecked  unchecked
unchecked  unchecked  unchecked  checked  unchecked  unchecked  unchecked
checked  unchecked  checked  unchecked  unchecked  unchecked  unchecked
;

data want multiples;
  set have;
  array vars(*) $ type_1-type_7;
  if count(catt(of vars(*)), 'un') lt 6 then output multiples;
  else do;
    type=substr(vlabel(vars(whichc('checked',of vars(*)))),20);
    output want;
  end;
run;

Art, CEO, AnalystFinder.com

 

Super User
Posts: 11,343

Re: Efficient data cleaning and recoding

Minor difference on @art297's approach basically to turn the checked/unchecked into numerics AND report as read that there may be a problem. Then clean it before

proc format library=work;
invalue Checked
'checked'  = 1
'unchecked'= 0
other=.;
run;

data have;
  informat type_1 - type_7 Checked.;
  label type_1='type of surgery is aaa';
  label type_2='type of surgery is bbb';
  label type_3='type of surgery is ccc';
  label type_4='type of surgery is ddd';
  label type_5='type of surgery is eee';
  label type_6='type of surgery is fff';
  label type_7='type of surgery is ggg';
  input type_1 - type_7;
  TypeSum = sum(of Type_:);
  if typesum ne 1 then put "Problem with number checked for Type in record: " _n_;
 cards;
checked  unchecked  unchecked  unchecked  unchecked  unchecked  unchecked
unchecked  unchecked  unchecked  checked  unchecked  unchecked  unchecked
checked  unchecked  checked  unchecked  unchecked  unchecked  unchecked
;
run;
PROC Star
Posts: 7,474

Re: Efficient data cleaning and recoding

OK. I hadn't correctly read you're specs. How about:

data have;
  informat type_1 - type_7 $9.;
  label type_1='type of surgery is aaa';
  label type_2='type of surgery is bbb';
  label type_3='type of surgery is ccc';
  label type_4='type of surgery is ddd';
  label type_5='type of surgery is eee';
  label type_6='type of surgery is fff';
  label type_7='type of surgery is ggg';
  input type_1 - type_7;
  cards;
1 0 0 0 0 0 0
0 0 0 0 0 1 1
0 0 1 0 0 0 0
0 0 0 0 0 0 0
;

data want multiples missing;
  set have;
  array vars(*) type_1-type_7;
  recnum=_n_;
  if sum(of vars(*)) eq 0 then output missing;
  else if sum(of vars(*)) gt 1 then output multiples;
  else do;
    type=substr(vlabel(vars(whichn(1,of vars(*)))),20);
    output want;
  end;
run;

Art, CEO, AnalystFinder.com

 

Solution
‎05-22-2017 07:02 PM
PROC Star
Posts: 7,474

Re: Efficient data cleaning and recoding

BTW: Interestinglly, the exact same code works if the variables are numeric 1s and 0s, rather than character "1"s and "0"s:

data have;
  label type_1='type of surgery is aaa';
  label type_2='type of surgery is bbb';
  label type_3='type of surgery is ccc';
  label type_4='type of surgery is ddd';
  label type_5='type of surgery is eee';
  label type_6='type of surgery is fff';
  label type_7='type of surgery is ggg';
  input type_1 - type_7;
  cards;
1 0 0 0 0 0 0
0 0 0 0 0 1 1
0 0 1 0 0 0 0
0 0 0 0 0 0 0
;

data want multiples missing;
  set have;
  array vars(*) type_1-type_7;
  recnum=_n_;
  if sum(of vars(*)) eq 0 then output missing;
  else if sum(of vars(*)) gt 1 then output multiples;
  else do;
    type=substr(vlabel(vars(whichn(1,of vars(*)))),20);
    output want;
  end;
run;

Art, CEO, AnalystFinder.com

 

Frequent Contributor
Posts: 78

Re: Efficient data cleaning and recoding

Thank you! Worked beautifully.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 205 views
  • 2 likes
  • 4 in conversation