Programming the statistical procedures from SAS

Messy Data-clean up

Reply
Super Contributor
Posts: 1,040

Messy Data-clean up

Please help me build the code with the following instructions for the code below:

Variable X can take values of A B or Missing.

When A and B are present for the same patient I want to keep both the observations(pt no 1 below)

when A and missing are present for the same patient I want to consider only the observation containing A .(pt no 2 below)

when missing and missing are present for the same patient I want to consider only one of those missing observations .(pt no3 below)

when missing  and A are present for the same patient I want to consider only the observation containing A .(pt no 4 below)

when both have a value of A then I want to consider only one observation  from those two(pt no 5 below)

In the last case I want to consider only one observation containing A

variable X(charecter)        patient No

A                                           1

B                                           1

A                                           2

Missing                                  2

Missing                                  3

Missing                                3

                                         

Missing                                4

A                                           4

A                                             5

A                                             5

A                                              6

A                                              6

Missing                                    6

Thanks

Super User
Posts: 9,782

Re: Messy Data-clean up

OK. How about:

data have;
input x $ no;
cards;
A                                           1
B                                           1
A                                           2
.                                  2
.                                  3
.                                3 
.                                4
A                                           4
A                                             5
A                                             5
A                                              6
A                                              6
.                                    6
;
run;
proc sql;
create table want as
select distinct * from have where x is not missing 
union 
select distinct * from have group by no having sum(missing(x)) eq count(no)
 order by no;
quit;

Ksharp

Respected Advisor
Posts: 3,152

Re: Messy Data-clean up

How about join:

proc sql noprint;

  create table want as

  select * from

    (select distinct pno from have) a

      left join

      (select distinct * from have where x is not missing)b

      on

      a.pno=b.pno;

quit;

Haikuo

PROC Star
Posts: 7,417

Re: Messy Data-clean up

One way to do it using data step code might be:

proc sort data=have;

  by no descending x;

run;

data want;

  set have;

  by no descending x;

  if first.no then do;

    counter=0;

    mcounter=0;

  end;

  counter+1;

  if missing(x) then do;

    mcounter+1;

    if last.no and counter eq mcounter then output;

  end;

  else if last.x then output;

run;

Super Contributor
Posts: 1,040

Re: Messy Data-clean up


The code works perfectly fine.

Thanks for the effort.

I have a couple of questions though.

when I ran the code till mcounter+1 it is as shown below.

Please correct me if my interpretation is wrong.

Also explain what is the meaning of mcounter

X    no      counter      mcounter

B     1           1                0

A      1           2                0

A       2          1                0

          2          2                1------------------missing X so mcounter is 1

          3          1                1------------------missing X so mcounter shd have been 1(current value from above) +1=2

          3           2                2 -----------------missing X so mcounter shd have been 2(current value from above)+1=3

A        4           1                0      

PROC Star
Posts: 7,417

Re: Messy Data-clean up

Not sure exactly what you ran.  The code sets the mcounter values as you described, not what you mentioned as current values.  Since x was sorted in descending order, missing values will come last.  The purpose of having both counter and mcounter was so that the code could identify where an id (no) had only missing values.  In such a case, it could both identify and output the record.

mcounter, for me, just represented missing_counter, i.e., the number of missing values.

Respected Advisor
Posts: 3,152

Re: Messy Data-clean up

Here is another data step using array(). It is not as robust as Art's approach, as you need to preset the max quantity of members one id can contain.

data have;

input (X        pNo) (:$);

cards;

A                                           1

B                                           1

A                                           2

.                                  2

.                                  3

.                              3

.                                4

A                                           4

A                                             5

A                                             5

A                                              6

A                                              6

.                                    6

;

data want;  array t(1:10) $1. _temporary_;

_c=0;

do _n_=1 by 1 until (last.pno);

  set have;

   by pno notsorted;

    if x not in t then do;

        output;

        t(_n_)=x;

        _c+1;

      end;

    if last.pno and _c=0 then output;

end;

call missing (of t(*));

drop _c;

run;

Haikuo

Super Contributor
Posts: 1,636

Re: Messy Data-clean up

another way:

data have;

input x $ no;

cards;

A                                           1

B                                           1

A                                           2

.                                  2

.                                  3

.                                3

.                                4

A                                           4

A                                             5

A                                             5

A                                              6

A                                              6

.                                    6

;

proc sort data=have(where=(^missing(x))) out=have1 nodupkey;

by no x;

proc sort data=have(where=(missing(x))) nodupkey dupout=have2 ;

by no ;

data want;

  set have1  have2;

  by no;

run;

proc print;run;

Super Contributor
Posts: 1,040

Re: Messy Data-clean up

Linlin thnaks for your efforts too...

What is special about dupout????

i understood till:

all missings are considered and duplicates of missing are removed by no. I should have got have 2 like shown below:

.     2

.      3

.      4

.      6

With your dupout code I am only getting

.     3

Thanks

PROC Star
Posts: 7,417

Re: Messy Data-clean up

Dupout can be useful, but not (I don't think) for what you are trying to do.  Dupout captures ALL of the duplicates EXCEPT for the initial one.  In the case of your current problem, it doesn't do what you want.  E.g., the following would capture all kinds of records that you don't want to capture:

data have;

input x $ no;

cards;

A                                           1

B                                           1

A                                           2

.                                  2

.                                  2

.                                  3

.                                  3

.                                3

.                                4

A                                           4

A                                             5

A                                             5

A                                              6

A                                              6

.                                    6

;

proc sort data=have(where=(^missing(x))) out=have1 nodupkey;

by no x;

proc sort data=have(where=(missing(x))) nodupkey dupout=have2 ;

by no ;

data want;

  set have1  have2;

  by no;

run;

Super Contributor
Posts: 1,636

Re: Messy Data-clean up

Art,

My code works for the sample data OP provided.:smileysilly:

Super Contributor
Posts: 1,040

Re: Messy Data-clean up

Hi Arthur,

Linlins code is also giving the right answer

She is creating a dupout dataset with no duplicates(have2) and setting it finally to have1;

to remind you again when we have x doubly missing we shd take it once.sheould not eliminate completely

Would you agree to that

PROC Star
Posts: 7,417

Re: Messy Data-clean up

Yes, it gives the right answer for the sample you provided but, if one id has one record with A, and two records with missing, I understood that you would only want the one record with A.

The proc sort approach would give you one record with A and one record with missing.

Thus the answer depends upon what you really want.

Respected Advisor
Posts: 3,152

Re: Messy Data-clean up

Inspired by Linlin's proc sort and Using the strategy I suggested in SQL approach, we can also use the following:

proc sort data=have(keep=no) out=have1 nodupkey;

by no;

proc sort data=have(where=(^missing(x))) nodupkey out=have2 ;

by no x;

data want;

  merge have1  have2;

  by no;

run;

Haikuo

Super Contributor
Posts: 1,040

Re: Messy Data-clean up

Hi Arthur,(listed is the partial output of your code when run till mcounter+1)

X        no    counter      mcounter

            3         1                1

             3        2                2

A         7        1                 0

           7         2                1

           7          3                2

if last.no and counter eq mcounter then output;

end;

else if last.x then output;

for pt no7 : for last.no (7    3    2 )   counter is ne mcounter so that condition is false.

now it has to output last.x

In the output for your code I get ( 7    1     0)

Should it not have to be         (    7    3       2)

Could you justify please

Ask a Question
Discussion stats
  • 17 replies
  • 597 views
  • 1 like
  • 5 in conversation