Deleting records selectively

Accepted Solution Solved
Reply
Contributor
Posts: 39
Accepted Solution

Deleting records selectively

Hello Members,

I am stuck at a rather simple problem.

Problem Statement:

I have a dataset from where I would like to selectively delete some records. In the below dataset, I have introduced the count variable called CNT, which is the count of Prc_CD by Id, Date and Prc_CD.

Dataset:

ID      Date           Prc_CD CnNT

101   25Aug2012    C         14

101   25Aug2012     S          2

101   25Aug2012     'null'      2

102   01Jul2012      S            2

102   05May2012   S            6

103    28Jul2012      C           5

103   16Dec2012      C          4

Question is: For a combination of Id and Date, I would like to keep the record for C, else S, if none is available, the null.I tried several ways including lag function. But just cannot produce the right logic.

Scenarios: For a combination of ID and Date, there could be only S in Prc_CD also. So if there is no 'C', I keep S, else null. First priority is C, then S.

Some thing like this:

data want;

set have;

by Id;

lag_id=lag(id);

if first.id then lag_id= ;

if id=lag_id and Prc_CD='C' then do;

     if Prc_CD='S' then delete;

end;

run;

This deletes records with 'S', even if there is no 'C'. Which is not right, as I intend to keep records with 'S' when there is no 'C'. And intend to keep null, when there is no 'C' or 'S'.

Can this data step operation be achieved in Proc SQL as well?

Your direction will be highly appreciated.


Accepted Solutions
Solution
‎06-17-2013 09:23 AM
Respected Advisor
Posts: 3,156

Re: Deleting records selectively

Not sure if this is what you are after, but I have prepared some solutions for you. out of which, the DOW is the most classic, the Hash is the best overall (speed, no sort required), and Proc SQL is the least efficient.

data have;

input ID  :$ Date :date9.           Prc_CD :$ CnNT;

cards;

101   25Aug2012 C         14

101   25Aug2012 S          2

101   25Aug2012 'null'      2

102   01Jul2012      S            2

102   05May2012 S            6

103    28Jul2012      C           5

103   16Dec2012      C 4

;

/*SQL*/

proc sql;

  create table want_sql as

  select * from

    (select * from have

        group by id

having sum(prc_cd='C')>0)   where prc_cd='C'

     union

       select * from

      (select * from have

          group by id

            having sum(prc_cd='C') =0 and sum(prc_cd='S')>0 ) where prc_cd='S'

       union

       select * from

         (select * from have

group by id

having sum(prc_cd in ('C', 'S'))=0) where prc_cd="'null'"

      ;

      quit;

      /*Hash*/

data _null_;

  if _n_=1 then do;

         declare hash h(multidata:'y', ordered:'y');

h.definekey('id','prc_cd');

h.definedata('id','date','prc_cd','cnnt');

h.definedone();

      end;

      set have end=last;

         if prc_cd='C' then h.replace();

         else if prc_cd='S' and h.find(key:id, key:'C') ne 0 then h.replace();

         else if prc_cd not in ('C','S') and h.find(key:id,key:'C') ne 0 and h.find(key:id, key:"S") ne 0 then do; prc_cd='null';h.replace();end;

         if last then h.output(dataset:'want_hash');

      run;

      /*DOW*/

data want_dow;

do until (last.id);

  set have;

   retain flag ' ';

    by id notsorted;

      if prc_cd='C' then flag='C';

        else if prc_cd='S' and flag ne 'C' then flag='S';

        else flag=coalescec(flag,prc_cd);

end;

  do until (last.id);

    set have;

        by id notsorted;

        if prc_cd=flag then output;

   end;

   call missing(flag);

run;

/*Stack*/

data want_stack;

do until (last.id);

  set have(in=up) have(in=down);

    by id;

      retain flag ' ';

        if prc_cd='C' then flag='C';

        else if prc_cd='S' and flag ne 'C' then flag='S';

        else flag=coalescec(flag,prc_cd);

   if down and prc_cd=flag then output;

end;

call missing (flag);

run;

HTH,

Haikuo

View solution in original post


All Replies
Super User
Posts: 5,437

Re: Deleting records selectively

One not so "clean" solution is to create a new temporary column with the values from Prc_CD, where NULL is replaced by a CHAR that is after S in the alphabet (so it will be sorted last).

Then just sort on id, date and Prc-CD_temp, and then another Proc Sort Nodupkey on id and date.

Data never sleeps
Solution
‎06-17-2013 09:23 AM
Respected Advisor
Posts: 3,156

Re: Deleting records selectively

Not sure if this is what you are after, but I have prepared some solutions for you. out of which, the DOW is the most classic, the Hash is the best overall (speed, no sort required), and Proc SQL is the least efficient.

data have;

input ID  :$ Date :date9.           Prc_CD :$ CnNT;

cards;

101   25Aug2012 C         14

101   25Aug2012 S          2

101   25Aug2012 'null'      2

102   01Jul2012      S            2

102   05May2012 S            6

103    28Jul2012      C           5

103   16Dec2012      C 4

;

/*SQL*/

proc sql;

  create table want_sql as

  select * from

    (select * from have

        group by id

having sum(prc_cd='C')>0)   where prc_cd='C'

     union

       select * from

      (select * from have

          group by id

            having sum(prc_cd='C') =0 and sum(prc_cd='S')>0 ) where prc_cd='S'

       union

       select * from

         (select * from have

group by id

having sum(prc_cd in ('C', 'S'))=0) where prc_cd="'null'"

      ;

      quit;

      /*Hash*/

data _null_;

  if _n_=1 then do;

         declare hash h(multidata:'y', ordered:'y');

h.definekey('id','prc_cd');

h.definedata('id','date','prc_cd','cnnt');

h.definedone();

      end;

      set have end=last;

         if prc_cd='C' then h.replace();

         else if prc_cd='S' and h.find(key:id, key:'C') ne 0 then h.replace();

         else if prc_cd not in ('C','S') and h.find(key:id,key:'C') ne 0 and h.find(key:id, key:"S") ne 0 then do; prc_cd='null';h.replace();end;

         if last then h.output(dataset:'want_hash');

      run;

      /*DOW*/

data want_dow;

do until (last.id);

  set have;

   retain flag ' ';

    by id notsorted;

      if prc_cd='C' then flag='C';

        else if prc_cd='S' and flag ne 'C' then flag='S';

        else flag=coalescec(flag,prc_cd);

end;

  do until (last.id);

    set have;

        by id notsorted;

        if prc_cd=flag then output;

   end;

   call missing(flag);

run;

/*Stack*/

data want_stack;

do until (last.id);

  set have(in=up) have(in=down);

    by id;

      retain flag ' ';

        if prc_cd='C' then flag='C';

        else if prc_cd='S' and flag ne 'C' then flag='S';

        else flag=coalescec(flag,prc_cd);

   if down and prc_cd=flag then output;

end;

call missing (flag);

run;

HTH,

Haikuo

Contributor
Posts: 39

Re: Deleting records selectively

Thank you so very much, for your very valuable input.

Super User
Posts: 5,516

Re: Deleting records selectively

If these are the only variables in your data, it shouldn't be hard to do this in one pass through the data.  Your code seemed to by grouping by ID, where the wording in your question indicated the groupings should be by ID DATE.  So I'm grouping here by ID DATE:

data want;

   set have;

   by id date;

   if first.date or prc_cd='C' then do;

      new_prc_cd = prc_cd;

      new_count = CnNT;

   end;

   else if prc_cd='S' and new_prd_cd ne 'C' then do;

      new_prc_cd = prc_cd;

      new_count = CnNT;

   end;

   retain new_prc_cd new_count;

   if last.date;

   drop prc_cd CnNT;

   rename new_prc_cd = prc_cd new_count = CnNT;

run;

Good luck.

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 4 replies
  • 235 views
  • 6 likes
  • 4 in conversation