Solved: Re: delete contrasting records

kb011235 · Posted 07-29-2021 03:00 PM

Hi,

I'm trying to remove all rows for any key that appears with both add_delete values, "A" and "D".

data have ;
infile datalines delimiter=',';
input key :$50. add_delete :$1. ;
datalines;
1234_1234567654321_P_L720,A
1234_1234567654321_P_L738,A
1234_1234567654321_P_L738,D
1234_1234567654321_P_L821,A
1234_1234567654321_P_L821,D
1234_1234567654321_P_R209,A
1234_1234567654321_P_R209,D
1234_7654321234567_P_L720,A
1234_7654321234567_P_L720,D
1234_7654321234567_P_L738,A
1234_7654321234567_P_L738,D
1234_7654321234567_P_L821,D
1234_7654321234567_P_R209,A
1234_7654321234567_P_R209,D
;

The output of this example should be

key	add_delete
1234_1234567654321_P_L720	A
1234_7654321234567_P_L821	D

It might be as simple as counting the occurrences by key and if it's greater than 1, delete. I'm testing that now. Thanks in advance.

PGStats · Posted 07-29-2021 03:16 PM

Like this?

data have ;
infile datalines dsd;
length key $50 add_delete $1;
input key add_delete;
datalines;
1234_1234567654321_P_L720,A
1234_1234567654321_P_L738,A
1234_1234567654321_P_L738,D
1234_1234567654321_P_L821,A
1234_1234567654321_P_L821,D
1234_1234567654321_P_R209,A
1234_1234567654321_P_R209,D
1234_7654321234567_P_L720,A
1234_7654321234567_P_L720,D
1234_7654321234567_P_L738,A
1234_7654321234567_P_L738,D
1234_7654321234567_P_L821,D
1234_7654321234567_P_R209,A
1234_7654321234567_P_R209,D
;

proc sql;
select 
    * 
from have as a
where not exists (select * from have as b where a.key=b.key and a.add_delete ne b.add_delete);
quit;

PG

Reeza · Posted 07-29-2021 03:07 PM

Recode A to 1 and D to -1.
Then sum it by the KEY. Anything that is 0 gets deleted.
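
A minimal sketch of the recode-and-sum idea in PROC SQL, assuming the HAVE data set from the question and assuming each key has at most one "A" row and at most one "D" row (with repeated rows the sum could be nonzero even though both values occur). The output table name WANT is only illustrative.

```sas
proc sql;
  create table want as
  select key, add_delete
  from have
  group by key
  /* recode A to +1 and D to -1; a key whose A and D cancel sums to 0 and is dropped */
  having sum(case add_delete
               when 'A' then 1
               when 'D' then -1
               else 0
             end) ne 0;
quit;
```

Because add_delete is selected but not grouped, SAS remerges the summary statistic with the detail rows and writes a note about that to the log. That remerge is what lets the surviving rows keep their add_delete value.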

Rydhm · Posted 07-29-2021 03:08 PM

But _L720 also has both 'A' and 'D' for same Key.

1234_7654321234567_P_L720,A
1234_7654321234567_P_L720,D

mkeintz · Posted 07-29-2021 03:37 PM

If both of the following conditions hold:

  1. The data are sorted by KEY, as they are in your sample.
  2. You never have more than one D per key, or more than one A per key.

then a DATA step with a BY statement will work, by keeping only those KEYs with a single observation:

data have ;
infile datalines delimiter=',';
input key :$50. add_delete $1. ;
datalines;
1234_1234567654321_P_L720,A
1234_1234567654321_P_L738,A
1234_1234567654321_P_L738,D
1234_1234567654321_P_L821,A
1234_1234567654321_P_L821,D
1234_1234567654321_P_R209,A
1234_1234567654321_P_R209,D
1234_7654321234567_P_L720,A
1234_7654321234567_P_L720,D
1234_7654321234567_P_L738,A
1234_7654321234567_P_L738,D
1234_7654321234567_P_L821,D
1234_7654321234567_P_R209,A
1234_7654321234567_P_R209,D
;
data want;
  set have;
  by key;
  if first.key=1 and last.key=1;
run;

Alternatively, you can use a condition more analogous to @PGStats's suggestion:

data want;
  merge have (where=(add_delete='A') in=ina)
        have (where=(add_delete='D') in=ind);
  by key;
  /* subsetting IF, because IN= variables are not available to a WHERE statement */
  if ina=0 or ind=0;
run;

which just says to keep those KEYs for which either A never appears or D never appears.

For large datasets, this may be faster than the SQL solution because it only compares contiguous records for matching KEYs. But again, it requires the data to be sorted by KEY.
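
If the data are not already sorted, or a key can carry repeated A or D rows, one possible variation is a sort followed by a double DO-UNTIL ("DoW") loop. This is only a sketch: the names HAVE_SORTED, HAS_A and HAS_D are made up for illustration, and it assumes add_delete only ever holds 'A' or 'D'.

```sas
proc sort data=have out=have_sorted;
  by key;
run;

data want;
  /* first pass over the BY group: note whether A and D each occur */
  do until (last.key);
    set have_sorted;
    by key;
    if add_delete='A' then has_a=1;
    else if add_delete='D' then has_d=1;
  end;
  /* second pass over the same BY group: output its rows unless both occurred */
  do until (last.key);
    set have_sorted;
    by key;
    if not (has_a and has_d) then output;
  end;
  drop has_a has_d;
run;
```

HAS_A and HAS_D are not retained, so they are reset to missing at the start of each DATA step iteration, which here corresponds to one KEY group. A missing value counts as false in the final test.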

kb011235 · Posted 08-03-2021 11:59 AM

My first try was to use a similar datastep which didn't work:

data want;
  set have;
  by key;
  if first.add_delete='A' and last.add_delete='D' then delete;
run;

Can you help me understand what's wrong with this setup?

Cynthia_sas · Posted 08-03-2021 12:55 PM

Hi:
The problem with this logic is that FIRST.byvar and LAST.byvar are numeric flags whose values are only ever 0 or 1. Your BY statement is BY KEY; so your program creates FIRST.KEY and LAST.KEY. But you have coded FIRST.ADD_DELETE and LAST.ADD_DELETE and compared them with 'A' and 'D', which should result in messages in the log something like: "NOTE: Invalid numeric data", because FIRST. and LAST. variables are numeric but you are testing them against character values.
And if you intend to use FIRST. and LAST. variables in your DATA step program, every variable whose FIRST. or LAST. flag you test must be listed in the BY statement.
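
One way to see these flags for yourself is to write them to the log. The step below is only a diagnostic sketch, assuming HAVE is sorted by KEY and then by ADD_DELETE (the sample data happen to be in that order). It lists both variables in the BY statement, so all four flags exist and can be printed.

```sas
data _null_;
  set have;
  by key add_delete;
  /* every flag printed here is a numeric 0 or 1, never 'A' or 'D' */
  put key= add_delete=
      first.key= last.key=
      first.add_delete= last.add_delete=;
run;
```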
Cynthia

Ksharp · Posted 07-30-2021 09:18 AM

data have ;
infile datalines dsd;
length key $50 add_delete $1;
input key add_delete;
datalines;
1234_1234567654321_P_L720,A
1234_1234567654321_P_L738,A
1234_1234567654321_P_L738,D
1234_1234567654321_P_L821,A
1234_1234567654321_P_L821,D
1234_1234567654321_P_R209,A
1234_1234567654321_P_R209,D
1234_7654321234567_P_L720,A
1234_7654321234567_P_L720,D
1234_7654321234567_P_L738,A
1234_7654321234567_P_L738,D
1234_7654321234567_P_L821,D
1234_7654321234567_P_R209,A
1234_7654321234567_P_R209,D
;

proc sql;
select 
    * 
from have as a
group by key
having count(distinct add_delete)=1;
quit;
