Re: delete the dup that is in a group

podarum · Posted 11-25-2011 08:55 PM

Hi, I tried few procedures btu can't quite get it.

I have a bunch of dup and I would like to delete the dup that is NOT in one of the list in another dataset or specified in the proc.

Example:

Have:

PS CIT TWN

L1T Tor AJX

L1T Tor OTT

L1T Tor SHE

M1P Tor YYZ

M1P Tor ZHE

I want to keep the dup that TWN is in a list of (AJX, YYZ).. the rest delete. (the dups must be of PS and CIT obviously)

So want:

PS CIT TWN

L1T Tor AJX

M1P Tor YYZ

Thanks

Ksharp · Posted 11-25-2011 09:21 PM

It would be helpful that you can post some more sample data.

data have;
input (PS   CIT  TWN) ($);
cards;
L1T  Tor  AJX
L1T  Tor  OTT
L1T  Tor  SHE
M1P Tor  YYZ
M1P Tor  ZHE
;
run;
proc sql;
 create table want as
  select *
   from have
    group by ps,cit
      having count(*) ge 2 and twn in ('AJX', 'YYZ');
quit;

Ksharp

podarum · Posted 11-26-2011 11:53 PM

I get this Note and 0 observatiosn show up:

NOTE: The query requires remerging summary statistics back with the original data.

art297 · Posted 11-27-2011 10:21 AM

The note is normal and not indicative of any problem. That you get 0 observations, of course, is a significant problem. The data look like city names, forward postal code sorting areas, and airport codes. Are they case sensitive or might some records have all lower or upper case or some other feature that would make it so that expected matches won't be found? You'll have to provide some sample data for which you get the results you mentioned.

podarum · Posted 11-27-2011 01:34 PM

The code worked well, but it also deletes any indipendedn observations that I would want to keep eg. The M1G Tor YYZ or L2E Osh YYZ

Have:

PS CIT TWN

L1T Tor AJX

L1T Tor OTT

L1T Tor SHE

M1P Tor YYZ

M1P Tor ZHE

M1G Tor YYZ

L2E Osh YYZ

art297 · Posted 11-27-2011 01:52 PM

Then I'm confused as that it what the code I originally proposed does. I.e.,

data have;

input (PS CIT TWN) ($);

cards;

L1T Tor AJX

L1T Tor OTT

L1T Tor SHE

M1P Tor YYZ

M1P Tor ZHE

M1G Tor YYZ

L2E Osh YYZ

;

proc sort data=have out=want;

by ps cit twn;

run;

proc sort data=want nodupkey;

by ps cit;

run;

results in a file with 4 records, namely:

ps cit twn

L1T Tor AJX

L2E Osh YYZ

M1G Tor YYZ

M1P Tor YYZ

podarum · Posted 11-27-2011 02:03 PM

You are right if the TWN is always the first alphabetical one to use, but this is just a small example I used to illustarte. But I found a solution..

data want;

set have;

by PS;

retain PS;

if first.PS =last.PS then use 'Y'; else if CIT in ('AJX', 'YYZ') then use = 'Y' else use = 'N'

run;

art297 · Posted 11-27-2011 02:23 PM

That wouldn't work unless you mistyped it it meant to type:

if first.PS =last.PS then use='Y';

however, that wouldn't work as you expect. Given 3 duplicate records, then 2nd record would get accepted as both first.ps and last.ps would equal 0, but would be accepted as "use" since it would meet the condition.

and the rest of the line is missing two more semicolons, none of your cit fields have those values, and the file would still have to be preceded by the correct proc sort.

If it is sorted correctly, then the following would have the same effect as the code I suggested:

data want;

set have;

by PS;

if not((first.PS and last.PS) or twn in ('AJX', 'YYZ'))

then delete;

run;

podarum · Posted 11-27-2011 08:15 PM

Yes I meant to type

if first.PS =last.PS then use='Y'; But this is meant only for the non-duplicate records (or individual records), the dups would fall under else if CIT in ('AJX', 'YYZ') then use = 'Y' else use = 'N' . I also mistyped the 2 semi-colons. You are also right, your code gave me the same results.

art297 · Posted 11-25-2011 09:22 PM

What is your rule for deleting duplicates? It it is first alphabetical, as in your example, then first sort, then sort with nodupkey. E.g.,

proc sort data=have out=want;

by ps cit twn;

run;

proc sort data=want nodupkey;

by ps cit;

run;

podarum · Posted 11-25-2011 10:23 PM

art my rule is if it's not in the list of selected TWN then it must be deleted

art297 · Posted 11-26-2011 12:24 AM

Then I vote for Ksharp's suggested code.

Ksharp · Posted 11-27-2011 09:12 PM

If you also need to keep the group which only contain one obs. Why not directly use WHERE statement?

data want;

set have;

where twn in ('AJX', 'YYZ');

run;

Ksharp

SAS Innovate 2025: Call for Content

Classroom Training Available!