topic Re: Remove similarities in SAS Programming

Remove similarities

Nicholasamh — Mon, 25 Nov 2019 07:35:07 GMT

Hi, I am new to SAS code. I have like to remove those data that are consist of the same words to each other based on ID. Really appreciate someone can help.
Example:
ID Transport
1 Singapore Car A
1 Car A
1 UK
1 UK Van CA
2 Japan Airplane GA
2 Airplane GA
.......
Expected outcome:
ID Transport
1 Singapore Car A
1 UK Van CA
2 Japan Airplane GA

Re: Remove similarities

Patrick — Mon, 25 Nov 2019 10:26:55 GMT

Welcome to the SAS communities.

Your problem is not trivial and requires a bit more than novice level coding. I couldn't think of something simpler than below two coding options (one datastep, one SQL).

data have;
  infile datalines truncover;
  input Group_ID Transport $40.;
  datalines;
1 Singapore Car A
1 Car A
1 UK
1 UK Van CA
2 Japan Airplane GA
2 Airplane GA
;

/* SAS datastep using Hash lookup */
data want(drop=_:);
  if _n_=1 then
    do;
      if 0 then set have(rename=(Transport=_Transport));
      dcl hash h1(dataset:'have(rename=(Transport=_Transport))', multidata:'y');
      h1.defineKey('Group_ID');
      h1.defineData('_Transport');
      h1.defineDone();
    end;
  set have;

  do while(h1.do_over() = 0);
    if Transport=_Transport then continue;
    if find(_Transport,Transport,'it')>0 then
      do;
        delete;
      end;
  end;
run;

proc print data=want;
run;

/* SQL using EXIST clause */
proc sql feedback;
/*  create table want2 as*/
    select 
      o.*
    from have o
    where not exists
      (
        select *
        from have i
        where
          i.group_id=o.group_id
          and i.transport ne o.transport
          and find(i.Transport,o.Transport,'it')>0
      )
    ;
quit;

If you don't understand the code then search in the SAS docu what you find about the Hash object and the SQL EXIST clause.

If you're new to SQL then not sure which of the two options is easier to understand for you. If you've got already some familiarity with SQL then it's likely the EXIST clause.

Re: Remove similarities

PeterClemmensen — Mon, 25 Nov 2019 11:32:37 GMT

@Nicholasamh Welcome to the SAS Community! 🙂

data have;
input ID Transport $ 3-103;
infile datalines truncover;
datalines;
1 Singapore Car A
1 Car A
1 UK
1 UK Van CA
2 Airplane GA
2 Japan Airplane GA
;

data want(drop=i j a);
    array _{9999} $ 100 _temporary_;
    
    do i=1 by 1 until (last.id);
        set have;
        by id;
        _[i]=Transport;
    end;

    do until (last.id);
        set have;
        by id;
        a=0;
        do j=1 to dim(_);
           a=max(a, (find(_[j], transport, 'it') & _[j] ne transport));
           if j=dim(_) & a=0 then output;
        end;
    end;
run;

Result:

ID  Transport
1   Singapore Car A
1   UK Van CA
2   Japan Airplane GA

Re: Remove similarities

s_lassen — Mon, 25 Nov 2019 10:10:13 GMT

The simplest solution is probably something like this:

proc sql;
  create table want as 
  select * from have base where not exists(
    select * from have where id=base.id and Transport ne base.Transport 
    and indexw(Transport,trim(base.transport))>0
    )
  ;

Re: Remove similarities

Ksharp — Mon, 25 Nov 2019 11:31:36 GMT

data have;
input ID Transport $ 3-103;
infile datalines truncover;
datalines;
1 Singapore Car A
1 Car A
1 UK
1 UK Van CA
2 Airplane GA
2 Japan Airplane GA
;
proc sql;
create table want as
select distinct a.*
 from have as a,have as b
  where a.id=b.id and a.Transport ne b.Transport 
and a.Transport contains strip(b.Transport);
quit;

Re: Remove similarities

Nicholasamh — Tue, 26 Nov 2019 05:36:31 GMT

Thanks for all the great solutions