DATA Step, Macro, Functions and more

SUPPRESS IF THE ENTIRE OBS IS THE SAME

Regular Contributor
Posts: 229


Hi, in some observations of my data the value is just one character repeated, and I want to suppress those records. I don't know the length in advance; it may be 3 or 100. How can I do this? For example:

data Test;
input id $ 1-100;
cards;
AAA VVVVVVVVVVVVVVVVVVVVVV EEEEEEEEEEEEEEEEEEEEEEEEEE RTYUY QWEPO ZZZZZZZZZZZZZ KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK KOOL GONE
RUN;

Output should be: RTYUY QWEPO KOOL GONE

Regular Contributor
Posts: 229


You can refer to the attached text file for the sample data.

Attachment
Super User
Posts: 11,338


Are the unwanted strings of repeated characters ALWAYS separated by spaces?

Will the repeated characters ALWAYS be the same character within a repeat group? (Will never have a group like AAAAAZZZZZ that is unwanted.)

Respected Advisor
Posts: 3,799


if missing(compress(id,first(id))) then delete;
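
A small sketch of why this one-liner works, using two made-up values rather than the data set from the question: FIRST returns the first character, COMPRESS strips every copy of it, and MISSING is true when only blanks remain. The variable names first_char, leftover and all_same are illustrative only.

```sas
data _null_;
  length id $ 100;
  do id = 'VVVVVV', 'KOOL';
    first_char = first(id);                 /* first character of the value */
    leftover   = compress(id, first_char);  /* value with that character removed */
    all_same   = missing(leftover);         /* 1 when nothing but blanks is left */
    put id= first_char= leftover= all_same=;
  end;
run;
```

Note that a completely blank id would also be deleted by this test, since removing blanks from a blank string leaves nothing.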

Regular Contributor
Posts: 229


Posted in reply to data_null__

Thanks, it worked.

Respected Advisor
Posts: 3,156


Try this:

data Test;
input id:$100.;
cards;
AAA
VVVVVVVVVVVVVVVVVVVVVV
EEEEEEEEEEEEEEEEEEEEEEEEEE
RTYUY
QWEPO
ZZZZZZZZZZZZZ
KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK
KOOL
GONE
;

data want;
set test;
/* delete the row when nothing is left after removing every copy of its first character */
if lengthn(compress(id,first(id)))=0 then delete;
run;

proc print;RUN;

Haikuo

Regular Contributor
Posts: 229


Thanks, it's working.

Super User
Posts: 11,338


Your example input does not make the string long enough to read the example data.

Respected Advisor
Posts: 4,919


How about:

data Test;
  input;
  /* remove every word that is one character repeated 3+ times, then collapse the leftover blanks */
  s = compbl(prxchange("s/\b(\w)\1{2,}\b//o", -1, _infile_));
  put s;
datalines;
AAA VVVVVVVVVVVVVVVVVVVVVV EEEEEEEEEEEEEEEEEEEEEEEEEE RTYUY QWEPO ZZZZZZZZZZZZZ KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK KOOL GONE
;

It removes any mono-character word of length 3 or more. You could use the pattern "s/\b([[:alpha:]])\1{2,}\b//o" to remove only alphabetic mono character words.
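
To make the difference between the two patterns concrete, here is a minimal sketch that runs both against a made-up line containing digit runs (the sample text and the variable names line, any_char and alpha_only are assumptions, not from the question). Since \w also matches digits and underscores, the first pattern should drop 111 and 2222 as well as AAA, while the [[:alpha:]] version should leave the digit runs alone.

```sas
data _null_;
  length line any_char alpha_only $ 200;
  line = "111 AAA RTYUY 2222 KOOL";   /* hypothetical sample line */

  /* \w matches letters, digits and underscore */
  any_char   = strip(compbl(prxchange("s/\b(\w)\1{2,}\b//", -1, line)));

  /* [[:alpha:]] restricts the repeated character to letters */
  alpha_only = strip(compbl(prxchange("s/\b([[:alpha:]])\1{2,}\b//", -1, line)));

  put any_char= / alpha_only=;
run;
```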

PG

Super User
Posts: 5,496


More questions.

Do you impose a minimum length to suppress?  (Could a string only 2 characters long be suppressed?)

Do you maintain a list of exceptions?  In your example, AAA might be a legitimate string for some applications.

Is each word on a separate line, or are multiple words on the same line of data?

Super User
Posts: 11,338


Posted in reply to Astounding

This is a brute-force method that works for your example. Caveats: single characters will be eliminated. A check for length could be added to execute the line with COMPRESS only if the length is greater than the minimum acceptable duplication. Also, case is not taken into account: if AaA is supposed to be removed, it won't be unless UPCASE is applied.

The array size is arbitrary, but 1) it needs enough elements to hold every word on a line, and 2) each element needs to be long enough to contain the longest repeat string.

 

data Test;
input id $ 1-127;
array t {100} $ 100 _t1-_t100;
do i=1 to (countw(id));
  /* copy the i-th word of the line into its own array element */
  t[i] = scan(id,i);
  /* a word made of one repeated character compresses to blank, so CATX drops it below */
  if compress(t[i],first(t[i])) = '' then t[i]=compress(t[i],first(t[i]));
end;
outstr = catx(' ', of _t1-_t100); /* this is the hopefully desired output string */
drop _t: i;
cards;
AAA VVVVVVVVVVVVVVVVVVVVVV EEEEEEEEEEEEEEEEEEEEEEEEEE RTYUY QWEPO ZZZZZZZZZZZZZ KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK KOOL GONE
;
RUN ;
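
The two caveats above (minimum length and case) could be handled along these lines. This is only a sketch on made-up words: min_len is a hypothetical threshold, is_repeat is an illustrative name, and the 'i' modifier of COMPRESS is used here in place of UPCASE to ignore case.

```sas
data _null_;
  length word $ 100;
  min_len = 3;   /* hypothetical: shortest run that counts as a repeat */
  do word = 'A', 'AaA', 'ZZZZ', 'KOOL';
    /* flag the word only if it is long enough and nothing remains after
       removing its first character, ignoring case */
    is_repeat = lengthn(word) >= min_len
                and lengthn(compress(word, first(word), 'i')) = 0;
    put word= is_repeat=;
  end;
run;
```

With a threshold of 3, the single character A would be kept while AaA and ZZZZ would be flagged for removal.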

N/A
Posts: 1


I would use Perl Regular Expressions and do it like this:

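This code will eliminate runs of a single letter A-Z (ASCII codes 65 to 90; the i modifier makes the match case-insensitive, so a-z are covered as well) if the letter appears at least two times in sequence and the run is surrounded by so-called word boundaries. Other characters can be added, of course!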

data Test;
  infile cards;
  input;
  s = _infile_;
  /* one pass per letter: byte(65) to byte(90) are A to Z on ASCII systems */
  do i=65 to 90;
    s = prxchange(cats("s/\b", byte(i),"{2,}\b//i"), -1, s);
  end;
  /* collapse the gaps left behind by the removed words */
  s = strip(compbl(s));
cards;
AAA VVVVVVVVVVVVVVVVVVVVVV EEEEEEEEEEEEEEEEEEEEEEEEEE RTYUY QWEPO ZZZZZZZZZZZZZ KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK KOOL GONE RUN
;
RUN ;
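
To see what the loop actually hands to PRXCHANGE, the pattern strings can be printed on their own. This is a minimal sketch limited to the first three letters; the variable name pattern is illustrative, and the letters shown assume an ASCII session.

```sas
data _null_;
  do i = 65 to 67;
    /* same expression as in the loop above, stored instead of applied */
    pattern = cats("s/\b", byte(i), "{2,}\b//i");
    put pattern=;
  end;
run;
```

Changing {2,} to {3,} in the CATS call would raise the minimum run length to three, so that two-letter words such as AA are left untouched.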

Kind Regards

Thomas
