Solved: Overwriting or removing within the column?

SAS_AMUH · Posted 04-05-2020 09:48 PM

I've learnt about the nodupkey function and it's very useful in removing duplicated observations.

But, is there any way I could overwrite or remove duplicates within the column?

I'm unsure if I have phrased my question correctly, so please see the example below.

Any help would be greatly appreciated.

ID Pattern

1 AABCCA -> ABCA

2 ABBCC -> ABC

3 ABBBCDC -> ABCDC

4 DDCBA -> DCBA

5 CCC -> C

<WANT>

ID Pattern

1 ABCA

2 ABC

3 ABCDC

4 DCBA

5 C

ChrisNZ · Posted 04-05-2020 10:38 PM

Like this?

data WANT;
  set HAVE;
  PATTERN1=prxchange('s/(.)(?=\1)/\2/',-1,PATTERN);
run;


PATTERN	PATTERN1
AABCCA	ABCA
ABBCC	ABC
ABBBCDC	ABCDC
DDCBA	DCBA
CCC	C

High-Performance SAS Coding - Third Edition

View solution in original post

Tom · Posted 04-05-2020 10:30 PM

Hard to tell what you mean. It looks like you mean that PATTERN is a character variable that has strings that look like "AABCCA -> ABCA" and you want to just keep the part after the last greater than sign.

data want;
  set have ;
  if index(pattern,'>') then pattern=scan(pattern,-1,'>');
  else do;
/* not sure what you want to do here 
  since all of your examples have the > sign
*/
  end;
run;

SAS_AMUH · Posted 04-05-2020 10:38 PM

Thank you Tom.

I'm very sorry, I totally forgot to remove those -> arrow signs.

They are not in the actual data.

ID Pattern

1 AABCCA

2 ABBCC

3 ABBBCDC

4 DDCBA

5 CCC

<WANT>

ID Pattern

1 ABCA

2 ABC

3 ABCDC

4 DCBA

5 C

ChrisNZ · Posted 04-05-2020 10:38 PM

Like this?

data WANT;
  set HAVE;
  PATTERN1=prxchange('s/(.)(?=\1)/\2/',-1,PATTERN);
run;


PATTERN	PATTERN1
AABCCA	ABCA
ABBCC	ABC
ABBBCDC	ABCDC
DDCBA	DCBA
CCC	C

High-Performance SAS Coding - Third Edition

SAS_AMUH · Posted 04-06-2020 03:01 AM

Thank you so much.

I would really appreciate it if you could please briefly tell me what each coding means.

I need to modify it to suit to my other similar data and I'd like to learn about this coding.

However, my SAS level is limited and I don't know what keywords I could use to look up in a help site like this or the guide books.

ChrisNZ · Posted 04-06-2020 07:06 AM

This syntax (and the syntax in @Ksharp 's answer) uses regular expressions. It is not special to SAS. In fact it started with the PERL language if I am not mistaken.

There are many tutorials on the web. That's how I learnt.

I'll detail @Ksharp 's expression as I prefer it to mine.

s          enter substitution mode
/          start of matching expression
(\w)       \w means: match any word character, brackets mean: capture the match 
\1+        \1 means: match the previous capture group, + means: at least once
/          end of matching expression, start of replacing expression
\1         replace with first capture group
/          end of expression

So we match one character (and capture it) and match more of the same character.

If this was found, we replace with the matched character.

High-Performance SAS Coding - Third Edition

Kurt_Bremser · Posted 04-06-2020 09:11 AM

Regular Expressions have been invented in the 1950's and are at the core of the UNIX operating system. Perl just uses its own syntax, different from the POSIX syntax used in UNIX.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ChrisNZ · Posted 04-06-2020 06:53 PM

@Kurt_Bremser

Do you know if SAS's first attempt at RegEx (the no longer documented or supported but still available RX functions) was using the posix syntax?

High-Performance SAS Coding - Third Edition

Kurt_Bremser · Posted 04-07-2020 04:59 AM

Well, I found this: https://v8doc.sas.com/sashtml/lgref/z0331192.htm, and the patterns used there look not like what I am used to from UNIX shell scripting.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ChrisNZ · Posted 04-07-2020 07:33 AM

Mmm interesting . Who knows where they got the SAS patterns from....

Thank you.

High-Performance SAS Coding - Third Edition

Ksharp · Posted 04-06-2020 06:46 AM

data have;
input ID Pattern $;
cards;
1 AABCCA 
2 ABBCC 
3 ABBBCDC 
4 DDCBA 
5 CCC 
;
data want;
 set have;
 want=prxchange('s/(\w)\1+/\1/',-1,Pattern);
run;

Overwriting or removing within the column?

Re: Overwriting or removing within the column?

Re: Overwriting or removing within the column?

Re: Overwriting or removing within the column?

Re: Overwriting or removing within the column?

Re: Overwriting or removing within the column?

Re: Overwriting or removing within the column?

Re: Overwriting or removing within the column?

Re: Overwriting or removing within the column?

Re: Overwriting or removing within the column?

Re: Overwriting or removing within the column?

Re: Overwriting or removing within the column?

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away