I've learnt about the nodupkey function and it's very useful in removing duplicated observations.
But, is there any way I could overwrite or remove duplicates within the column?
I'm unsure if I have phrased my question correctly, so please see the example below.
Any help would be greatly appreciated.
<ORIGINAL>
ID Pattern
1 AABCCA -> ABCA
2 ABBCC -> ABC
3 ABBBCDC -> ABCDC
4 DDCBA -> DCBA
5 CCC -> C
<WANT>
ID Pattern
1 ABCA
2 ABC
3 ABCDC
4 DCBA
5 C
Like this?
data WANT;
set HAVE;
PATTERN1=prxchange('s/(.)(?=\1)/\2/',-1,PATTERN);
run;
PATTERN | PATTERN1 |
---|---|
AABCCA | ABCA |
ABBCC | ABC |
ABBBCDC | ABCDC |
DDCBA | DCBA |
CCC | C |
Hard to tell what you mean. It looks like you mean that PATTERN is a character variable that has strings that look like "AABCCA -> ABCA" and you want to just keep the part after the last greater than sign.
data want;
set have ;
if index(pattern,'>') then pattern=scan(pattern,-1,'>');
else do;
/* not sure what you want to do here
since all of your examples have the > sign
*/
end;
run;
Thank you Tom.
I'm very sorry, I totally forgot to remove those -> arrow signs.
They are not in the actual data.
<ORIGINAL>
ID Pattern
1 AABCCA
2 ABBCC
3 ABBBCDC
4 DDCBA
5 CCC
<WANT>
ID Pattern
1 ABCA
2 ABC
3 ABCDC
4 DCBA
5 C
Like this?
data WANT;
set HAVE;
PATTERN1=prxchange('s/(.)(?=\1)/\2/',-1,PATTERN);
run;
PATTERN | PATTERN1 |
---|---|
AABCCA | ABCA |
ABBCC | ABC |
ABBBCDC | ABCDC |
DDCBA | DCBA |
CCC | C |
Thank you so much.
I would really appreciate it if you could please briefly tell me what each coding means.
I need to modify it to suit to my other similar data and I'd like to learn about this coding.
However, my SAS level is limited and I don't know what keywords I could use to look up in a help site like this or the guide books.
This syntax (and the syntax in @Ksharp 's answer) uses regular expressions. It is not special to SAS. In fact it started with the PERL language if I am not mistaken.
There are many tutorials on the web. That's how I learnt.
I'll detail @Ksharp 's expression as I prefer it to mine.
s enter substitution mode
/ start of matching expression
(\w) \w means: match any word character, brackets mean: capture the match
\1+ \1 means: match the previous capture group, + means: at least once
/ end of matching expression, start of replacing expression
\1 replace with first capture group
/ end of expression
So we match one character (and capture it) and match more of the same character.
If this was found, we replace with the matched character.
Regular Expressions have been invented in the 1950's and are at the core of the UNIX operating system. Perl just uses its own syntax, different from the POSIX syntax used in UNIX.
Do you know if SAS's first attempt at RegEx (the no longer documented or supported but still available RX functions) was using the posix syntax?
Well, I found this: https://v8doc.sas.com/sashtml/lgref/z0331192.htm, and the patterns used there look not like what I am used to from UNIX shell scripting.
Mmm interesting . Who knows where they got the SAS patterns from....
Thank you.
data have;
input ID Pattern $;
cards;
1 AABCCA
2 ABBCC
3 ABBBCDC
4 DDCBA
5 CCC
;
data want;
set have;
want=prxchange('s/(\w)\1+/\1/',-1,Pattern);
run;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.