BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
buechler66
Barite | Level 11

Hi.  I'm wondering if there's a way to de-dup a list of values stored in a variable.  There could be duplicates in the middle or end of the string and there could be multiple duplicates within the string.  Any ideas for doing this? I'd really appreciate the help.  Thanks.

 

HAVE_ID_NUM
1033112065, 1629420120, 1033112065, 1629420120, 1033112065, 1033112011, 1629420120

 

WANT_ID_NUM

1033112065, 1629420120, 1033112011

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

There isn't a tool that does this exactly, but you can program it:

 

data want;

set have;

length want_id_num $ 200  one_id $ 10;

do k=1 to countw(have_id_num);

   one_id = scan(have_id_num, k, ', ');

   if index(have_id_num, strip(one_id))=0 then have_id_num = strip(have_id_num) || ', ' || one_id;

end;

run;

 

The values in the LENGTH statement are guesses ... you have to know a little about your data to provide the right lengths.

 

It's untested code, so might need a small amount of tweaking.

View solution in original post

9 REPLIES 9
Astounding
PROC Star

There isn't a tool that does this exactly, but you can program it:

 

data want;

set have;

length want_id_num $ 200  one_id $ 10;

do k=1 to countw(have_id_num);

   one_id = scan(have_id_num, k, ', ');

   if index(have_id_num, strip(one_id))=0 then have_id_num = strip(have_id_num) || ', ' || one_id;

end;

run;

 

The values in the LENGTH statement are guesses ... you have to know a little about your data to provide the right lengths.

 

It's untested code, so might need a small amount of tweaking.

Ksharp
Super User

@Astounding

I think better use INDEXW() or FINDW()  for this kind of scenario .

 

123456,123,

Astounding
PROC Star

Agreed.  I saw all equal length values in the post, and didn't consider other possibilities.

novinosrin
Tourmaline | Level 20

same approach but i find catx is convenient

 

data have;
infile cards truncover;
input HAVE_ID_NUM $200.;
cards;
1033112065, 1629420120, 1033112065, 1629420120, 1033112065, 1033112011, 1629420120
;

data want;
set have;
length want $200;
do _n_=1 to countw(HAVE_ID_NUM,',');
temp=scan(HAVE_ID_NUM,_n_,',');
if index((want),strip(temp))=0 then want=catx(',',want,temp);
end;
drop temp;
run;
Astounding
PROC Star

Yes, I started with CATX, but found it difficult to include the space after the comma.  There might be a way, but it was easier to switch away from CATX.  (Of course, the space after the comma might not be a rigid requirement.)

novinosrin
Tourmaline | Level 20

space after comma

want=1033112065, 1629420120, 1033112011

 

 

want=catx(', ',want,temp);
data have;
infile cards truncover;
input HAVE_ID_NUM $200.;
/*do _n_=1 to 10e6;*/
/*output;*/
/*end;*/
cards;
1033112065, 1629420120, 1033112065, 1629420120, 1033112065, 1033112011, 1629420120
;

data want;
set have;
length want $200;
do _n_=1 to countw(HAVE_ID_NUM,',');
temp=scan(HAVE_ID_NUM,_n_,',');
if index((want),strip(temp))=0 then want=catx(', ',want,temp);
end;
drop temp;
run;


data want_array_method;
set have;
array t(100)$10 _temporary_ ;
length want $200;
do _n_=1 to countw(HAVE_ID_NUM,',');
temp=strip(scan(HAVE_ID_NUM,_n_,','));
if strip(temp) not in t then do;
t(_n_)=temp;
end;
end;
want=catx(', ', of t(*));
drop temp;
run;

 

novinosrin
Tourmaline | Level 20

On 10 million records on my machine

 

Log report

1055 data have;
1056 infile cards truncover;
1057 input HAVE_ID_NUM $200.;
1058 do _n_=1 to 10e6;
1059 output;
1060 end;
1061 cards;

NOTE: The data set WORK.HAVE has 10000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 4.53 seconds
cpu time 1.81 seconds


1063 ;
1064
1065 data want;
1066 set have;
1067 length want $200;
1068 do _n_=1 to countw(HAVE_ID_NUM,',');
1069 temp=scan(HAVE_ID_NUM,_n_,',');
1070 if index((want),strip(temp))=0 then want=catx(', ',want,temp);
1071 end;
1072 drop temp;
1073 run;

NOTE: There were 10000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 10000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 1:07.06
cpu time 23.84 seconds


1074
1075
1076 data want_array_method;
1077 set have;
1078 array t(100)$10 _temporary_ ;
1079 length want $200;
1080 do _n_=1 to countw(HAVE_ID_NUM,',');
1081 temp=strip(scan(HAVE_ID_NUM,_n_,','));
1082 if strip(temp) not in t then do;
1083 t(_n_)=temp;
1084 end;
1085 end;
1086 want=catx(', ', of t(*));
1087 drop temp;
1088 run;

NOTE: There were 10000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT_ARRAY_METHOD has 10000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 28.25 seconds
cpu time 23.56 seconds

 

34reqrwe
Quartz | Level 8

different approach which is reliant on the original list being comma separated. 

 

%let have_id_num = 1033112065, 1629420120, 1033112065, 1629420120, 1033112065, 1033112011, 1629420120;

proc sql ;
create table a as
select
&have_id_num
from
sashelp.air (obs=1)
;
quit;

proc transpose data=a out=b;
run;

proc sql noprint;
select distinct col1 format=20. into: WANT_ID_NUM separated by ', '
from b;
quit;

%put WANT_ID_NUM=&WANT_ID_NUM;

buechler66
Barite | Level 11
Thank you both for the solutions. I really appreciate your time!

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 753 views
  • 1 like
  • 5 in conversation