Hi. I'm wondering if there's a way to de-dup a list of values stored in a variable. There could be duplicates in the middle or end of the string and there could be multiple duplicates within the string. Any ideas for doing this? I'd really appreciate the help. Thanks.
HAVE_ID_NUM
1033112065, 1629420120, 1033112065, 1629420120, 1033112065, 1033112011, 1629420120
WANT_ID_NUM
1033112065, 1629420120, 1033112011
There isn't a tool that does this exactly, but you can program it:
data want;
set have;
length want_id_num $ 200 one_id $ 10;
do k=1 to countw(have_id_num);
one_id = scan(have_id_num, k, ', ');
if index(have_id_num, strip(one_id))=0 then have_id_num = strip(have_id_num) || ', ' || one_id;
end;
run;
The values in the LENGTH statement are guesses ... you have to know a little about your data to provide the right lengths.
It's untested code, so might need a small amount of tweaking.
There isn't a tool that does this exactly, but you can program it:
data want;
set have;
length want_id_num $ 200 one_id $ 10;
do k=1 to countw(have_id_num);
one_id = scan(have_id_num, k, ', ');
if index(have_id_num, strip(one_id))=0 then have_id_num = strip(have_id_num) || ', ' || one_id;
end;
run;
The values in the LENGTH statement are guesses ... you have to know a little about your data to provide the right lengths.
It's untested code, so might need a small amount of tweaking.
Agreed. I saw all equal length values in the post, and didn't consider other possibilities.
same approach but i find catx is convenient
data have;
infile cards truncover;
input HAVE_ID_NUM $200.;
cards;
1033112065, 1629420120, 1033112065, 1629420120, 1033112065, 1033112011, 1629420120
;
data want;
set have;
length want $200;
do _n_=1 to countw(HAVE_ID_NUM,',');
temp=scan(HAVE_ID_NUM,_n_,',');
if index((want),strip(temp))=0 then want=catx(',',want,temp);
end;
drop temp;
run;
Yes, I started with CATX, but found it difficult to include the space after the comma. There might be a way, but it was easier to switch away from CATX. (Of course, the space after the comma might not be a rigid requirement.)
space after comma
want=1033112065, 1629420120, 1033112011
want=catx(', ',want,temp);
data have;
infile cards truncover;
input HAVE_ID_NUM $200.;
/*do _n_=1 to 10e6;*/
/*output;*/
/*end;*/
cards;
1033112065, 1629420120, 1033112065, 1629420120, 1033112065, 1033112011, 1629420120
;
data want;
set have;
length want $200;
do _n_=1 to countw(HAVE_ID_NUM,',');
temp=scan(HAVE_ID_NUM,_n_,',');
if index((want),strip(temp))=0 then want=catx(', ',want,temp);
end;
drop temp;
run;
data want_array_method;
set have;
array t(100)$10 _temporary_ ;
length want $200;
do _n_=1 to countw(HAVE_ID_NUM,',');
temp=strip(scan(HAVE_ID_NUM,_n_,','));
if strip(temp) not in t then do;
t(_n_)=temp;
end;
end;
want=catx(', ', of t(*));
drop temp;
run;
On 10 million records on my machine
Log report
1055 data have;
1056 infile cards truncover;
1057 input HAVE_ID_NUM $200.;
1058 do _n_=1 to 10e6;
1059 output;
1060 end;
1061 cards;
NOTE: The data set WORK.HAVE has 10000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 4.53 seconds
cpu time 1.81 seconds
1063 ;
1064
1065 data want;
1066 set have;
1067 length want $200;
1068 do _n_=1 to countw(HAVE_ID_NUM,',');
1069 temp=scan(HAVE_ID_NUM,_n_,',');
1070 if index((want),strip(temp))=0 then want=catx(', ',want,temp);
1071 end;
1072 drop temp;
1073 run;
NOTE: There were 10000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 10000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 1:07.06
cpu time 23.84 seconds
1074
1075
1076 data want_array_method;
1077 set have;
1078 array t(100)$10 _temporary_ ;
1079 length want $200;
1080 do _n_=1 to countw(HAVE_ID_NUM,',');
1081 temp=strip(scan(HAVE_ID_NUM,_n_,','));
1082 if strip(temp) not in t then do;
1083 t(_n_)=temp;
1084 end;
1085 end;
1086 want=catx(', ', of t(*));
1087 drop temp;
1088 run;
NOTE: There were 10000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT_ARRAY_METHOD has 10000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 28.25 seconds
cpu time 23.56 seconds
different approach which is reliant on the original list being comma separated.
%let have_id_num = 1033112065, 1629420120, 1033112065, 1629420120, 1033112065, 1033112011, 1629420120;
proc sql ;
create table a as
select
&have_id_num
from
sashelp.air (obs=1)
;
quit;
proc transpose data=a out=b;
run;
proc sql noprint;
select distinct col1 format=20. into: WANT_ID_NUM separated by ', '
from b;
quit;
%put WANT_ID_NUM=&WANT_ID_NUM;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.