Solved: De-dup a list of values stored in a variable separated by commas

buechler66 · Posted 11-20-2018 05:14 PM

Hi. I'm wondering if there's a way to de-dup a list of values stored in a variable. There could be duplicates in the middle or end of the string and there could be multiple duplicates within the string. Any ideas for doing this? I'd really appreciate the help. Thanks.

HAVE_ID_NUM
1033112065, 1629420120, 1033112065, 1629420120, 1033112065, 1033112011, 1629420120

WANT_ID_NUM

1033112065, 1629420120, 1033112011

Astounding · Posted 11-20-2018 05:40 PM

There isn't a tool that does this exactly, but you can program it:

data want;

set have;

length want_id_num $ 200 one_id $ 10;

do k=1 to countw(have_id_num);

one_id = scan(have_id_num, k, ', ');

if index(have_id_num, strip(one_id))=0 then have_id_num = strip(have_id_num) || ', ' || one_id;

end;

run;

The values in the LENGTH statement are guesses ... you have to know a little about your data to provide the right lengths.

It's untested code, so might need a small amount of tweaking.

View solution in original post

Astounding · Posted 11-20-2018 05:40 PM

There isn't a tool that does this exactly, but you can program it:

data want;

set have;

length want_id_num $ 200 one_id $ 10;

do k=1 to countw(have_id_num);

one_id = scan(have_id_num, k, ', ');

if index(have_id_num, strip(one_id))=0 then have_id_num = strip(have_id_num) || ', ' || one_id;

end;

run;

The values in the LENGTH statement are guesses ... you have to know a little about your data to provide the right lengths.

It's untested code, so might need a small amount of tweaking.

Ksharp · Posted 11-21-2018 07:46 AM

@Astounding

I think better use INDEXW() or FINDW() for this kind of scenario .

123456,123,

Astounding · Posted 11-21-2018 10:13 AM

Agreed. I saw all equal length values in the post, and didn't consider other possibilities.

novinosrin · Posted 11-20-2018 06:13 PM

same approach but i find catx is convenient

data have;
infile cards truncover;
input HAVE_ID_NUM $200.;
cards;
1033112065, 1629420120, 1033112065, 1629420120, 1033112065, 1033112011, 1629420120
;

data want;
set have;
length want $200;
do _n_=1 to countw(HAVE_ID_NUM,',');
temp=scan(HAVE_ID_NUM,_n_,',');
if index((want),strip(temp))=0 then want=catx(',',want,temp);
end;
drop temp;
run;

Astounding · Posted 11-20-2018 06:34 PM

Yes, I started with CATX, but found it difficult to include the space after the comma. There might be a way, but it was easier to switch away from CATX. (Of course, the space after the comma might not be a rigid requirement.)

novinosrin · Posted 11-20-2018 06:45 PM

space after comma

want=1033112065, 1629420120, 1033112011

want=catx(', ',want,temp);

data have;
infile cards truncover;
input HAVE_ID_NUM $200.;
/*do _n_=1 to 10e6;*/
/*output;*/
/*end;*/
cards;
1033112065, 1629420120, 1033112065, 1629420120, 1033112065, 1033112011, 1629420120
;

data want;
set have;
length want $200;
do _n_=1 to countw(HAVE_ID_NUM,',');
temp=scan(HAVE_ID_NUM,_n_,',');
if index((want),strip(temp))=0 then want=catx(', ',want,temp);
end;
drop temp;
run;


data want_array_method;
set have;
array t(100)$10 _temporary_ ;
length want $200;
do _n_=1 to countw(HAVE_ID_NUM,',');
temp=strip(scan(HAVE_ID_NUM,_n_,','));
if strip(temp) not in t then do;
t(_n_)=temp;
end;
end;
want=catx(', ', of t(*));
drop temp;
run;

novinosrin · Posted 11-20-2018 06:50 PM

On 10 million records on my machine

Log report

1055 data have;
1056 infile cards truncover;
1057 input HAVE_ID_NUM $200.;
1058 do _n_=1 to 10e6;
1059 output;
1060 end;
1061 cards;

NOTE: The data set WORK.HAVE has 10000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 4.53 seconds
cpu time 1.81 seconds

1063 ;
1064
1065 data want;
1066 set have;
1067 length want $200;
1068 do _n_=1 to countw(HAVE_ID_NUM,',');
1069 temp=scan(HAVE_ID_NUM,_n_,',');
1070 if index((want),strip(temp))=0 then want=catx(', ',want,temp);
1071 end;
1072 drop temp;
1073 run;

NOTE: There were 10000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 10000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 1:07.06
cpu time 23.84 seconds

1074
1075
1076 data want_array_method;
1077 set have;
1078 array t(100)$10 _temporary_ ;
1079 length want $200;
1080 do _n_=1 to countw(HAVE_ID_NUM,',');
1081 temp=strip(scan(HAVE_ID_NUM,_n_,','));
1082 if strip(temp) not in t then do;
1083 t(_n_)=temp;
1084 end;
1085 end;
1086 want=catx(', ', of t(*));
1087 drop temp;
1088 run;

NOTE: There were 10000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT_ARRAY_METHOD has 10000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 28.25 seconds
cpu time 23.56 seconds

34reqrwe · Posted 11-20-2018 06:43 PM

different approach which is reliant on the original list being comma separated.

%let have_id_num = 1033112065, 1629420120, 1033112065, 1629420120, 1033112065, 1033112011, 1629420120;

proc sql ;
create table a as
select
&have_id_num
from
sashelp.air (obs=1)
;
quit;

proc transpose data=a out=b;
run;

proc sql noprint;
select distinct col1 format=20. into: WANT_ID_NUM separated by ', '
from b;
quit;

%put WANT_ID_NUM=&WANT_ID_NUM;

buechler66 · Posted 11-20-2018 07:06 PM

Thank you both for the solutions. I really appreciate your time!

De-dup a list of values stored in a variable separated by commas

Re: De-dup a list of values stored in a variable separated by commas

Re: De-dup a list of values stored in a variable separated by commas

Re: De-dup a list of values stored in a variable separated by commas

Re: De-dup a list of values stored in a variable separated by commas

Re: De-dup a list of values stored in a variable separated by commas

Re: De-dup a list of values stored in a variable separated by commas

Re: De-dup a list of values stored in a variable separated by commas

Re: De-dup a list of values stored in a variable separated by commas

Re: De-dup a list of values stored in a variable separated by commas

Re: De-dup a list of values stored in a variable separated by commas

SAS Innovate 2025: Call for Content

Classroom Training Available!