topic How do I remove duplicate entries in a string leaving only unique entries in New SAS User

How do I remove duplicate entries in a string leaving only unique entries

Tracy_Bis — Sun, 05 Jul 2020 04:51:03 GMT

I have a data set (.cvs), that I use on a daily basis (includes about 60-75 variables). The source of this data set has gone through some upgrades, and now automatically pulls data into the system. It will not override data, so the system just adds a new iteration (to those questions that allow it) even if it is the exact same data that is already in there. So for each person (row) some of my variables have a string of data separated by commas. I have attached an example excel spreadsheet, showing the data I have and what I want (I work with health data, so this is an example of what my data set looks like). Now, if the unique variables cannot be separated into their own columns, I am okay with having them in the same column.

My SAS skills (SAS 9.4) are fairly minimal (proc freqs, some simple SQL, merging, concatenation, sorting, pulling first obs of a string, etc.)

I have searched on line and tried the following:

data want;
set have;
Want_date=scan(Have_Date, 1, ' ');
do i=2 to countw(Have_Date,' ');
word=scan(Have_Date, i, ' ');
found=find(Want_date, word, 'it');
if found=0 then Want_date=catx(' ', Want_date, word);
end;
run;

This did not work. But to be honest I wasn't sure what this was exactly doing, so I may not have done it right.

When that didn't work I also found this code online and tried it:

data want;
if not prxId then prxId + prxParse("s/\b(\w{2,})\b(.*)\b(\1\s*)\b/\1\2/io");
set have;
Want_date= Have_Date;
do i = 1 to 100 until (times=0);
call prxChange(prxId,1,Want_date,Want_date,len,trunc,times);
end;
drop i prxId len trunc times;
run;

This was close, I ended up with a single date but it was followed by all the commas and the slash marks from the other dates in the field. (example: 05/30/2019,//,//,//,//,//,//,//,//). Again, it could be because I am not sure what the code is all doing, so I wasn't able to correct it to remove the commas and slash marks.

I just sent this one example but I have multiple columns that all have this same problem. So any help would be appreciated.

Thank you

Re: How do I remove duplicate entries in a string leaving only unique entries

Shmuel — Sun, 05 Jul 2020 05:41:47 GMT

So in your input are mainly 3 variables (I omit prefix HAVE_ ) :

person_id, date and store.

Try next code:

data temp(keep=person_ID datex store);
 set have(rename=(have_person_id = person_ID));
       i=1;
      /* assuming number of date = number of stores in strings */
      do until (missing(date));
           datex = scan(have_date,i,',');
           store = scan(have_store,i,',');
           output;
           i+1;
      end;
run;
proc sort data=temp out=temp1 nodupkey;
  by person_id datex;
run;

You can use the result dataset as is for any analysis you need, without replacing the long format to wide format, as shown in wanted output.

If you insist to have the wide format use PROC TRANSPOSE.

Re: How do I remove duplicate entries in a string leaving only unique entries

ed_sas_member — Sun, 05 Jul 2020 08:09:38 GMT

Hi @Tracy_Bis

Here is an attempt to achieve this:

data have;
	infile datalines dlm="09"x;
	input Have_PersonID	$ Have_Date:$200. Have_Store:$200.;
	datalines;
123	04/12/2019,04/12/2019,04/12/2019,04/12/2019,04/12/2019,04/12/2019,04/12/2019,04/12/2019	Tracy's Store,Tracy's Store,Tracy's Store,Tracy's Store,Tracy's Store,Tracy's Store,Tracy's Store,Tracy's Store
457	07/05/2019,07/05/2019,07/05/2019,07/05/2019,07/05/2019,07/05/2019,07/05/2019,07/05/2019,07/05/2019,07/05/2019,07/05/2019,07/05/2019,07/05/2019,07/05/2019	,,,,,,,,,,,,,,
789	03/19/2019,03/19/2019,03/19/2019,03/19/2019,03/19/2019,03/19/2019,03/19/2019,03/19/2019,03/19/2019	,,,,,,,,,
101	05/18/2019,05/18/2019	Mickeys,Mickeys
234	08/02/2019,08/02/2019,08/02/2019,08/02/2019,08/02/2019,08/02/2019,08/02/2019,08/02/2019,08/02/2019,08/02/2019	Donalds, ND,Donalds, ND,Donalds, ND,Donalds, ND,Donalds, ND,Donalds, ND,Donalds, ND,Donalds, ND,Donalds, ND,Donalds, ND
557	05/19/2019,05/19/2019,05/19/2019,05/19/2019,05/30/2019,05/30/2019,05/30/2019	Daisy's, TX,Daisy's, TX,Daisy's, TX,Daisy's, TX,Pluto, MT,Pluto, MT,Pluto, MT
891	06/21/2019,06/21/2019,06/21/2019,07/10/2019	Hewy, Dewy, And Lewy,Hewy, Dewy, And Lewy,Hewy, Dewy, And Lewy,Goofy, Daffy, And Minnie
;
run;

%macro _mymac(param);
	data _have_&param. (keep= Have_PersonID have_&param._i i);
		set have;
		do i=1 to countw(Have_&param.);
			have_&param._i = scan(Have_&param.,i,",");
			if not missing(have_&param._i) then output _have_&param.;
		end;
	run;
	
	proc sort data=_have_&param. nodupkey;
		by Have_PersonID have_&param._i;
	run;
	
	proc sort data=_have_&param.;
		by Have_PersonID i;
	run;
	proc transpose data=_have_&param. out=_have_&param._tr (drop=_:);
		var have_&param._i;
		by Have_PersonID;
	run;
	
	data _have_&param._final;
		set _have_&param._tr;
		length have_&param._i $ 200;
		want_&param. = catx(", ", of col:);
		keep Have_PersonID want_&param.;
	run;
%mend;

%_mymac(date)
%_mymac(store)

data want;
	merge _have_date_final _have_store_final;
	by Have_PersonID;
run;

Output:

All the best,