Solved: De-dup across columns within an observation

kpberger · Posted 08-19-2021 07:09 PM

Hello. I would like to eliminate duplicate values within an observation, across many columns. For example, if I have these data:

id	var1	var2	var3	var4	var5
1	49800 Main St.	222 Horg Blvd	498 Main St.	329 Old Ave.	109 E. Chorg St.
2	178 Forg Pl.	781 Noodle Rd.	546 Sunshine Dr.
3	329 Old Ave.	329 Old Ave.	329 Old Ave.	80375 Zorg Ln.
4	3872 Gorg Rd.
5	491 N. Yay Way	2985 Cool St.	172 Harbor	172 Harbor	172 Harbor

I would instead like to have these data:

id	var1	var2	var3	var4	var5
1	49800 Main St.	222 Horg Blvd	329 Old Ave.	109 E. Chorg St.
2	178 Forg Pl.	781 Noodle Rd.	546 Sunshine Dr.
3	329 Old Ave.	80375 Zorg Ln.
4	3872 Gorg Rd.
5	491 N. Yay Way	2985 Cool St.	172 Harbor

keeping the first instance of a value within an observation. Transposing the data set is fine, but there are ~150k observations (and 332 columns) so there would be ~150k columns if transposed, which my computer has trouble with.

I would very much appreciate any help. I have been trying things all day and haven't yet found an approach. Below is code to generate the example data set. Thank you!

data have;
infile datalines delimiter=',';
length var1 $17 var2 $15 var3 $17 var4 $15 var5 $17;
input id var1 var2 var3 var4 var5;
datalines;
1,49800 Main St.,222 Horg Blvd,498 Main St.,329 Old Ave.,109 E. Chorg St.
2,178 Forg Pl.,781 Noodle Rd.,546 Sunshine Dr.,.,.
3,329 Old Ave.,329 Old Ave.,329 Old Ave.,80375 Zorg Ln.,.
4,3872 Gorg Rd.,.,.,.,.
5,491 N. Yay Way,2985 Cool St.,172 Harbor,172 Harbor,172 Harbor
;

Edited for better example data

ballardw · Posted 08-20-2021 01:00 AM

@kpberger wrote:
I do need to retain the relative positions. I edited my post so the example data reflects that, thanks.

The variables are street addresses. "123 Sesame St., Who Knows, State 12345" So all different lengths and all have multiple words. Thanks for any help!

Not the lengths of the values, are the defined lengths of the variables the same? This may have been set when you read the data into SAS. If not, that may be the first step.

In your new example data set you should read all the variables into the same length variable. When Var2 has length 15 and Var3 has length 17 if the value of Var3 has to shift to Var2 you will lose 2 characters. Easy to fix. Better to use

length var1-var5 $ 17 ;

or what ever the longest length any of the variable may have.

This may do what you want.

data want;
   /* force all the variables to the same length
      just in case ,repead Id on this length
      so it still appears as first variable in the set
   */
   length id 8 var1-var5 $ 20; 
   set have;
   array v (*) var1-var5;
   /* set dupes to missing*/
   do i= 1 to (dim(v)-1);
      do j=(i+1) to dim(v);
         if v[i]=v[j] then call missing(v[j]);
      end;
   end;
   /* count missing elements that need to be moved*/
   do i=1 to dim(v);
      misscount= sum(misscount,missing(v[i]));
   end;
   /*push the missings to the end*/

   do k=1 to misscount;
   do i= 1 to (dim(v)-1);
      do j=(i+1) to dim(v);
         if missing (v[i]) then do;
            v[i]=v[i+1];
            call missing(v[i+1]);
         end;;
      end;
   end;
   end;

   drop i j k misscount;

run;

First caveat: no claim to optimal efficiency, there are a lot of comparisons going on.

Second, capitalization differences, example 123 Old St and 123 old st, are considered different. If that is of concern all of the comparisons should use in the set dupes to missing loops to standardize the case such as

if  upcase(v[i]) = upcase (v[j]) then call missing(v[j]);

Third, this will not catch minor spelling differences such as "123 Old St" and "123 Old St." <note the . in the second

spelling or "123 Old St" and "123 Old St", extra space between 123 and Old, Catching those is whole different, ie complicated, process. If you may have such minor steps then the thing to do is standardize elements prior to this removal step.

The general flow is remove the later, higher numbered variable, of equal values. Then determine how many missing we have to shift to the end. The second block of do loops uses that bit of information to repeatedly shift values to the left if multiple missing values were assigned in between values to keep.

View solution in original post

Tom · Posted 08-19-2021 07:40 PM

I don't think you quite understand how to TRANSPOSE.

proc transpose data=have out=tall;
  by id;
  var var1-var5 ;
run;

proc sort nodupkey ;
  by id col1 ;
run;

proc transpose data=tall out=want(drop=_name_) prefix=var;
  by id ;
  var col1;
run;

Of course it might be a lot easier to work with the data if you skipped that last step and left it has multiple observations.

kpberger · Posted 08-19-2021 08:09 PM

This works except I do need to keep the order of the data. This solution reorders things alphabetically. My fault for making the example data alphabetical. I'll change it.

ballardw · Posted 08-19-2021 07:57 PM

Your data implies the result is sorted. Is that actually case or do you need to retain relative positions of values?

If order is not important this becomes quite a bit simpler to code.

Also are the variables all of the same length? If not shifting position can cause truncation issues.

One more question, are the values of the variables single words or are there multiple words within some of the variables?

kpberger · Posted 08-19-2021 08:11 PM

I do need to retain the relative positions. I edited my post so the example data reflects that, thanks.

The variables are street addresses. "123 Sesame St., Who Knows, State 12345" So all different lengths and all have multiple words. Thanks for any help!

Reeza · Posted 08-19-2021 11:16 PM

If order matters use a data step to transpose it instead of PROC TRANSPOSE and add a order counter in that step. Then you can use it to re-order when you re-transpose it if needed.

https://stats.idre.ucla.edu/sas/modules/reshaping-data-long-to-wide-using-the-data-step/

ballardw · Posted 08-20-2021 01:00 AM

@kpberger wrote:
I do need to retain the relative positions. I edited my post so the example data reflects that, thanks.

The variables are street addresses. "123 Sesame St., Who Knows, State 12345" So all different lengths and all have multiple words. Thanks for any help!

Not the lengths of the values, are the defined lengths of the variables the same? This may have been set when you read the data into SAS. If not, that may be the first step.

In your new example data set you should read all the variables into the same length variable. When Var2 has length 15 and Var3 has length 17 if the value of Var3 has to shift to Var2 you will lose 2 characters. Easy to fix. Better to use

length var1-var5 $ 17 ;

or what ever the longest length any of the variable may have.

This may do what you want.

data want;
   /* force all the variables to the same length
      just in case ,repead Id on this length
      so it still appears as first variable in the set
   */
   length id 8 var1-var5 $ 20; 
   set have;
   array v (*) var1-var5;
   /* set dupes to missing*/
   do i= 1 to (dim(v)-1);
      do j=(i+1) to dim(v);
         if v[i]=v[j] then call missing(v[j]);
      end;
   end;
   /* count missing elements that need to be moved*/
   do i=1 to dim(v);
      misscount= sum(misscount,missing(v[i]));
   end;
   /*push the missings to the end*/

   do k=1 to misscount;
   do i= 1 to (dim(v)-1);
      do j=(i+1) to dim(v);
         if missing (v[i]) then do;
            v[i]=v[i+1];
            call missing(v[i+1]);
         end;;
      end;
   end;
   end;

   drop i j k misscount;

run;

First caveat: no claim to optimal efficiency, there are a lot of comparisons going on.

Second, capitalization differences, example 123 Old St and 123 old st, are considered different. If that is of concern all of the comparisons should use in the set dupes to missing loops to standardize the case such as

if  upcase(v[i]) = upcase (v[j]) then call missing(v[j]);

Third, this will not catch minor spelling differences such as "123 Old St" and "123 Old St." <note the . in the second

spelling or "123 Old St" and "123 Old St", extra space between 123 and Old, Catching those is whole different, ie complicated, process. If you may have such minor steps then the thing to do is standardize elements prior to this removal step.

The general flow is remove the later, higher numbered variable, of equal values. Then determine how many missing we have to shift to the end. The second block of do loops uses that bit of information to repeatedly shift values to the left if multiple missing values were assigned in between values to keep.

Astounding · Posted 08-20-2021 02:25 AM

Ballardw, good approach. I think you need to tighten up the logic in the final section to properly handle consecutive blanks.

kpberger · Posted 08-20-2021 05:39 PM

Fantastic, it works great. Thank you very much! I really appreciate you taking the time.

Kurt_Bremser · Posted 08-20-2021 03:09 AM

Working with series is always easier with a long dataset layout, see Maxim 19.

To dedup and keep the order, use a hash object:

proc transpose
  data=have
  out=long
;
by id;
var var:;
run;

data want;
set long;
by id;
if _n_ = 1
then do;
  declare hash h ();
  h.definekey("col1");
  h.definedone();
end;
if first.id then rc = h.clear();
if h.find() = 0
then delete;
else rc = h.add();
run;

Untested, posted from my tablet.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ksharp · Posted 08-21-2021 06:33 AM

data have;
infile datalines delimiter=',';
length var1 $17 var2 $15 var3 $17 var4 $15 var5 $17;
input id var1 var2 var3 var4 var5;
datalines;
1,498 Main St.,222 Horg Blvd,498 Main St.,329 Old Ave.,109 E. Chorg St.
2,178 Forg Pl.,781 Noodle Rd.,546 Sunshine Dr.,.,.
3,329 Old Ave.,329 Old Ave.,329 Old Ave.,80375 Zorg Ln.,.
4,3872 Gorg Rd.,.,.,.,.
5,491 N. Yay Way,2985 Cool St.,172 Harbor,172 Harbor,172 Harbor
;

data want;
 set have;
 array x{*} $ var1-var5;
 array y{*} $ 80 new1-new5;
n=0;
do i=1 to dim(x);
 if x{i} not in y then do;n+1;y{n}=x{i};end;
end;
drop i n var1-var5 ;
run;

De-dup across columns within an observation

Re: De-dup across columns within an observation

Re: De-dup across columns within an observation

Re: De-dup across columns within an observation

Re: De-dup across columns within an observation

Re: De-dup across columns within an observation

Re: De-dup across columns within an observation

Re: De-dup across columns within an observation

Re: De-dup across columns within an observation

Re: De-dup across columns within an observation

Re: De-dup across columns within an observation

Re: De-dup across columns within an observation

Classroom Training Available!