BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
kpberger
Obsidian | Level 7

Hello. I would like to eliminate duplicate values within an observation, across many columns. For example, if I have these data:

id var1 var2 var3 var4 var5
1 49800 Main St. 222 Horg Blvd 498 Main St. 329 Old Ave. 109 E. Chorg St.
2 178 Forg Pl. 781 Noodle Rd. 546 Sunshine Dr.    
3 329 Old Ave. 329 Old Ave. 329 Old Ave. 80375 Zorg Ln.  
4 3872 Gorg Rd.        
5 491 N. Yay Way 2985 Cool St. 172 Harbor 172 Harbor 172 Harbor

 

I would instead like to have these data:

id var1 var2 var3 var4 var5
1 49800 Main St. 222 Horg Blvd 329 Old Ave. 109 E. Chorg St.  
2 178 Forg Pl. 781 Noodle Rd. 546 Sunshine Dr.    
3 329 Old Ave. 80375 Zorg Ln.      
4 3872 Gorg Rd.        
5 491 N. Yay Way 2985 Cool St. 172 Harbor    


keeping the first instance of a value within an observation. Transposing the data set is fine, but there are ~150k observations (and 332 columns) so there would be ~150k columns if transposed, which my computer has trouble with.

I would very much appreciate any help. I have been trying things all day and haven't yet found an approach. Below is code to generate the example data set. Thank you!

 

data have;
infile datalines delimiter=',';
length var1 $17 var2 $15 var3 $17 var4 $15 var5 $17;
input id var1 var2 var3 var4 var5;
datalines;
1,49800 Main St.,222 Horg Blvd,498 Main St.,329 Old Ave.,109 E. Chorg St.
2,178 Forg Pl.,781 Noodle Rd.,546 Sunshine Dr.,.,.
3,329 Old Ave.,329 Old Ave.,329 Old Ave.,80375 Zorg Ln.,.
4,3872 Gorg Rd.,.,.,.,.
5,491 N. Yay Way,2985 Cool St.,172 Harbor,172 Harbor,172 Harbor
;

 Edited for better example data

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

@kpberger wrote:
I do need to retain the relative positions. I edited my post so the example data reflects that, thanks.

The variables are street addresses. "123 Sesame St., Who Knows, State 12345" So all different lengths and all have multiple words. Thanks for any help!

Not the lengths of the values, are the defined lengths of the variables the same? This may have been set when you read the data into SAS. If not, that may be the first step.

 

In your new example data set you should read all the variables into the same length variable. When Var2 has length 15 and Var3 has length 17 if the value of Var3 has to shift to Var2 you will lose 2 characters. Easy to fix. Better to use

length var1-var5 $ 17 ;

or what ever the longest length any of the variable may have.

 

This may do what you want.

data want;
   /* force all the variables to the same length
      just in case ,repead Id on this length
      so it still appears as first variable in the set
   */
   length id 8 var1-var5 $ 20; 
   set have;
   array v (*) var1-var5;
   /* set dupes to missing*/
   do i= 1 to (dim(v)-1);
      do j=(i+1) to dim(v);
         if v[i]=v[j] then call missing(v[j]);
      end;
   end;
   /* count missing elements that need to be moved*/
   do i=1 to dim(v);
      misscount= sum(misscount,missing(v[i]));
   end;
   /*push the missings to the end*/

   do k=1 to misscount;
   do i= 1 to (dim(v)-1);
      do j=(i+1) to dim(v);
         if missing (v[i]) then do;
            v[i]=v[i+1];
            call missing(v[i+1]);
         end;;
      end;
   end;
   end;

   drop i j k misscount;

run;

First caveat: no claim to optimal efficiency, there are a lot of comparisons going on.

Second, capitalization differences, example 123 Old St and 123 old st, are considered different. If that is of concern all of the comparisons should use in the set dupes to missing loops to standardize the case such as

if  upcase(v[i]) = upcase (v[j]) then call missing(v[j]);

Third, this will not catch minor spelling differences such as "123 Old St" and "123 Old St." <note the . in the second

spelling or "123 Old St" and "123  Old St", extra space between 123 and Old, Catching those is whole different, ie complicated, process. If you may have such minor steps then the thing to do is standardize elements prior to this removal step.

 

The general flow is remove the later, higher numbered variable, of equal values. Then determine how many missing we have to shift to the end. The second block of do loops uses that bit of information to repeatedly shift values to the left if multiple missing values were assigned in between values to keep.

View solution in original post

10 REPLIES 10
Tom
Super User Tom
Super User

I don't think you quite understand how to TRANSPOSE.

proc transpose data=have out=tall;
  by id;
  var var1-var5 ;
run;

proc sort nodupkey ;
  by id col1 ;
run;

proc transpose data=tall out=want(drop=_name_) prefix=var;
  by id ;
  var col1;
run;

Of course it might be a lot easier to work with the data if you skipped that last step and left it has multiple observations.

kpberger
Obsidian | Level 7

This works except I do need to keep the order of the data. This solution reorders things alphabetically. My fault for making the example data alphabetical. I'll change it.

ballardw
Super User

Your data implies the result is sorted. Is that actually case or do you need to retain relative positions of values?

 

If order is not important this becomes quite a bit simpler to code.

 

Also are the variables all of the same length? If not shifting position can cause truncation issues.

One more question, are the values of the variables single words or are there multiple words within some of the variables?

kpberger
Obsidian | Level 7
I do need to retain the relative positions. I edited my post so the example data reflects that, thanks.

The variables are street addresses. "123 Sesame St., Who Knows, State 12345" So all different lengths and all have multiple words. Thanks for any help!
Reeza
Super User

If order matters use a data step to transpose it instead of PROC TRANSPOSE and add a order counter in that step. Then you can use it to re-order when you re-transpose it if needed.


https://stats.idre.ucla.edu/sas/modules/reshaping-data-long-to-wide-using-the-data-step/

ballardw
Super User

@kpberger wrote:
I do need to retain the relative positions. I edited my post so the example data reflects that, thanks.

The variables are street addresses. "123 Sesame St., Who Knows, State 12345" So all different lengths and all have multiple words. Thanks for any help!

Not the lengths of the values, are the defined lengths of the variables the same? This may have been set when you read the data into SAS. If not, that may be the first step.

 

In your new example data set you should read all the variables into the same length variable. When Var2 has length 15 and Var3 has length 17 if the value of Var3 has to shift to Var2 you will lose 2 characters. Easy to fix. Better to use

length var1-var5 $ 17 ;

or what ever the longest length any of the variable may have.

 

This may do what you want.

data want;
   /* force all the variables to the same length
      just in case ,repead Id on this length
      so it still appears as first variable in the set
   */
   length id 8 var1-var5 $ 20; 
   set have;
   array v (*) var1-var5;
   /* set dupes to missing*/
   do i= 1 to (dim(v)-1);
      do j=(i+1) to dim(v);
         if v[i]=v[j] then call missing(v[j]);
      end;
   end;
   /* count missing elements that need to be moved*/
   do i=1 to dim(v);
      misscount= sum(misscount,missing(v[i]));
   end;
   /*push the missings to the end*/

   do k=1 to misscount;
   do i= 1 to (dim(v)-1);
      do j=(i+1) to dim(v);
         if missing (v[i]) then do;
            v[i]=v[i+1];
            call missing(v[i+1]);
         end;;
      end;
   end;
   end;

   drop i j k misscount;

run;

First caveat: no claim to optimal efficiency, there are a lot of comparisons going on.

Second, capitalization differences, example 123 Old St and 123 old st, are considered different. If that is of concern all of the comparisons should use in the set dupes to missing loops to standardize the case such as

if  upcase(v[i]) = upcase (v[j]) then call missing(v[j]);

Third, this will not catch minor spelling differences such as "123 Old St" and "123 Old St." <note the . in the second

spelling or "123 Old St" and "123  Old St", extra space between 123 and Old, Catching those is whole different, ie complicated, process. If you may have such minor steps then the thing to do is standardize elements prior to this removal step.

 

The general flow is remove the later, higher numbered variable, of equal values. Then determine how many missing we have to shift to the end. The second block of do loops uses that bit of information to repeatedly shift values to the left if multiple missing values were assigned in between values to keep.

Astounding
PROC Star
Ballardw, good approach. I think you need to tighten up the logic in the final section to properly handle consecutive blanks.
kpberger
Obsidian | Level 7
Fantastic, it works great. Thank you very much! I really appreciate you taking the time.
Kurt_Bremser
Super User

Working with series is always easier with a long dataset layout, see Maxim 19.

To dedup and keep the order, use a hash object:

proc transpose
  data=have
  out=long
;
by id;
var var:;
run;

data want;
set long;
by id;
if _n_ = 1
then do;
  declare hash h ();
  h.definekey("col1");
  h.definedone();
end;
if first.id then rc = h.clear();
if h.find() = 0
then delete;
else rc = h.add();
run;

Untested, posted from my tablet.

Ksharp
Super User
data have;
infile datalines delimiter=',';
length var1 $17 var2 $15 var3 $17 var4 $15 var5 $17;
input id var1 var2 var3 var4 var5;
datalines;
1,498 Main St.,222 Horg Blvd,498 Main St.,329 Old Ave.,109 E. Chorg St.
2,178 Forg Pl.,781 Noodle Rd.,546 Sunshine Dr.,.,.
3,329 Old Ave.,329 Old Ave.,329 Old Ave.,80375 Zorg Ln.,.
4,3872 Gorg Rd.,.,.,.,.
5,491 N. Yay Way,2985 Cool St.,172 Harbor,172 Harbor,172 Harbor
;

data want;
 set have;
 array x{*} $ var1-var5;
 array y{*} $ 80 new1-new5;
n=0;
do i=1 to dim(x);
 if x{i} not in y then do;n+1;y{n}=x{i};end;
end;
drop i n var1-var5 ;
run;

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 890 views
  • 1 like
  • 7 in conversation