Hello. I would like to eliminate duplicate values within an observation, across many columns. For example, if I have these data:
id | var1 | var2 | var3 | var4 | var5 |
1 | 49800 Main St. | 222 Horg Blvd | 498 Main St. | 329 Old Ave. | 109 E. Chorg St. |
2 | 178 Forg Pl. | 781 Noodle Rd. | 546 Sunshine Dr. | ||
3 | 329 Old Ave. | 329 Old Ave. | 329 Old Ave. | 80375 Zorg Ln. | |
4 | 3872 Gorg Rd. | ||||
5 | 491 N. Yay Way | 2985 Cool St. | 172 Harbor | 172 Harbor | 172 Harbor |
I would instead like to have these data:
id | var1 | var2 | var3 | var4 | var5 |
1 | 49800 Main St. | 222 Horg Blvd | 329 Old Ave. | 109 E. Chorg St. | |
2 | 178 Forg Pl. | 781 Noodle Rd. | 546 Sunshine Dr. | ||
3 | 329 Old Ave. | 80375 Zorg Ln. | |||
4 | 3872 Gorg Rd. | ||||
5 | 491 N. Yay Way | 2985 Cool St. | 172 Harbor |
keeping the first instance of a value within an observation. Transposing the data set is fine, but there are ~150k observations (and 332 columns) so there would be ~150k columns if transposed, which my computer has trouble with.
I would very much appreciate any help. I have been trying things all day and haven't yet found an approach. Below is code to generate the example data set. Thank you!
data have;
infile datalines delimiter=',';
length var1 $17 var2 $15 var3 $17 var4 $15 var5 $17;
input id var1 var2 var3 var4 var5;
datalines;
1,49800 Main St.,222 Horg Blvd,498 Main St.,329 Old Ave.,109 E. Chorg St.
2,178 Forg Pl.,781 Noodle Rd.,546 Sunshine Dr.,.,.
3,329 Old Ave.,329 Old Ave.,329 Old Ave.,80375 Zorg Ln.,.
4,3872 Gorg Rd.,.,.,.,.
5,491 N. Yay Way,2985 Cool St.,172 Harbor,172 Harbor,172 Harbor
;
Edited for better example data
@kpberger wrote:
I do need to retain the relative positions. I edited my post so the example data reflects that, thanks.
The variables are street addresses. "123 Sesame St., Who Knows, State 12345" So all different lengths and all have multiple words. Thanks for any help!
Not the lengths of the values, are the defined lengths of the variables the same? This may have been set when you read the data into SAS. If not, that may be the first step.
In your new example data set you should read all the variables into the same length variable. When Var2 has length 15 and Var3 has length 17 if the value of Var3 has to shift to Var2 you will lose 2 characters. Easy to fix. Better to use
length var1-var5 $ 17 ;
or what ever the longest length any of the variable may have.
This may do what you want.
data want; /* force all the variables to the same length just in case ,repead Id on this length so it still appears as first variable in the set */ length id 8 var1-var5 $ 20; set have; array v (*) var1-var5; /* set dupes to missing*/ do i= 1 to (dim(v)-1); do j=(i+1) to dim(v); if v[i]=v[j] then call missing(v[j]); end; end; /* count missing elements that need to be moved*/ do i=1 to dim(v); misscount= sum(misscount,missing(v[i])); end; /*push the missings to the end*/ do k=1 to misscount; do i= 1 to (dim(v)-1); do j=(i+1) to dim(v); if missing (v[i]) then do; v[i]=v[i+1]; call missing(v[i+1]); end;; end; end; end; drop i j k misscount; run;
First caveat: no claim to optimal efficiency, there are a lot of comparisons going on.
Second, capitalization differences, example 123 Old St and 123 old st, are considered different. If that is of concern all of the comparisons should use in the set dupes to missing loops to standardize the case such as
if upcase(v[i]) = upcase (v[j]) then call missing(v[j]);
Third, this will not catch minor spelling differences such as "123 Old St" and "123 Old St." <note the . in the second
spelling or "123 Old St" and "123 Old St", extra space between 123 and Old, Catching those is whole different, ie complicated, process. If you may have such minor steps then the thing to do is standardize elements prior to this removal step.
The general flow is remove the later, higher numbered variable, of equal values. Then determine how many missing we have to shift to the end. The second block of do loops uses that bit of information to repeatedly shift values to the left if multiple missing values were assigned in between values to keep.
I don't think you quite understand how to TRANSPOSE.
proc transpose data=have out=tall;
by id;
var var1-var5 ;
run;
proc sort nodupkey ;
by id col1 ;
run;
proc transpose data=tall out=want(drop=_name_) prefix=var;
by id ;
var col1;
run;
Of course it might be a lot easier to work with the data if you skipped that last step and left it has multiple observations.
This works except I do need to keep the order of the data. This solution reorders things alphabetically. My fault for making the example data alphabetical. I'll change it.
Your data implies the result is sorted. Is that actually case or do you need to retain relative positions of values?
If order is not important this becomes quite a bit simpler to code.
Also are the variables all of the same length? If not shifting position can cause truncation issues.
One more question, are the values of the variables single words or are there multiple words within some of the variables?
If order matters use a data step to transpose it instead of PROC TRANSPOSE and add a order counter in that step. Then you can use it to re-order when you re-transpose it if needed.
https://stats.idre.ucla.edu/sas/modules/reshaping-data-long-to-wide-using-the-data-step/
@kpberger wrote:
I do need to retain the relative positions. I edited my post so the example data reflects that, thanks.
The variables are street addresses. "123 Sesame St., Who Knows, State 12345" So all different lengths and all have multiple words. Thanks for any help!
Not the lengths of the values, are the defined lengths of the variables the same? This may have been set when you read the data into SAS. If not, that may be the first step.
In your new example data set you should read all the variables into the same length variable. When Var2 has length 15 and Var3 has length 17 if the value of Var3 has to shift to Var2 you will lose 2 characters. Easy to fix. Better to use
length var1-var5 $ 17 ;
or what ever the longest length any of the variable may have.
This may do what you want.
data want; /* force all the variables to the same length just in case ,repead Id on this length so it still appears as first variable in the set */ length id 8 var1-var5 $ 20; set have; array v (*) var1-var5; /* set dupes to missing*/ do i= 1 to (dim(v)-1); do j=(i+1) to dim(v); if v[i]=v[j] then call missing(v[j]); end; end; /* count missing elements that need to be moved*/ do i=1 to dim(v); misscount= sum(misscount,missing(v[i])); end; /*push the missings to the end*/ do k=1 to misscount; do i= 1 to (dim(v)-1); do j=(i+1) to dim(v); if missing (v[i]) then do; v[i]=v[i+1]; call missing(v[i+1]); end;; end; end; end; drop i j k misscount; run;
First caveat: no claim to optimal efficiency, there are a lot of comparisons going on.
Second, capitalization differences, example 123 Old St and 123 old st, are considered different. If that is of concern all of the comparisons should use in the set dupes to missing loops to standardize the case such as
if upcase(v[i]) = upcase (v[j]) then call missing(v[j]);
Third, this will not catch minor spelling differences such as "123 Old St" and "123 Old St." <note the . in the second
spelling or "123 Old St" and "123 Old St", extra space between 123 and Old, Catching those is whole different, ie complicated, process. If you may have such minor steps then the thing to do is standardize elements prior to this removal step.
The general flow is remove the later, higher numbered variable, of equal values. Then determine how many missing we have to shift to the end. The second block of do loops uses that bit of information to repeatedly shift values to the left if multiple missing values were assigned in between values to keep.
Working with series is always easier with a long dataset layout, see Maxim 19.
To dedup and keep the order, use a hash object:
proc transpose
data=have
out=long
;
by id;
var var:;
run;
data want;
set long;
by id;
if _n_ = 1
then do;
declare hash h ();
h.definekey("col1");
h.definedone();
end;
if first.id then rc = h.clear();
if h.find() = 0
then delete;
else rc = h.add();
run;
Untested, posted from my tablet.
data have;
infile datalines delimiter=',';
length var1 $17 var2 $15 var3 $17 var4 $15 var5 $17;
input id var1 var2 var3 var4 var5;
datalines;
1,498 Main St.,222 Horg Blvd,498 Main St.,329 Old Ave.,109 E. Chorg St.
2,178 Forg Pl.,781 Noodle Rd.,546 Sunshine Dr.,.,.
3,329 Old Ave.,329 Old Ave.,329 Old Ave.,80375 Zorg Ln.,.
4,3872 Gorg Rd.,.,.,.,.
5,491 N. Yay Way,2985 Cool St.,172 Harbor,172 Harbor,172 Harbor
;
data want;
set have;
array x{*} $ var1-var5;
array y{*} $ 80 new1-new5;
n=0;
do i=1 to dim(x);
if x{i} not in y then do;n+1;y{n}=x{i};end;
end;
drop i n var1-var5 ;
run;
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.