BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Scorpx
Obsidian | Level 7

Hello.

 

I am trying to clean a dataset with several survey data merged, which means that the field "email" (and several others could be duplicated). I want to create just one variable without the duplicates.

 

Data example:

Email_1Email_2Email_3Email_4Email_5
TESTING@NOWHERE.ORG.TESTING@NOWHERE.ORG..
..TESTING1@NOWHERE.ORG..
TESTING2@NOWHERE.ORGTESTING2@NOWHERE.ORG.TESTING2@NOWHERE.ORG.
TESTING3@NOWHERE.ORG..TESTING3@NOWHERE.ORGTESTING3@NOWHERE.ORG

 

What I want to get:

Email_1Email_2Email_3Email_4Email_5Email
TESTING@NOWHERE.ORG.TESTING@NOWHERE.ORG..TESTING@NOWHERE.ORG
..TESTING1@NOWHERE.ORG..TESTING1@NOWHERE.ORG
TESTING2@NOWHERE.ORGTESTING2@NOWHERE.ORG.TESTING2@NOWHERE.ORG.TESTING2@NOWHERE.ORG
TESTING3@NOWHERE.ORG..TESTING3@NOWHERE.ORGTESTING3@NOWHERE.ORGTESTING3@NOWHERE.ORG

 

I found this code on the forum Concatenation removing duplicates :

Data test2;
   set test;
   array values (*) email_1-email_5;
   length Email $ 100 ;
   do i=1 to dim(values);
   if indexw(Email,values[i],'/')= 0 then Email= catx('/',Email,values[i]);
   end;
   drop i;
run;

But it concatenates with the "/", like this:

Email_1Email_2Email_3Email_4Email_5Email
TESTING@NOWHERE.ORG.TESTING@NOWHERE.ORG..TESTING@NOWHERE.ORG/TESTING@NOWHERE.ORG
..TESTING1@NOWHERE.ORG..TESTING1@NOWHERE.ORG
TESTING2@NOWHERE.ORGTESTING2@NOWHERE.ORG.TESTING2@NOWHERE.ORG.TESTING2@NOWHERE.ORG/TESTING2@NOWHERE.ORG/TESTING2@NOWHERE.ORG
TESTING3@NOWHERE.ORG..TESTING3@NOWHERE.ORGTESTING3@NOWHERE.ORGTESTING3@NOWHERE.ORG/TESTING3@NOWHERE.ORG/TESTING3@NOWHERE.ORG

 

Thank you for your help.

 

I am using SAS EG 7.12 (7.100.2.3350) (64-bit)

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

@Scorpx wrote:

Dear @ballardw, thank you for your response.

Still, regarding what you mentioned:

2. It doesn't have different values (I checked them). However, in the rare case, it has a space or something that makes it appear different, I wouldn't mind having it repeated with "/".

 


This is a critical piece of information. That means that you really only want one value (which is quite different than the request of the post you linked). There is  a function that specifically returns the first non-missing value in a list: COALESCEC (or Coalesce for numeric variables)

So to assign the first non-missing value to the variable email:

email = coalescec(of email_:);

 

 

 

View solution in original post

9 REPLIES 9
A_Kh
Lapis Lazuli | Level 10

This code needs a little modification: 

Data test2;
   set test;
   array values (*) email_1-email_5;
   length Email $ 100 ;
   Email= email_1;
   do i=2 to dim(values);
     	if values{i} ne values{1} then Email= catx('/',Email,values{i});
   end;
   drop i;
run;
Scorpx
Obsidian | Level 7

Thank you for your response. However, it is still repeated and divided it using "/."

A_Kh
Lapis Lazuli | Level 10

Looks like missing values are captured as '.' instead of blanks. There might be hidden special characters as well. I've added STRIP and COMPRESS functions to eliminate '.'(dots) and hidden blanks

    
Data test2;
   set have;
   array values (*) email_1-email_5;
   length Email $ 100 ;
   email= compress(email_1, '. ');
   do i=2 to dim(values);
		if strip(values{i}) ne '.' then Values{i}= strip(values{i}); 
		else Values{i}= '';
     	if compress(values{i}, '. ') ne compress(values{1}, '. ') then Email= catx('/',Email,values{i});
   end;
   drop i;
proc print;run;



ballardw
Super User

First thing I am concerned is that you show . in the example "data". That would typically mean in a SAS data set that the value of the variable is numeric and so INDEXW would not typically be appropriate.

Second, you should show what you expect as a result when there are different values in at least some of the observations.

 

Third provide a data step with some example values. NOTE that the post you linked says "No data so untested".

Try

Data test2;
   set test;
   array values (*) email_1-email_5;
   length Email $ 125 ;
   do i=1 to dim(values);
      if indexw(strip(Email),strip(values[i]),'/')= 0 then Email= catx('/',Email,values[i]);
   end;
   drop i;
run;
Scorpx
Obsidian | Level 7

Dear @ballardw, thank you for your response.

 

I appreciate your concerns, and you are probably bored of repeatedly answering the same questions, but I tried to be as specific as possible; your answer seems rude to me. Not because of what you said but how you say it, not all of us have the same level, and I appreciate the effort you and the rest of the community does to help us out; take into consideration that a bit of guidance is always well received.

 

Still, regarding what you mentioned:

1. I Didn't know "INDEXW" didn't help; my data is not numeric as I showed in the "Example."

2. It doesn't have different values (I checked them). However, in the rare case, it has a space or something that makes it appear different, I wouldn't mind having it repeated with "/".

3. I'd like to do it, but those are emails, and I'd prefer to keep them private. I also read the note on the post; that's one of the reasons I wrote this one.

 

Again, thank you for the time you took to answer my question. Have a great rest of your day.

ballardw
Super User

@Scorpx wrote:

Dear @ballardw, thank you for your response.

Still, regarding what you mentioned:

2. It doesn't have different values (I checked them). However, in the rare case, it has a space or something that makes it appear different, I wouldn't mind having it repeated with "/".

 


This is a critical piece of information. That means that you really only want one value (which is quite different than the request of the post you linked). There is  a function that specifically returns the first non-missing value in a list: COALESCEC (or Coalesce for numeric variables)

So to assign the first non-missing value to the variable email:

email = coalescec(of email_:);

 

 

 

Scorpx
Obsidian | Level 7

This is the answer. Thank you. Yes, I wanted just one variable with one value:

 

As I mentioned:

I want to create just one variable without the duplicates

Next time I will try to provide better explanations. 

 

Thank you for your time.

Tom
Super User Tom
Super User

Your description of what you want is not clear.  Do you want to pick just one email address?  If not do you want to write a separate observation for each unique email address? 

 

Also what are the rules for detecting duplicate email addresses? Are the email addresses case sensitive?  That is do you to treat "Joe@gmail.com" and "JOE@GMAIL.COM" as one email address or two? 

 

I do not see any difference between your input and your output data.  Also what is with the periods in you data display?

 

Let's make some example data.  Let's keep some of those periods in the data.

data have;
  infile cards dsd truncover;
  input (email_1-email_5) (:$char100.) ;
cards;
TESTING@NOWHERE.ORG,,TESTING@NOWHERE.ORG,,
,.,TESTING1@NOWHERE.ORG,.,
TESTING2@NOWHERE.ORG,TESTING2@NOWHERE.ORG,,TESTING2@NOWHERE.ORG,
TESTING3@NOWHERE.ORG,,.,TESTING3@NOWHERE.ORG,TESTING3@NOWHERE.ORG
;

Now let's loop over the source variables and only put new values into the temporary array.

Then we can either concatenate the values into a single LOOONG variable.

data want;
  length email_list $500 ;
  set have ;
  array emails email_1-email_5;
  array temp [5] $100 _temporary_;
  do index=1 to dim(emails);
    if compress(emails[index],,'kad') ne ' ' then do;
      if emails[index] not in temp then temp[index]=emails[index];
    end;
  end;
  email_list = catx(';', of temp[*]) ;
  drop index ;
run;
Obs    email_list                                         email_1           email_2

 1     TESTING@NOWHERE.ORG                          TESTING@NOWHERE.ORG
 2     TESTING@NOWHERE.ORG;TESTING1@NOWHERE.ORG                             .
 3     TESTING2@NOWHERE.ORG;TESTING1@NOWHERE.ORG    TESTING2@NOWHERE.ORG    TESTING2@NOWHERE.ORG
 4     TESTING3@NOWHERE.ORG;TESTING1@NOWHERE.ORG    TESTING3@NOWHERE.ORG

Obs    email_3                 email_4                       email_5

 1     TESTING@NOWHERE.ORG
 2     TESTING1@NOWHERE.ORG    .
 3                             TESTING2@NOWHERE.ORG
 4     .                       TESTING3@NOWHERE.ORG    TESTING3@NOWHERE.ORG

Or into multiple observations.


data want;
  length email_number 8 email $100 ;
  set have ;
  array emails email_1-email_5;
  array temp [5] $100 _temporary_;
  do index=1 to dim(emails);
    if compress(emails[index],,'kad') ne ' ' then do;
      if emails[index] not in temp then temp[index]=emails[index];
    end;
  end;
  email_number=0;
  do index=1 to dim(temp);
    if temp[index] ne ' ' then do;
      email_number + 1;
      email = temp[index];
      output;
    end;
  end;
  drop index;
run;

Result

  e
  m
  a
  i
  l
  _                               e                    e                    e                    e                    e
  n                               m                    m                    m                    m                    m
  u          e                    a                    a                    a                    a                    a
  m          m                    i                    i                    i                    i                    i
O b          a                    l                    l                    l                    l                    l
b e          i                    _                    _                    _                    _                    _
s r          l                    1                    2                    3                    4                    5

1 1 TESTING@NOWHERE.ORG  TESTING@NOWHERE.ORG                       TESTING@NOWHERE.ORG
2 1 TESTING@NOWHERE.ORG                       .                    TESTING1@NOWHERE.ORG .
3 2 TESTING1@NOWHERE.ORG                      .                    TESTING1@NOWHERE.ORG .
4 1 TESTING2@NOWHERE.ORG TESTING2@NOWHERE.ORG TESTING2@NOWHERE.ORG                      TESTING2@NOWHERE.ORG
5 2 TESTING1@NOWHERE.ORG TESTING2@NOWHERE.ORG TESTING2@NOWHERE.ORG                      TESTING2@NOWHERE.ORG
6 1 TESTING3@NOWHERE.ORG TESTING3@NOWHERE.ORG                      .                    TESTING3@NOWHERE.ORG TESTING3@NOWHERE.ORG
7 2 TESTING1@NOWHERE.ORG TESTING3@NOWHERE.ORG                      .                    TESTING3@NOWHERE.ORG TESTING3@NOWHERE.ORG

 

Scorpx
Obsidian | Level 7

Dear @Tom ,

 

Thank you for your response. Since I seem to be in error, please allow me to explain myself again.

 

1.- I have five columns of emails (Email_1 - Email_5); they are capitalized to avoid the Joe or JOE or JoE, etc. 

2.- Each row has the same email or none (because it is the same respondent/case, but a different survey); no column has every email. 

3.- I want to have a column named "email" that collects each email to avoid having five different "Email_1, Email_2, Email_3..." so, instead of having 10,000 emails spread and repeated over five variables, I want to have just one variable with the 10,000 emails.

4.- The output has this "email" column summarizing the emails.

5.- Please think of the dots as blanks. (this want is my bad, apologies).

 

Scorpx_0-1684269491674.png

I hope I have given you a better description this time, I appreciate the help you are providing.

 

Regards.

 

 

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 1190 views
  • 0 likes
  • 4 in conversation