Solved: Re: merge columns in the same dataset without duplicates

Scorpx · Posted 05-16-2023 11:43 AM

Hello.

I am trying to clean a dataset built by merging data from several surveys, which means that the field "email" (and several others) could be duplicated. I want to create just one variable without the duplicates.

Data example:

Email_1	Email_2	Email_3	Email_4	Email_5
TESTING@NOWHERE.ORG	.	TESTING@NOWHERE.ORG	.	.
.	.	TESTING1@NOWHERE.ORG	.	.
TESTING2@NOWHERE.ORG	TESTING2@NOWHERE.ORG	.	TESTING2@NOWHERE.ORG	.
TESTING3@NOWHERE.ORG	.	.	TESTING3@NOWHERE.ORG	TESTING3@NOWHERE.ORG

What I want to get:

Email_1	Email_2	Email_3	Email_4	Email_5	Email
TESTING@NOWHERE.ORG	.	TESTING@NOWHERE.ORG	.	.	TESTING@NOWHERE.ORG
.	.	TESTING1@NOWHERE.ORG	.	.	TESTING1@NOWHERE.ORG
TESTING2@NOWHERE.ORG	TESTING2@NOWHERE.ORG	.	TESTING2@NOWHERE.ORG	.	TESTING2@NOWHERE.ORG
TESTING3@NOWHERE.ORG	.	.	TESTING3@NOWHERE.ORG	TESTING3@NOWHERE.ORG	TESTING3@NOWHERE.ORG

I found this code in the forum post "Concatenation removing duplicates":

Data test2;
   set test;
   array values (*) email_1-email_5;
   length Email $ 100 ;
   do i=1 to dim(values);
   if indexw(Email,values[i],'/')= 0 then Email= catx('/',Email,values[i]);
   end;
   drop i;
run;

But it concatenates with the "/", like this:

Email_1	Email_2	Email_3	Email_4	Email_5	Email
TESTING@NOWHERE.ORG	.	TESTING@NOWHERE.ORG	.	.	TESTING@NOWHERE.ORG/TESTING@NOWHERE.ORG
.	.	TESTING1@NOWHERE.ORG	.	.	TESTING1@NOWHERE.ORG
TESTING2@NOWHERE.ORG	TESTING2@NOWHERE.ORG	.	TESTING2@NOWHERE.ORG	.	TESTING2@NOWHERE.ORG/TESTING2@NOWHERE.ORG/TESTING2@NOWHERE.ORG
TESTING3@NOWHERE.ORG	.	.	TESTING3@NOWHERE.ORG	TESTING3@NOWHERE.ORG	TESTING3@NOWHERE.ORG/TESTING3@NOWHERE.ORG/TESTING3@NOWHERE.ORG

Thank you for your help.

I am using SAS EG 7.12 (7.100.2.3350) (64-bit)

A_Kh · Posted 05-16-2023 12:24 PM

This code needs a little modification:

Data test2;
   set test;
   array values (*) email_1-email_5;
   length Email $ 100 ;
   Email= email_1;
   do i=2 to dim(values);
      if values{i} ne values{1} then Email= catx('/',Email,values{i});
   end;
   drop i;
run;

Scorpx · Posted 05-16-2023 01:51 PM

Thank you for your response. However, the value is still repeated and separated with "/".

A_Kh · Posted 05-16-2023 03:11 PM

Looks like missing values are captured as '.' instead of blanks. There might be hidden special characters as well. I've added STRIP and COMPRESS functions to eliminate '.' (dots) and hidden blanks.

    
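Data test2;
   set have;
   array values (*) email_1-email_5;
   length Email $ 100 ;
   /* start from email_1 unless it is only a '.' placeholder; */
   /* COMPRESS here would also strip the dots inside the address */
   if strip(email_1) ne '.' then Email= strip(email_1);
   do i=2 to dim(values);
      /* blank out '.' placeholders and trim stray spaces */
      if strip(values{i}) ne '.' then Values{i}= strip(values{i});
      else Values{i}= '';
      if compress(values{i}, '. ') ne compress(values{1}, '. ') then Email= catx('/',Email,values{i});
   end;
   drop i;
run;

proc print;run;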

ballardw · Posted 05-16-2023 02:00 PM

My first concern is that you show . in the example "data". In a SAS data set that would typically mean that the variable is numeric, so INDEXW would not typically be appropriate.

Second, you should show what you expect as a result when there are different values in at least some of the observations.

Third, provide a data step with some example values. NOTE that the post you linked says "No data so untested".

Try

Data test2;
   set test;
   array values (*) email_1-email_5;
   length Email $ 125 ;
   do i=1 to dim(values);
      if indexw(strip(Email),strip(values[i]),'/')= 0 then Email= catx('/',Email,values[i]);
   end;
   drop i;
run;
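
One possible explanation for why STRIP matters here, offered as a sketch rather than a confirmed diagnosis: Email is a fixed-length character variable, so its value is padded with trailing blanks, and when '/' is the only delimiter INDEXW may treat those blanks as part of the word. The step below (the variable names list, item, padded and stripped are illustrative) writes both results to the log so they can be compared on your own system:

```sas
data _null_;
  length list $100 item $50;
  list = 'TESTING@NOWHERE.ORG';   /* stored padded with blanks to 100 characters */
  item = 'TESTING@NOWHERE.ORG';   /* stored padded with blanks to 50 characters */
  /* arguments passed with their trailing blanks */
  padded   = indexw(list, item, '/');
  /* trailing blanks removed before the search */
  stripped = indexw(strip(list), strip(item), '/');
  put padded= stripped=;
run;
```

If the two values differ, the unstripped check never finds the existing entry, which would explain why the same address keeps being appended.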

Scorpx · Posted 05-16-2023 02:18 PM

Dear @ballardw, thank you for your response.

I appreciate your concerns, and you are probably tired of repeatedly answering the same questions, but I tried to be as specific as possible, and your answer seems rude to me, not because of what you said but because of how you said it. Not all of us are at the same level. I appreciate the effort you and the rest of the community make to help us out; please take into consideration that a bit of guidance is always well received.

Still, regarding what you mentioned:

1. I didn't know "INDEXW" didn't help; my data is not numeric, as I showed in the "Example."

2. It doesn't have different values (I checked them). However, in the rare case that it has a space or something that makes it appear different, I wouldn't mind having it repeated with "/".

3. I'd like to do it, but those are emails, and I'd prefer to keep them private. I also read the note on the post; that's one of the reasons I wrote this one.

Again, thank you for the time you took to answer my question. Have a great rest of your day.

ballardw · Posted 05-16-2023 04:31 PM

@Scorpx wrote:

Dear @ballardw, thank you for your response.

Still, regarding what you mentioned:

2. It doesn't have different values (I checked them). However, in the rare case that it has a space or something that makes it appear different, I wouldn't mind having it repeated with "/".

This is a critical piece of information. That means that you really only want one value (which is quite different from the request in the post you linked). There is a function that specifically returns the first non-missing value in a list: COALESCEC (or COALESCE for numeric variables).

So to assign the first non-missing value to the variable email:

email = coalescec(of email_:);
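
A minimal sketch of that assignment in a full DATA step, assuming missing emails are stored as blanks. The dataset names have and want and the sample rows are made up for illustration:

```sas
/* Hypothetical sample data; empty fields stand for missing emails */
data have;
  infile datalines dsd truncover;
  input (email_1-email_5) (:$100.);
datalines;
TESTING@NOWHERE.ORG,,TESTING@NOWHERE.ORG,,
,,TESTING1@NOWHERE.ORG,,
TESTING2@NOWHERE.ORG,TESTING2@NOWHERE.ORG,,TESTING2@NOWHERE.ORG,
;

data want;
  set have;
  length email $100;                       /* set the length explicitly */
  email = coalescec(of email_1-email_5);   /* first non-blank value, left to right */
run;

proc print data=want;
run;
```

The explicit range email_1-email_5 is used here instead of the email_: prefix list because the prefix list would also pick up any other variable whose name starts with email_.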

Scorpx · Posted 05-16-2023 04:52 PM

This is the answer. Thank you. Yes, I wanted just one variable with one value:

As I mentioned:

I want to create just one variable without the duplicates

Next time I will try to provide better explanations.

Thank you for your time.

Tom · Posted 05-16-2023 04:09 PM

Your description of what you want is not clear. Do you want to pick just one email address? If not, do you want to write a separate observation for each unique email address?

Also, what are the rules for detecting duplicate email addresses? Are the email addresses case sensitive? That is, do you want to treat "Joe@gmail.com" and "JOE@GMAIL.COM" as one email address or two?

I do not see any difference between your input and your output data. Also, what is with the periods in your data display?

Let's make some example data. Let's keep some of those periods in the data.

data have;
  infile cards dsd truncover;
  input (email_1-email_5) (:$char100.) ;
cards;
TESTING@NOWHERE.ORG,,TESTING@NOWHERE.ORG,,
,.,TESTING1@NOWHERE.ORG,.,
TESTING2@NOWHERE.ORG,TESTING2@NOWHERE.ORG,,TESTING2@NOWHERE.ORG,
TESTING3@NOWHERE.ORG,,.,TESTING3@NOWHERE.ORG,TESTING3@NOWHERE.ORG
;

Now let's loop over the source variables and only put new values into the temporary array.

Then we can either concatenate the values into a single LOOONG variable.

data want;
  length email_list $500 ;
  set have ;
  array emails email_1-email_5;
  array temp [5] $100 _temporary_;
  /* _temporary_ arrays are retained across observations, so clear it for each row */
  call missing(of temp[*]);
  do index=1 to dim(emails);
    /* skip values with no letters or digits (blanks and lone periods) */
    if compress(emails[index],,'kad') ne ' ' then do;
      if emails[index] not in temp then temp[index]=emails[index];
    end;
  end;
  email_list = catx(';', of temp[*]) ;
  drop index ;
run;

Obs    email_list              email_1                 email_2

 1     TESTING@NOWHERE.ORG     TESTING@NOWHERE.ORG
 2     TESTING1@NOWHERE.ORG                            .
 3     TESTING2@NOWHERE.ORG    TESTING2@NOWHERE.ORG    TESTING2@NOWHERE.ORG
 4     TESTING3@NOWHERE.ORG    TESTING3@NOWHERE.ORG

Obs    email_3                 email_4                 email_5

 1     TESTING@NOWHERE.ORG
 2     TESTING1@NOWHERE.ORG    .
 3                             TESTING2@NOWHERE.ORG
 4     .                       TESTING3@NOWHERE.ORG    TESTING3@NOWHERE.ORG

Or into multiple observations.


data want;
  length email_number 8 email $100 ;
  set have ;
  array emails email_1-email_5;
  array temp [5] $100 _temporary_;
  /* clear the retained temporary array for each input row */
  call missing(of temp[*]);
  do index=1 to dim(emails);
    if compress(emails[index],,'kad') ne ' ' then do;
      if emails[index] not in temp then temp[index]=emails[index];
    end;
  end;
  /* write one observation per unique email found in the row */
  email_number=0;
  do index=1 to dim(temp);
    if temp[index] ne ' ' then do;
      email_number + 1;
      email = temp[index];
      output;
    end;
  end;
  drop index;
run;

Result

Obs  email_number  email                 email_1               email_2               email_3               email_4               email_5

 1        1        TESTING@NOWHERE.ORG   TESTING@NOWHERE.ORG                         TESTING@NOWHERE.ORG
 2        1        TESTING1@NOWHERE.ORG                        .                     TESTING1@NOWHERE.ORG  .
 3        1        TESTING2@NOWHERE.ORG  TESTING2@NOWHERE.ORG  TESTING2@NOWHERE.ORG                        TESTING2@NOWHERE.ORG
 4        1        TESTING3@NOWHERE.ORG  TESTING3@NOWHERE.ORG                        .                     TESTING3@NOWHERE.ORG  TESTING3@NOWHERE.ORG

Scorpx · Posted 05-16-2023 04:41 PM

Dear @Tom ,

Thank you for your response. Since I seem to be in error, please allow me to explain myself again.

1.- I have five columns of emails (Email_1 - Email_5); they are capitalized to avoid the Joe or JOE or JoE, etc.

2.- Each row has the same email or none (because it is the same respondent/case, but a different survey); no column has every email.

3.- I want to have a column named "email" that collects each email to avoid having five different "Email_1, Email_2, Email_3..." so, instead of having 10,000 emails spread and repeated over five variables, I want to have just one variable with the 10,000 emails.

4.- The output has this "email" column summarizing the emails.

5.- Please think of the dots as blanks. (This one is my bad, apologies.)

I hope I have given you a better description this time, I appreciate the help you are providing.

Regards.
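
Points 2 and 5 above can be sketched as follows, under the assumption that some dots might be stored literally as '.' characters rather than as blanks. The dataset names have and want and the flag variable has_conflict are illustrative, not from the thread. The step blanks out lone dots, keeps the first real email, and flags any row that breaks the "same email or none" rule so it can be reviewed:

```sas
data want;
  set have;
  array src [*] email_1-email_5;
  length email $100;
  email = ' ';
  has_conflict = 0;
  do i = 1 to dim(src);
    /* treat a lone '.' as a blank, per point 5 */
    if strip(src[i]) = '.' then call missing(src[i]);
    if not missing(src[i]) then do;
      /* the first real value becomes the result */
      if missing(email) then email = strip(src[i]);
      /* a later value that differs from it marks the row for review */
      else if upcase(strip(src[i])) ne upcase(email) then has_conflict = 1;
    end;
  end;
  drop i;
run;
```

Note that this sketch also overwrites the '.' placeholders in email_1-email_5 with blanks; if the source columns must stay untouched, the comparison can be done without the CALL MISSING line.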
