Solved: finding duplicates

GreggB · Posted 01-03-2012 02:15 PM

I have data as shown below. Instructors have a unique email address but may have multiple IDs if they serve more than one school. The file I'm submitting allows only one ID per instructor. I want to find all cases such as hsmith@yahoo.com and give him the same ID in all obs. (It doesn't matter which one, as long as he has only one.) Right now I'm running a PROC FREQ with a TABLE statement of email*tchid/list and inspecting the output visually, which is not exactly an elegant solution.

email               tchid
jdoe@aol.com        01
sjones@hotmail.com  02
hsmith@yahoo.com    03
hsmith@yahoo.com    04

Tom · Posted 01-03-2012 02:20 PM

You could use SQL. For example you could decide to pick the smallest id for each email address.

(NOTE: are you sure that the different schools are using different ranges of id values? Otherwise you might have the same id for two different instructors.)

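proc sql;
  /* one row per email, keeping the smallest ID; GROUP BY already */
  /* returns one row per email, so DISTINCT is not needed         */
  create table uniqueid as
    select email, min(tchid) as tchid
    from multipleid
    group by email
  ;
quit;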
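
The query above produces one row per email. To carry the chosen ID back onto every observation, one option is to join that lookup table to the original data. This is only a minimal sketch: it assumes the input data set is named have with character variables email and tchid, as in the question; uniqueid and want are placeholder output names.

```sas
proc sql;
  /* lookup table: one row per email with the smallest existing ID */
  create table uniqueid as
    select email, min(tchid) as tchid
    from have
    group by email;

  /* attach the chosen ID to every original observation */
  create table want as
    select h.email, u.tchid
    from have as h
         inner join uniqueid as u
         on h.email = u.email
    order by h.email;
quit;
```

With the sample data, both hsmith@yahoo.com observations would end up with ID 03, since MIN on a character column returns the lowest value in sort order.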

art297 · Posted 01-03-2012 02:23 PM

Or, if you prefer a DATA step solution:

data have;
  informat email $40.;
  input email tchid $;
  cards;
jdoe@aol.com 01
sjones@hotmail.com 02
hsmith@yahoo.com 03
hsmith@yahoo.com 04
;

/* drop the old IDs and sort so BY-group processing works */
proc sort data=have (drop=tchid) out=want;
  by email;
run;

/* assign a new sequential ID each time a new email starts */
data want (drop=temp:);
  set want;
  by email;
  if first.email then tempid+1;
  tchid=put(tempid,z2.);
run;

That could, of course, be modified to retain some of the existing ids.

GreggB · Posted 01-03-2012 02:26 PM

This would assign new IDs, wouldn't it? I have to use the ones that are already in the student information system.

art297 · Posted 01-03-2012 02:32 PM

Easy to accommodate:

data have;
  informat email $40.;
  input email tchid $;
  cards;
jdoe@aol.com 01
sjones@hotmail.com 02
hsmith@yahoo.com 03
hsmith@yahoo.com 04
;

proc sort data=have out=want;
  by email;
run;

/* keep the first existing ID for each email and copy it to the rest */
data want (drop=hold:);
  set want;
  by email;
  retain holdid;
  if first.email then holdid=tchid;
  else tchid=holdid;
run;

GreggB · Posted 01-03-2012 02:27 PM

Schools are using different ranges. The IDs are assigned at a central location to ensure there's no overlap.

Linlin · Posted 01-03-2012 02:32 PM

You can get what you want with:

proc sort data=have out=want nodupkey;
  by email;
run;

Linlin
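
Note that NODUPKEY keeps only the first observation for each email and discards the others, so want ends up with one row per instructor rather than all of the original observations. If it is also useful to see which instructors had extra IDs, PROC SORT's DUPOUT= option writes the discarded observations to a separate data set. A small sketch, assuming the same have data set as above; dups is a placeholder name:

```sas
/* keep the first observation per email; send the rest to dups */
proc sort data=have out=want nodupkey dupout=dups;
  by email;
run;

/* list the observations that were dropped as duplicates */
proc print data=dups;
run;
```

With the sample data, dups would hold the second hsmith@yahoo.com observation (ID 04).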
