Help using Base SAS procedures

finding duplicates

Accepted Solution Solved
Reply
Super Contributor
Posts: 268
Accepted Solution

finding duplicates

i have data as seen below.  Instructors have a unique email address but may have multiple IDs if they serve more than one school.  The restriction on the file I'm submitting is only one ID per instructor.  I want to find all cases such as hsmith@yahoo.com and give him the same ID in all obs.  (It doesn't matter which one as long as he only has one.)  Right now, I'm running a PROC FREQ with a TABLE statement of email*tchid/list and doing a visual inspection (not exactly an elegant solution.)

email                            tchid

jdoe@aol.com                 01

sjones@hotmail.com       02

hsmith@yahoo.com         03

hsmith@yahoo.com         04


Accepted Solutions
Solution
‎01-03-2012 02:20 PM
Super User
Super User
Posts: 6,502

Re: finding duplicates

You could use SQL.  For example you could decide to pick the smallest id for each email address.

(NOTE: are you sure that the different schools are using different ranges of id values?  Otherwise you might have the same id for two different instructors.)

proc sql ;

  create table uniqueid as

   select distinct email, min(tid) as tid

   from mutlipleid

   group by email

  ;

quit;

View solution in original post


All Replies
Solution
‎01-03-2012 02:20 PM
Super User
Super User
Posts: 6,502

Re: finding duplicates

You could use SQL.  For example you could decide to pick the smallest id for each email address.

(NOTE: are you sure that the different schools are using different ranges of id values?  Otherwise you might have the same id for two different instructors.)

proc sql ;

  create table uniqueid as

   select distinct email, min(tid) as tid

   from mutlipleid

   group by email

  ;

quit;

PROC Star
Posts: 7,366

Re: finding duplicates

or, if you prefer a datastep solution:

data have;

  informat email $40.;

  input email tchid $;

  cards;

jdoe@aol.com             01

sjones@hotmail.com       02

hsmith@yahoo.com         03

hsmith@yahoo.com         04

;

proc sort data=have (drop=tchid) out=want;

  by email;

run;

data want (drop=tempSmiley Happy;

  set want;

  by email;

  if first.email then tempid+1;

  tchid=put(tempid,z2.);

run;

That could, of course, be modified to retain some of the existing ids.

Super Contributor
Posts: 268

finding duplicates

this would assign new IDs, wouldn't it?  I have to use the ones that are already in the student information system.

PROC Star
Posts: 7,366

finding duplicates

Easy to accomodate:

data have;

  informat email $40.;

  input email tchid $;

  cards;

jdoe@aol.com             01

sjones@hotmail.com       02

hsmith@yahoo.com         03

hsmith@yahoo.com         04

;

proc sort data=have out=want;

  by email;

run;

data want (drop=holdSmiley Happy;

  set want;

  by email;

  retain holdid;

  if first.email then holdid=tchid;

  else tchid=holdid;

run;

Super Contributor
Posts: 268

finding duplicates

schools are using different ranges.  they are assigned at a central location to ensure there's no overlap.

Super Contributor
Posts: 1,636

Re: finding duplicates

you can get what you want by:

proc sort data=have out=want nodupkey;

by email;

run;

Linlin

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 190 views
  • 0 likes
  • 4 in conversation