DATA Step, Macro, Functions and more

search match character on list for false duplicates

Reply
Contributor
Posts: 25

search match character on list for false duplicates

The title might be a little misleading, but I don't know how to word it correctly.

 

I'm working with higher education data on students majors. But due to error in the system, some students will show that they have the multiple majors in the same subjects,

e.g. Psychology BA
Psychology BS

 

or dual major of the same subject but different levels

 

e.g. Psychology
Pre Psychology

 

so I want a query to clean up those students to delete the Pre majors if they have a major of the same name

e.g. Psychology
Pre Psychology

and retain BS if they have a BA of the same name

e.g. Psychology BA
Psychology BS.

 

what function, or proc should I use in this case. 

Say I want to transform the data from 

 

student_id           Major

1                           Psy BA

1                         Pre Psy

1                          Psy BS

2                             Social

2                             Pre Social

 

to

student_id           Major

 

1                          Psy BS

2                             Social

 

Thanks everyon... this forum saves people's lives.....lol

 

Super User
Posts: 11,343

Re: search match character on list for false duplicates

Do you have to deal with not quite as closely related majors, such a Biology and PSY BS? Is the goal to reduce to a single major per studend or single-within-related?

 

 

By any chance is there a date associated with this data? If so, would a more recent date tend to be more accurate?

Contributor
Posts: 25

Re: search match character on list for false duplicates

1) no I don't have to deal with non-related majors as the example you give.

2) No my goal is NOT to reduce to a single major.... my goal is to clean up the mistake registra has made. 

  When they activate a full major from pre-major, the pre-major is not being delete, or switching from BA to BS. 

3) No date.... I wish I had though.

Super User
Posts: 11,343

Re: search match character on list for false duplicates

For your example this works.

data temp;
   set have;
   if upcase (scan(major, 1)) = 'PRE' then order=1;
   else if upcase (scan(major,-1)) = 'BA'  then order=3;
   else if upcase (scan(major,-1)) = 'BS'  then order=4;
   else order= 2;
run;

proc sort data=temp;
   by student_id order;
run;

data want;
   set temp;
   by student_id order;
   if last.student_id;
   drop order;
run; 

Some potential data issues that may mess with this: If PRE is appearing as "Pre-Med", if so modify the scan to include - in delimiters as scan(major,1,' -').

 

If BS or BA ever end up with Bachelor, or B.S. or B.A. instead of BS BA

MA and MS are not included but the pattern may be useable to modify the ORDER assignments by adding higher levels

Contributor
Posts: 25

Re: search match character on list for false duplicates

 

thank you.

while this works for the most part, it's eliminating all students with multiple pre-majors 

 

e.g. 

 

student id   major

1                 pre history

1                 pre business

1                 pre psychology

 

I'm seeing all three were ordered to "1", and the output dataset only contains "pre psychology", due to the "if last student id "

 

However, as you can see, this student was not part of the error, he indeed has 3 pre-majors. 

Whereas I want it to only clease people with major and pre-major of the SAME subject, or BA BS of the same subject

Super User
Posts: 11,343

Re: search match character on list for false duplicates

That's kind of why I asked about other majors.

 

It may be that you separate the data into two sets, one of majors you are concerned with and the other everything else.

Something like:

 

data Concern TheRest;

   Set have;

   if major in ('Pre Psy' 'Psy' 'Psy BA' 'Psy BS' 'Social' 'Pre Social' etc) then output Concern;

   else output TheRest;

run;

<the code above on Concern>

 

data Combined;

   Set TheRest

          Want

   ;

run;

 

Are there more rules about majors involved yet?

Use the supplied code on the concerned set

Recombine the data.

Contributor
Posts: 25

Re: search match character on list for false duplicates

I appreciate you taking the time to help, but that is not a solution. 

For one I have a list of major that is too long to type in mannually. 

Seondly, I can't even separate the data of concern even if I can type them all manually. 

 

Because you see, not all the errors consist of clean forms of only major and pre-majors.

like this 

student_id             major

1                             psy

1                             pre-psy

 

there are also combo cases like this

 

student_id             major

1                             psy

1                             pre-psy

1                             pre-business

1                             history

 

Namely, a student with errors on their psy major but he/she is also a legit pre-business, history, psychology triple major. 

Your code won't work to clean off the pre-psy in this case.

 

Super User
Posts: 11,343

Re: search match character on list for false duplicates

Do you have a data set, or can you make one, with the problematic majors? If so, that can be used to extract the majors with problems.

This is starting sound like you may need to make something that provides a "with in group" list of majors and the faux-pas and process each one separately, ie the Psy related enteries, then the Social related entries. Which will entail significantly different logic. So, just how many actual "majors" are there that may have these errors? And do you have any idea how many students may be effected?

 

I also note that your are now spelling the majors differently, "pre-ps" vs "Pre Psy", than in the original post. Does this mean that there may also be differences is spelling to consider.

Contributor
Posts: 25

Re: search match character on list for false duplicates

No, I have no idea how many people are like this, and

no, I don't know how to make a dataset with JUST the problem ones....

and again NO, this can affects students of any major, isn't restricted to a particular subject..

 

The spelling "ps:, is a typo..

 

I don't know how else to clarify this.  the example that will entail all the problem will be this kind of student

 

id    Major

1     Psychology BA

1     Psychology BS

1     Pre Psychology

1     Pre History

1     Business

 

When in reality this student was a Psychology BS, Pre History, and Business triple major. But when during his academic career when he changes major from pre major to a BA the pre major wasn't deleted, and when he changes from a BA to BS, BA wasn't deleted.. And I DON't have a Date related to when those changes occurs. 

 

Eventually I want the correct result to be this

 

id    Major

1     Psychology BS

1     Pre History

1     Business

Contributor
Posts: 25

Re: search match character on list for false duplicates

In other words, I need a proc that can scan their major filed and identify the observarions by ID that is the same word with a "pre" prefix or mismatched suffix "BA" or "BS".

 

Is there no way to do this? 

 

 

Super User
Posts: 11,343

Re: search match character on list for false duplicates

I am include some code that given a dataset with the base names of the majors and the majors with prefixes or suffixes as listed seems to do the reduction requested.

As noted before any pattern other than those explicitly described will require additional coding such as MA or MS or PHD. This type of change should be relatively easy to to. If the major itself has spelling issues Psy vs Psych for instance, I'd wash my hands of the whole affair. If you have other prefixes than "Pre-" you fix.

This uses Call Execute which ineffect stacks up code to execute. I would suggest testing on a subset of 4 or 5 majors in the MajorList data set to see if things are working.

This is not exactly a trivial exercise in string processing. I am sure there are other approaches but imposing a relatively arbitrary heirarchy such as BS > BA isn't generally simple.

/* a list of the base value of majors*/
data majorlist;
   informat major $25.;
   input major &;
   major = upcase(major);
datalines;
Psychology
Psychiatry
history
Sociology
Mathematics
Literature
Art History
Physics
;
run;

/*an example data set with some mixes of Pre-, major, BA BS*/
data majors;
   informat Student_id best4. major $15.;
   input Student_id Major &;
datalines;
1 Pre-Literature 
1 Literature 
1 Literature BA 
1 Physics 
2 Sociology 
3 Pre-Literature 
3 Literature 
3 Literature BA 
3 Psychiatry 
3 Mathematics 
4 Physics 
5 Pre-Sociology 
5 Sociology 
6 Mathematics 
6 Mathematics BA 
7 Psychology 
7 Art History 
7 Mathematics 
8 Pre-Physics 
8 Physics 
8 Physics BA 
9 Pre-History 
9 Pre-Sociology 
9 Sociology 
9 Pre-Psychology 
9 Psychology 
9 Psychology BA 
10 Art History 
11 Physics 
12 History 
12 History BA 
12 Psychology 
13 Pre-Art History 
13 Art History 
13 Art History BA 
14 Literature 
14 Literature BA 
14 Literature BS 
14 Pre-Sociology 
14 Sociology 
15 Pre-Literature 
15 Literature 
15 Literature BA 
16 Pre-Psychology 
16 Psychology 
17 History 
17 History BA 
18 Sociology 
19 Pre-Literature 
19 Literature 
19 Literature BA 
20 Physics 
;
run;
 
/*builds subsets based on the "stem" of the major*/ 
data _null_;
   set majorlist;
   setNum +1;
   tname = cats("_xm",SetNum);
   call execute("data "||tname||";");
   call execute("   set majors;");
   wstr = catx(' ',"   where upcase(major) in (",Catq('2at',cats("PRE-",upcase(major)),upcase(major),Catx(' ',upcase(major),'BA'),Catx(' ',upcase(major),'BS')),");");
   call execute (wstr);
   /* add the "order" variable*/
   call execute ("   if upcase (scan(major, 1,'-')) = 'PRE' then order=1;");
   call execute ("   else if upcase (scan(major,-1)) = 'BA'  then order=3;");
   call execute ("   else if upcase (scan(major,-1)) = 'BS'  then order=4;");
   call execute ("   else order= 2;");
   call execute ("run;");
   /* sort the temporary data set*/
   call execute ("proc sort data="||tname||";");
   call execute ("   by student_id order;");
   call execute ("run;");
   /* get to last ordered from the major set
      normally I don't use the data setname; set setname; but this is to reduce
      the number of datasets
   */
   call execute ("data "||tname||";");
   call execute ("   set "||tname||";");
   call execute ("   by student_id order;");
   call execute ("   if last.student_id;");
   call execute ("   drop order;");
   call execute ("run;"); 
run;
/* at this point we should have a bunch of temporary files name _xm1 _xm2
   combine them and then sort by student_id
*/
data newmajors;
   set _xm: ;
run;

proc sort data=newmajors;
   by student_id major;
run;

I included my own example data set of student an major that matches the majors in the matching MajorList dataset. If your processing misses any major in your student data for any reason (spelling is a likely source) then those records in the student data will be missing in the final data set I called newmajors. I have not tested what happens if there is a major in the majorlist set without any majors in the student data but I suspect you'll be okay (unless the reason is a misspelling in the majorlist...)

 

Contributor
Posts: 25

Re: search match character on list for false duplicates

Thanks a bunch for this code.

 

It seems to be working, but not quite there. 

I ran it with my data and it's still throwing out people. 

 

I think maybe it's because there are BFA, BMU type of degrees also. 

 

Here I have attached the data I have, you can take a look what I mean. 

 

If you can help me to the end I'll appreciate it eternally. This is really some important to what i'm doing. 

 

For the list of majors, I simply did a proc freq(or Proc SQL) to see a list of all majors and then pasted to your datalines step, after removing the suffix of BA, BS, BMU... etc.

 

 

Ask a Question
Discussion stats
  • 11 replies
  • 391 views
  • 0 likes
  • 2 in conversation