03-07-2016 03:39 PM
The title might be a little misleading, but I don't know how to word it correctly.
I'm working with higher education data on students majors. But due to error in the system, some students will show that they have the multiple majors in the same subjects,
e.g. Psychology BA
or dual major of the same subject but different levels
so I want a query to clean up those students to delete the Pre majors if they have a major of the same name
and retain BS if they have a BA of the same name
e.g. Psychology BA
what function, or proc should I use in this case.
Say I want to transform the data from
1 Psy BA
1 Pre Psy
1 Psy BS
2 Pre Social
1 Psy BS
Thanks everyon... this forum saves people's lives.....lol
03-07-2016 04:07 PM
Do you have to deal with not quite as closely related majors, such a Biology and PSY BS? Is the goal to reduce to a single major per studend or single-within-related?
By any chance is there a date associated with this data? If so, would a more recent date tend to be more accurate?
03-07-2016 04:21 PM
1) no I don't have to deal with non-related majors as the example you give.
2) No my goal is NOT to reduce to a single major.... my goal is to clean up the mistake registra has made.
When they activate a full major from pre-major, the pre-major is not being delete, or switching from BA to BS.
3) No date.... I wish I had though.
03-07-2016 06:12 PM
For your example this works.
data temp; set have; if upcase (scan(major, 1)) = 'PRE' then order=1; else if upcase (scan(major,-1)) = 'BA' then order=3; else if upcase (scan(major,-1)) = 'BS' then order=4; else order= 2; run; proc sort data=temp; by student_id order; run; data want; set temp; by student_id order; if last.student_id; drop order; run;
Some potential data issues that may mess with this: If PRE is appearing as "Pre-Med", if so modify the scan to include - in delimiters as scan(major,1,' -').
If BS or BA ever end up with Bachelor, or B.S. or B.A. instead of BS BA
MA and MS are not included but the pattern may be useable to modify the ORDER assignments by adding higher levels
03-07-2016 06:23 PM
while this works for the most part, it's eliminating all students with multiple pre-majors
student id major
1 pre history
1 pre business
1 pre psychology
I'm seeing all three were ordered to "1", and the output dataset only contains "pre psychology", due to the "if last student id "
However, as you can see, this student was not part of the error, he indeed has 3 pre-majors.
Whereas I want it to only clease people with major and pre-major of the SAME subject, or BA BS of the same subject
03-07-2016 06:53 PM
That's kind of why I asked about other majors.
It may be that you separate the data into two sets, one of majors you are concerned with and the other everything else.
data Concern TheRest;
if major in ('Pre Psy' 'Psy' 'Psy BA' 'Psy BS' 'Social' 'Pre Social' etc) then output Concern;
else output TheRest;
<the code above on Concern>
Are there more rules about majors involved yet?
Use the supplied code on the concerned set
Recombine the data.
03-08-2016 10:44 AM
I appreciate you taking the time to help, but that is not a solution.
For one I have a list of major that is too long to type in mannually.
Seondly, I can't even separate the data of concern even if I can type them all manually.
Because you see, not all the errors consist of clean forms of only major and pre-majors.
there are also combo cases like this
Namely, a student with errors on their psy major but he/she is also a legit pre-business, history, psychology triple major.
Your code won't work to clean off the pre-psy in this case.
03-08-2016 01:25 PM
Do you have a data set, or can you make one, with the problematic majors? If so, that can be used to extract the majors with problems.
This is starting sound like you may need to make something that provides a "with in group" list of majors and the faux-pas and process each one separately, ie the Psy related enteries, then the Social related entries. Which will entail significantly different logic. So, just how many actual "majors" are there that may have these errors? And do you have any idea how many students may be effected?
I also note that your are now spelling the majors differently, "pre-ps" vs "Pre Psy", than in the original post. Does this mean that there may also be differences is spelling to consider.
03-08-2016 03:23 PM
No, I have no idea how many people are like this, and
no, I don't know how to make a dataset with JUST the problem ones....
and again NO, this can affects students of any major, isn't restricted to a particular subject..
The spelling "ps:, is a typo..
I don't know how else to clarify this. the example that will entail all the problem will be this kind of student
1 Psychology BA
1 Psychology BS
1 Pre Psychology
1 Pre History
When in reality this student was a Psychology BS, Pre History, and Business triple major. But when during his academic career when he changes major from pre major to a BA the pre major wasn't deleted, and when he changes from a BA to BS, BA wasn't deleted.. And I DON't have a Date related to when those changes occurs.
Eventually I want the correct result to be this
1 Psychology BS
1 Pre History
03-08-2016 10:47 AM
In other words, I need a proc that can scan their major filed and identify the observarions by ID that is the same word with a "pre" prefix or mismatched suffix "BA" or "BS".
Is there no way to do this?
03-08-2016 05:15 PM
I am include some code that given a dataset with the base names of the majors and the majors with prefixes or suffixes as listed seems to do the reduction requested.
As noted before any pattern other than those explicitly described will require additional coding such as MA or MS or PHD. This type of change should be relatively easy to to. If the major itself has spelling issues Psy vs Psych for instance, I'd wash my hands of the whole affair. If you have other prefixes than "Pre-" you fix.
This uses Call Execute which ineffect stacks up code to execute. I would suggest testing on a subset of 4 or 5 majors in the MajorList data set to see if things are working.
This is not exactly a trivial exercise in string processing. I am sure there are other approaches but imposing a relatively arbitrary heirarchy such as BS > BA isn't generally simple.
/* a list of the base value of majors*/ data majorlist; informat major $25.; input major &; major = upcase(major); datalines; Psychology Psychiatry history Sociology Mathematics Literature Art History Physics ; run; /*an example data set with some mixes of Pre-, major, BA BS*/ data majors; informat Student_id best4. major $15.; input Student_id Major &; datalines; 1 Pre-Literature 1 Literature 1 Literature BA 1 Physics 2 Sociology 3 Pre-Literature 3 Literature 3 Literature BA 3 Psychiatry 3 Mathematics 4 Physics 5 Pre-Sociology 5 Sociology 6 Mathematics 6 Mathematics BA 7 Psychology 7 Art History 7 Mathematics 8 Pre-Physics 8 Physics 8 Physics BA 9 Pre-History 9 Pre-Sociology 9 Sociology 9 Pre-Psychology 9 Psychology 9 Psychology BA 10 Art History 11 Physics 12 History 12 History BA 12 Psychology 13 Pre-Art History 13 Art History 13 Art History BA 14 Literature 14 Literature BA 14 Literature BS 14 Pre-Sociology 14 Sociology 15 Pre-Literature 15 Literature 15 Literature BA 16 Pre-Psychology 16 Psychology 17 History 17 History BA 18 Sociology 19 Pre-Literature 19 Literature 19 Literature BA 20 Physics ; run; /*builds subsets based on the "stem" of the major*/ data _null_; set majorlist; setNum +1; tname = cats("_xm",SetNum); call execute("data "||tname||";"); call execute(" set majors;"); wstr = catx(' '," where upcase(major) in (",Catq('2at',cats("PRE-",upcase(major)),upcase(major),Catx(' ',upcase(major),'BA'),Catx(' ',upcase(major),'BS')),");"); call execute (wstr); /* add the "order" variable*/ call execute (" if upcase (scan(major, 1,'-')) = 'PRE' then order=1;"); call execute (" else if upcase (scan(major,-1)) = 'BA' then order=3;"); call execute (" else if upcase (scan(major,-1)) = 'BS' then order=4;"); call execute (" else order= 2;"); call execute ("run;"); /* sort the temporary data set*/ call execute ("proc sort data="||tname||";"); call execute (" by student_id order;"); call execute ("run;"); /* get to last ordered from the major set normally I don't use the data setname; set setname; but this is to reduce the number of datasets */ call execute ("data "||tname||";"); call execute (" set "||tname||";"); call execute (" by student_id order;"); call execute (" if last.student_id;"); call execute (" drop order;"); call execute ("run;"); run; /* at this point we should have a bunch of temporary files name _xm1 _xm2 combine them and then sort by student_id */ data newmajors; set _xm: ; run; proc sort data=newmajors; by student_id major; run;
I included my own example data set of student an major that matches the majors in the matching MajorList dataset. If your processing misses any major in your student data for any reason (spelling is a likely source) then those records in the student data will be missing in the final data set I called newmajors. I have not tested what happens if there is a major in the majorlist set without any majors in the student data but I suspect you'll be okay (unless the reason is a misspelling in the majorlist...)
03-14-2016 04:46 PM
Thanks a bunch for this code.
It seems to be working, but not quite there.
I ran it with my data and it's still throwing out people.
I think maybe it's because there are BFA, BMU type of degrees also.
Here I have attached the data I have, you can take a look what I mean.
If you can help me to the end I'll appreciate it eternally. This is really some important to what i'm doing.
For the list of majors, I simply did a proc freq(or Proc SQL) to see a list of all majors and then pasted to your datalines step, after removing the suffix of BA, BS, BMU... etc.