06-12-2017 07:42 PM
The following is a simplified example of the structure of my data:
STUDENT_ID SCHOOL_ID_2009 SCHOOL_ID_2010
001 002 002
002 002 002
003 002 002
004 002 004
005 003 005
006 003 005
007 003 005
008 003 006
In my real data, I have about 8 million students and about 30 thousand schools. I want to identify students trajectories across time. However, there are some problems with school id codes. Some schools change their ID for administrative reasons (e.g. merges and acquisitions, changes of name, owner, etc.). In the data example I provided above, I can find the school that has the 2009 ID "002" in the 2010 dataset, although one of its students moved to school "004". However, the school with the 2009 ID "003" is not found in the 2010 dataset. In my example, the most likely new ID for this school is "005", since 3/4 of the students of school "003" moved to school "005".
Basically, my problem is: how to identify what are the most likely new IDs for schools that have disappeared from one year to another? I thought that I could calculate what is the percentage of students from each one of the 2009's school that appears in each one of 2010's schools. Do anyone have an idea of how to do this? I need to identify which students are really changing schools and which students are kept in the same school, but the school changed its ID. In cases that are likely that the school had changed its ID, I will assign the same school ID in both years.
If anyone have a solution for this problem, specially in PROC SQL, I will be really thankful.
I am using SAS Enterprise Guide version 4.2
Thank you everyone,
06-12-2017 08:30 PM
Sorry if it was confusing. I will try to explain it step by step:
1. I had originally two datasets, one for 2009 and another for 2010. Both of them provided students' and schools' IDs.
2. When I join both datasets on schools ID, I noticed that some of the schools had disappeared from 2009 to 2010. However, when I join the datasets on student ID, I noticed that in reality some schools just have changed their IDs from one year to another. Some of the schools even have the same name, the same location, the same students, but different IDs. The data example I provided in the previous post is already the result of merging both tables by student ID.
3. Then it comes the problem I presented above. How to identify schools that are the same but have different IDs in two different years? I basically want to "correct" the school ID variable across years. One way I thought I could do this is by comparing their students. However, I have over 30,000 schools and 8 million students, so I need to automate this solution.
4. I have more information (variables) than I presented. My idea is to exclude all students that had graduated, died or droped out in the middle of 2009 (I do have this information), and then check if two schools with different IDs (in two different years) have the same students. I thought about setting the thresholds of 5 and of 80%: that is, if a school in 2009 have more than 5 students and more than 80% of these students are in only one school with a different ID in 2010 that is because these schools are the same.
06-12-2017 09:09 PM
Below code can help you to understand which student moved. unfortunaley I could not understand
I basically want to "correct" the school ID variable across years means. for your 4th point, if you show some data assigning a threshhold value should not be diffcult.
/*finding number of students to moved from one school to other*/ proc sql; create table numberofrecords as select * from have a where exists(select * from have b where a.student_id =b.student_id and a.school_id_2009 ^= b.school_id_2010); quit;
06-13-2017 08:50 AM
Hi kiranv_, thank you very much for your suggestion. I will try it today and post a feedback later.
The idea of "correcting" IDs is perhaps better explained in my reply to Reeza bellow. The roots of the problem I am facing is that sometimes the same school receives different IDs from one year to another. That is why I am trying to "correct" them, that is, assign the same ID to the same school across years by comparing their students.
06-12-2017 09:19 PM
Assuming you are dealing with two years only, create A new variable:
SCHOOLS = CATX('-',SCHOOL_ID_2009, SCHOOL_ID_2010);
and a FLAG = 1 where school codes remain the same, otherwise FLAG=0
then compute frequencies of SCHOOL_ID_2009 * SCHOOLS where FLAG=0 (school code changed);
I hope that will give you a clue to solution.
06-12-2017 09:49 PM
This isn't a problem you should solve with a program. You should be able to ask someone about the data and why it's in this format.
Did the school change ID's or did a whole bunch of students move from a Elementary to Jr High school?
Differentiating those is going to be difficult. Or maybe school boundaries were redrawn and a bunch of kids had to change schools?
I would suggest getting some other information that would allow you to make these assumptions and mapping, either a school location data set, each ID maps to the location and if the location is the same, the school is the same. Or I would assume that someone has the mapping ID's, this is something that's usually tracked.
Knowing when to use data vs email to solve a problem is a good idea, this is better solved via email.
You'll have no trust in the results otherwise.
06-13-2017 08:46 AM - edited 06-13-2017 08:58 AM
Hi Reeza, thank you for the advice. But in fact I already did it. I had talked in person many times with the team responsible for the data. And they told me that the "school ID problem" is basically generated because schools do have many IDs in the Ministry of Education. This problem has a long origin, when a new ID was assigned for the same school every time it took part in a new policy of the Ministry. However, this problem still reproducing itself because every year, when filling the school census, the schools still have many ID options to choose.
Another issue is that I am comparing only students within the same level of education. So they did not change for another school level.
But it is true that sometimes the school might have closed and their students spread across different schools. That is why I thought about the 5 and 80% threshold. If a school with the ID "005" in 2010 have at least 5 and 80% of the students a school with the ID "003" in 2009, that is because they are probably the same school, or at least I can consider them as if they were the same, for the purpose of analysing students trajectory. Well, that is the idea that I had.
06-13-2017 04:19 PM
I certainly sympothize with your school information. I only get to deal with about 800 or so but have to check every year for changed identifiers, grades of attendance and status (charter or not). It does not help at all that our Deptartment of Education has 3 different ID schemes and will keep one value in one scheme while changing it in one of the others.
You mention "level of education" or similar. I would be careful of assumptions on that as I had some funky things in my data until I found that some schools had split to only provide grades 1-3 droppin the traditional associate 4-6 grades. Others split at grade 4.
And then the ever popular keep the same school name but change the identifiers, location and grades served at the same time.
I try to start with the annual enrollment-by-building information for each year and look for the changes. With identified elements I can often ask a question that is targetted to someone in the school district such as "did School X become School Y" or "It looks like School Y may have dropped grade x. is that correct?"
This could be more of a GIS question as school service boundaries also change. So if you have the student address, age and the service area maps then a better answer may be possible.
06-13-2017 04:40 PM
'Many IDs to choose' what do you mean by this. I really do think the dept does have this data, in a database somewhere. Finding the person who has it is probably difficult though.
Can you have duplicate IDs? Ie is 0003 in system 1 guaranteed to be the same as 0003 in system 2? I suspect that's also a problem you may run into with this analysis.