About luizzalaf

luizzalaf · 06-13-2017

Hi kiranv_, thank you very much for your suggestion. I will try it today and post feedback later. The idea of "correcting" IDs is perhaps better explained in my reply to Reeza below. The root of the problem is that the same school sometimes receives different IDs from one year to the next. That is why I am trying to "correct" them, that is, to assign the same ID to the same school across years by comparing their students.

luizzalaf · 06-13-2017

Hi Shmuel, thank you for your suggestion. I will try it today and post feedback later.

luizzalaf · 06-13-2017

Hi Reeza, thank you for the advice, but in fact I have already done that. I have talked in person many times with the team responsible for the data. They told me that the "school ID problem" arises mainly because schools have many IDs in the Ministry of Education. The problem goes back a long way: a new ID used to be assigned to the same school every time it took part in a new Ministry policy. It keeps reproducing itself because every year, when filling in the school census, schools still have many IDs to choose from.

Another point is that I am only comparing students within the same level of education, so they did not move to a school at another level. But it is true that sometimes a school may have closed and its students spread across different schools. That is why I thought about the thresholds of 5 and 80%. If the school with ID "005" in 2010 has at least 5, and at least 80%, of the students of the school with ID "003" in 2009, they are probably the same school, or at least I can treat them as the same for the purpose of analysing student trajectories. Well, that is the idea I had.

luizzalaf · 06-12-2017

Sorry if it was confusing. I will try to explain it step by step:

1. I originally had two datasets, one for 2009 and another for 2010. Both provide student and school IDs.

2. When I joined the two datasets on school ID, I noticed that some schools had disappeared from 2009 to 2010. However, when I joined them on student ID, I noticed that in reality some schools had simply changed their IDs from one year to the next. Some of these schools even have the same name, the same location and the same students, but different IDs. The data example I provided in the previous post is already the result of merging both tables by student ID.

3. Then comes the problem I presented above: how do I identify schools that are the same but have different IDs in two different years? I basically want to "correct" the school ID variable across years. One way I thought I could do this is by comparing their students. However, I have over 30,000 schools and 8 million students, so I need to automate the solution.

4. I have more information (variables) than I presented. My idea is to exclude all students who graduated, died or dropped out in the middle of 2009 (I do have this information), and then check whether two schools with different IDs (in two different years) have the same students. I thought about setting thresholds of 5 and 80%: that is, if a school in 2009 has more than 5 students and more than 80% of these students are in a single school with a different ID in 2010, then these schools are the same.
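The threshold rule in step 4 could be sketched roughly as follows. This is only a hypothetical illustration in Python, not PROC SQL: the function name `likely_new_ids`, the `left_in_2009` set, the parameter names and the demo rows are all assumptions made for the example, and the "more than 5 students and more than 80%" comparison follows the wording of step 4.

```python
from collections import Counter, defaultdict

def likely_new_ids(rows, left_in_2009=frozenset(), min_students=5, min_share=0.80):
    """Map a 2009 school ID to its likely 2010 ID under the threshold rule.

    rows: iterable of (student_id, school_id_2009, school_id_2010).
    left_in_2009: student IDs to exclude (graduated, died or dropped out).
    """
    # For every 2009 school, count how many of its students ended up
    # in each 2010 school, skipping the excluded students.
    destinations = defaultdict(Counter)
    for student, s09, s10 in rows:
        if student not in left_in_2009:
            destinations[s09][s10] += 1

    mapping = {}
    for s09, counts in destinations.items():
        total = sum(counts.values())
        s10, n = counts.most_common(1)[0]  # most frequent 2010 school
        # More than min_students students, more than min_share of them in
        # one 2010 school, and that school carries a different ID.
        if total > min_students and n / total > min_share and s10 != s09:
            mapping[s09] = s10
    return mapping

# Made-up demo rows (not real data): 9 of 10 students of "A" are found in "B",
# while the students of "D" split 6/4 between "E" and "F".
demo_rows = (
    [(f"s{i}", "A", "B") for i in range(9)] + [("s9", "A", "C")]
    + [(f"t{i}", "D", "E") for i in range(6)]
    + [(f"t{i}", "D", "F") for i in range(6, 10)]
)
print(likely_new_ids(demo_rows))  # {'A': 'B'}
```

Under these assumptions only school "A" is mapped to a new ID; school "D" is left alone because no single 2010 school holds more than 80% of its students.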

luizzalaf · 06-12-2017

The following is a simplified example of the structure of my data:

STUDENT_ID  SCHOOL_ID_2009  SCHOOL_ID_2010
001         002             002
002         002             002
003         002             002
004         002             004
005         003             005
006         003             005
007         003             005
008         003             006

In my real data, I have about 8 million students and about 30 thousand schools. I want to identify student trajectories across time. However, there are some problems with the school ID codes. Some schools change their ID for administrative reasons (e.g. mergers and acquisitions, changes of name, owner, etc.).

In the data example above, I can find the school with the 2009 ID "002" in the 2010 dataset, although one of its students moved to school "004". However, the school with the 2009 ID "003" is not found in the 2010 dataset. In my example, the most likely new ID for this school is "005", since 3/4 of the students of school "003" moved to school "005".

Basically, my problem is: how do I identify the most likely new IDs for schools that have disappeared from one year to the next? I thought I could calculate the percentage of students from each 2009 school that appears in each 2010 school. Does anyone have an idea of how to do this?

I need to identify which students are really changing schools and which students stayed in the same school while the school changed its ID. In cases where it is likely that the school changed its ID, I will assign the same school ID in both years.

If anyone has a solution for this problem, especially in PROC SQL, I will be really thankful. I am using SAS Enterprise Guide version 4.2.

Thank you everyone,
Kind regards,
Luiz
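The percentage calculation described in this post can be illustrated with a small sketch. It is written in Python rather than PROC SQL and uses only the eight example rows; the function name `transfer_shares` and the tuple layout are assumptions made for the illustration, not a recommended implementation for the full 8 million rows.

```python
from collections import Counter

# Example rows from the post: (student_id, school_id_2009, school_id_2010)
rows = [
    ("001", "002", "002"),
    ("002", "002", "002"),
    ("003", "002", "002"),
    ("004", "002", "004"),
    ("005", "003", "005"),
    ("006", "003", "005"),
    ("007", "003", "005"),
    ("008", "003", "006"),
]

def transfer_shares(rows):
    """Return {(school_2009, school_2010): share of the 2009 school's students}."""
    # Students per (2009 school, 2010 school) pair.
    pair_counts = Counter((s09, s10) for _, s09, s10 in rows)
    # Students per 2009 school, used as the denominator.
    totals_2009 = Counter(s09 for _, s09, _ in rows)
    return {pair: n / totals_2009[pair[0]] for pair, n in pair_counts.items()}

shares = transfer_shares(rows)
for (s09, s10), share in sorted(shares.items()):
    print(s09, s10, f"{share:.0%}")
```

For the example data this gives a share of 75% for the pair ("003", "005"), matching the 3/4 mentioned in the post. The same counts could presumably be produced in PROC SQL by grouping on the two school ID columns, but that translation is not shown here.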

Date Last Visited	‎06-13-2017 09:33 AM

How do I identify members of one group within other groups?