I’m hoping someone can help with a data organization task. I have nine data sets, one for each year from 2002 through 2010. Each data set includes the same 13 variables. Each column contains a variable; each row represents a different student. These are very large files with many students.
The 13 variables include one unique ID (StdPseudoId) and 12 fixed effects (e.g. gender, date of birth, home language). I would like to create one data set that includes all students and their respective values across the 13 variables.
What makes it just a little complicated is that many students are represented across multiple files. For example, a student might be in the files for years 2002-2007, but not 2008-2010. Another student might in the files for years 2005 and 2006 only. Also, some students may have missing values due to data entry errors. For example, a student who is present in files 2002-2006 may be missing a gender value in 2002 and 2004, but be identified as female in 2003, 2005, and 2006. So, even though the values of any given variable do not change over the years (e.g. a student who is female in 2003 is also female in 2005), I have to get information from all nine files in order to create one comprehensive file.
So, what I have:
Nine data sets:
1) File name: Demog01_02
Variable list:
1 | StdPseudoId | Number |
2 | LangClsCode02 | Character |
3 | ELIdentDate02 | Character |
4 | StdELRedesigDate02 | Character |
5 | StdDob02 | Character |
6 | Gender02 | Character |
7 | Ethnicity02 | Character |
8 | HomeLang02 | Character |
9 | FRLunch02 | Character |
10 | ParentEduLevel02 | Character |
11 | US1stSchoolDate02 | Character |
12 | CA1stSchoolDate02 | Character |
13 | GradeFirstEnroll02 | Number |
2) File name: Demog02_03
Variable list:
1 | StdPseudoId | Number |
2 | LangClsCode03 | Character |
3 | ELIdentDate03 | Character |
4 | StdELRedesigDate03 | Character |
5 | StdDob03 | Character |
6 | Gender03 | Character |
7 | Ethnicity03 | Character |
8 | HomeLang03 | Character |
9 | FRLunch03 | Character |
10 | ParentEduLevel03 | Character |
11 | US1stSchoolDate03 | Character |
12 | CA1stSchoolDate03 | Character |
13 | GradeFirstEnroll03 | Number |
And the same structure for each data set, through 2010:
3) File name: Demog03_04
4) File name: Demog04_05
5) File name: Demog05_06
6) File name: Demog06_07
7) File name: Demog07_08
😎 File name: Demog08_09
9) File name: Demog09_10
What I want:
One data set with the same structure as the original data sets, but created with information from all nine data sets:
StdPseudoId | LangClsCode | ELIdentDate | StdELRedesigDate | StdDob | Gender | Ethnicity | HomeLang | FRLunch | ParentEduLevel | US1stSchoolDate | CA1stSchoolDate | GradeFirstEnroll |
100001235 | ||||||||||||
100002227 | ||||||||||||
100008854 | ||||||||||||
100022631 | ||||||||||||
100095453 | ||||||||||||
100506481 | ||||||||||||
148260057 | ||||||||||||
etc. |
Basically you just need to SET the data sets and rename the variables.
The INDSNAME option creates a variable, here named DataSource, which would contain the name of the input data set for each record. Remove if not needed.
data want;
set
Demog01_02 ( rename=( LangClsCode02 = LangClsCode ELIdentDate02 = ELIdentDate
StdELRedesigDate02 = StdELRedesigDate StdDob02 = StdDob
Gender02 = Gender Ethnicity02 = Ethnicity HomeLang02 = HomeLang
FRLunch02 = FRLunch ParentEduLevel02 = ParentEduLevel US1stSchoolDate02 = US1stSchoolDate
CA1stSchoolDate02 = CA1stSchoolDate GradeFirstEnroll02 = GradeFirstEnroll)
INDSNAME = DataSource)
Demog02_03 ( rename=( LangClsCode03 = LangClsCode ELIdentDate03 = ELIdentDate
StdELRedesigDate03 = StdELRedesigDate StdDob03 = StdDob
Gender03 = Gender Ethnicity03 = Ethnicity HomeLang03 = HomeLang
FRLunch03 = FRLunch ParentEduLevel03 = ParentEduLevel US1stSchoolDate03 = US1stSchoolDate
CA1stSchoolDate03 = CA1stSchoolDate GradeFirstEnroll03 = GradeFirstEnroll)
INDSNAME = DataSource)
/* repeat for each of the data sets, using the 04 05 etc variables with each dataset*/
; /* notice this is the end of the SET statement*/
run;
Thanks very much for your response. I am receiving an error:
Invalid option name INDSNAME.
And the resulting file, want, has not variables, no observations. Any thoughts?
thanks again.
I suspect you need to move the indsname = DataSource option to outside of the dataset option parenthesis and only use it once. indsname is a set statement option. Try something like this:
Set dataset1(rename = (..........))
dataset2(rename = (..........))
.
.
datasetn(rename = (...........)) indsname = DataSource
;
You might consider using KSharp's sql coding. The only downside to that is the need for a column within the individual datasets which can identify the year. You might have that (I did not do a review of the structure of your data). If that is not available then for each of the select statements in the sql, you would need to add field to which a constant is applied. Using KSharp's code, then something like:
proc sql;
create table want as
select LangClsCode02 as LangClsCode,
ELIdentDate02 as ELIdentDate,
StdELRedesigDate02 as StdELRedesigDate,
StdDob02 as StdDob,
Gender02 as Gender ,
Ethnicity02 as Ethnicity ,
HomeLang02 as HomeLang,
FRLunch02 as FRLunch,
ParentEduLevel02 as ParentEduLevel ,
US1stSchoolDate02 as US1stSchoolDate,
CA1stSchoolDate02 as CA1stSchoolDate,
GradeFirstEnroll02 as GradeFirstEnroll
,2002 as year
from Demog01_02
union
select *, 2003 as year from Demog02_03
union
select *,2004 as year from Demog03_04
union
..........
;
quit;
Ballard and Larry,
Thank you very much. When I moved INDSNAME = datasource to outside the dataset option parenthesis, the code ran without error. The resulting dataset had the correct variables. However, the resulting file contained multiple observations for each individual. So if an individual was included in the original datasets for 2001, 2002, and 2003, than that individual was repeated three times in the new dataset. An individual that was included in 8 of the 9 original datasets, was repeated eight times in the new dataset. Also, there was no variable ‘datasource.’
Again, I am trying to represent each individual only one time in the new dataset. Any thoughts?
Thanks again.
Ksharp and Larry,
Thank you for your help. The files do not contain individual columns that identify the year (please see my original post that describes the data), so I added the year in the select statement as you suggested. I received an error message. The code and log are below:
proc sql;
create table want2 as
select LangClsCode02 as LangClsCode,
ELIdentDate02 as ELIdentDate,
StdELRedesigDate02 as StdELRedesigDate,
StdDob02 as StdDob,
Gender02 as Gender,
Ethnicity02 as Ethnicity,
HomeLang02 as HomeLang,
FRLunch02 as FRLunch,
ParentEduLevel02 as ParentEduLevel,
US1stSchoolDate02 as US1stSchoolDate,
CA1stSchoolDate02 as CA1stSchoolDate,
GradeFirstEnroll02 as GradeFirstEnroll,
2002 as year
from temp01_02
union
select *, 2003 as year from temp02_03
union
select *, 2004 as year from temp03_04
union
select *, 2005 as year from temp04_05
union
select *, 2006 as year from temp05_06
union
select *, 2007 as year from temp06_07
union
select *, 2008 as year from temp07_08
union
select *, 2009 as year from temp08_09
union
select *, 2010 as year from temp09_10
union
;
quit;
Here is the log:
1668 proc sql;
1669
1670 create table want2 as
1671 select LangClsCode02 as LangClsCode,
1672 ELIdentDate02 as ELIdentDate,
1673 StdELRedesigDate02 as StdELRedesigDate,
1674 StdDob02 as StdDob,
1675 Gender02 as Gender,
1676 Ethnicity02 as Ethnicity,
1677 HomeLang02 as HomeLang,
1678 FRLunch02 as FRLunch,
1679 ParentEduLevel02 as ParentEduLevel,
1680 US1stSchoolDate02 as US1stSchoolDate,
1681 CA1stSchoolDate02 as CA1stSchoolDate,
1682 GradeFirstEnroll02 as GradeFirstEnroll,
1683 2002 as year
1684 from temp01_02
1685 union
1686 select *, 2003 as year from temp02_03
1687 union
1688 select *, 2004 as year from temp03_04
1689 union
1690 select *, 2005 as year from temp04_05
1691 union
1692 select *, 2006 as year from temp05_06
1693 union
1694 select *, 2007 as year from temp06_07
1695 union
1696 select *, 2008 as year from temp07_08
1697 union
1698 select *, 2009 as year from temp08_09
1699 union
1700 select *, 2010 as year from temp09_10
1701 union
1702 ;
-
22
76
ERROR 22-322: Syntax error, expecting one of the following: (, ALL, CORR, CORRESPONDING, SELECT.
ERROR 76-322: Syntax error, statement will be ignored.
1703 quit;
NOTE: The SAS System stopped processing this step because of errors.
Hi again Ballard and Larry,
My apologies! I realized I made a mistake that might have caused the problem I described in the response I just sent to you. The original nine datasets each include 3 additional variables that DO change with every year. They are: SchCode, SchName, and Grade. For each year, I want to keep these variables, because they change. I wasn't thinking about how those additional variables would cause the individuals to be represented more than once in the new dataset. Could you please help me modify your code to account for all the variables?
To clarify, the nine datasets look like this:
1) File name: Demog01_02
Variable list:
1 | StdPseudoId | Numeric |
2 | SchCode02 | Numeric |
3 | SchName02 | Character |
4 | Grade02 | Character |
5 | LangClsCode02 | Character |
6 | ELIdentDate02 | Character |
7 | StdELRedesigDate02 | Character |
8 | StdDob02 | Character |
9 | Gender02 | Character |
10 | Ethnicity02 | Character |
11 | HomeLang02 | Character |
12 | FRLunch02 | Character |
13 | ParentEduLevel02 | Character |
14 | US1stSchoolDate02 | Character |
15 | CA1stSchoolDate02 | Character |
16 | GradeFirstEnroll02 | Numeric |
2) File name: Demog02_03
Variable list:
1 | StdPseudoId | Numeric |
2 | SchCode03 | Numeric |
3 | SchName03 | Character |
4 | Grade03 | Character |
5 | LangClsCode03 | Character |
6 | ELIdentDate03 | Character |
7 | StdELRedesigDate03 | Character |
8 | StdDob03 | Character |
9 | Gender03 | Character |
10 | Ethnicity03 | Character |
11 | HomeLang03 | Character |
12 | FRLunch03 | Character |
13 | ParentEduLevel03 | Character |
14 | US1stSchoolDate03 | Character |
15 | CA1stSchoolDate03 | Character |
16 | GradeFirstEnroll03 | Numeric |
And the same structure for each data set, through 2010:
3) File name: Demog03_04
4) File name: Demog04_05
5) File name: Demog05_06
6) File name: Demog06_07
7) File name: Demog07_08
😎 File name: Demog08_09
9) File name: Demog09_10
What I want:
One data set with the same structure as the original data sets, but created with information from all nine data sets (keeping ALL SchCode, SchName, and Grade variables):
StdPseudoId | SchCode02 | SchName02 | Grade02 | SchCode03 | SchName03 | Grade03 | SchCode04 | SchName04 | Grade04 | SchCode05 | SchName05 | Grade05 | SchCode06 | SchName06 | Grade06 | SchCode07 | SchName07 | Grade07 | SchCode08 | SchName08 | Grade08 | SchCode09 | SchName09 | Grade09 | SchCode10 | SchName10 | Grade10 | LangClsCode | ELIdentDate | StdELRedesigDate | StdDob | Gender | Ethnicity | HomeLang | FRLunch | ParentEduLevel | US1stSchoolDate | CA1stSchoolDate | GradeFirstEnroll |
100001235 | |||||||||||||||||||||||||||||||||||||||
100002227 | |||||||||||||||||||||||||||||||||||||||
100008854 | |||||||||||||||||||||||||||||||||||||||
100022631 | |||||||||||||||||||||||||||||||||||||||
100095453 | |||||||||||||||||||||||||||||||||||||||
100506481 | |||||||||||||||||||||||||||||||||||||||
148260057 | |||||||||||||||||||||||||||||||||||||||
etc. |
This is significantly different than what you described in the original post which implied renaming the variables to the same for each student.
You appear to be in the work of a MERGE, using data step, or JOIN using SQL. What happens with the other variables listed as 5 through 16 above? Do you want them to appear at all in the final data set?
If you just want the SchCode SchName and Grade variables using a data step approach:
First sort all of the data sets by the StdPseudoID variable.
data want
MERGE
Demog01_02 (keep= StdPseudoID SchCode02 SchName02 Grade02)
Demog02_03 (keep= StdPseudoID SchCode03 SchName03 Grade03)
/* repeat for each set changing the year numbers in set name and variables */
;
by StdPseudoID;
run;
No, what I want for variables 5 through 16 is the same as I described in the original post -- I do want variables 5 through 16 to be renamed and to appear in the final data set. I only want them to appear in the final data set once. This is because a student might be in the files for some years, but not others. Also, some students may have missing values. The example I gave in the original post is a student who is present in files 2002-2006 may be missing a gender value in 2002 and 2004, but be identified as female in 2003, 2005, and 2006. So, even though the values of these variables do not change over the years (e.g. a student who is female in 2003 is also female in 2005), I have to get information from all nine files in order to create one comprehensive file.
On the other hand, I want the school code, school name, and grade variables for EACH year to be included in the final data set. Variables 5 through 16 do not change year to year. Variables two through four DO change year to year. (If you scroll to the right in the table I just posted, you will see all the variables included as they should be in the final data set.) I did not include these three variables in my original post. That is the only difference. My apologies for the confusion. I was a bit confused myself.
A much more complicated problem.
What do you want for values of those things that look like demographics if they change between data sets? First one, last one or some other rule? (verify that they do not change before saying they don't. I have worked with data where Race and Gender change between visits due to data entry issues)
I see three steps.
Step one would be something like the first step I posted to get those variables that you want only one set of renamed. Then decide which set for those that appear in multiple datasets so that you have a single record per ID. How to do this can depend on what your keep rule will be. If your data is clean enough you could use PROC sort with the NODUP options or PROC SQL and a SELECT DISTINCT to do this step.
Then get the multi-year variables merged to one record similar to my second code example.
Last merge or join the two sets together based on the id value.
It's possible that there could be some changes from year to year due to data entry mistakes; however, I believe this would be in a small number of cases. Each of these files contains over 600,000 students, so it would be difficult to identify such mistakes. Most of the demographics are collected only one time (when the student enters the school district) and then that information is included in the information system. To try to capture all the data that I can, I think it makes sense to retain the most recent data value that is not missing for any given student. Take ELredesigDate for example -- If in 2002, ELredesigDate is missing, then look to 2003. If there is a value in 2003, use that. That value (in this case, a date) is the value that will be retained. Any subsequent years for which that student is in the district should contain that date in the ELredesigDate column. A student is only redesignated one time. For another student, it may be that ELredesigDate is missing in 2002, missing in 2003, missing in 2004, and then a date is provided in 2005. In this case, that 2005 value would be retained. Even after checking all years, some values will still be missing for some students. Does that make sense?
I certainly appreciate help with the code if you know how to do this. Thanks again.
One way to get the latest information is to start with the data set example I showed first where I had:
Demog01_02 ( rename=( LangClsCode02 = LangClsCode ELIdentDate02 = ELIdentDate
StdELRedesigDate02 = StdELRedesigDate StdDob02 = StdDob
Gender02 = Gender Ethnicity02 = Ethnicity HomeLang02 = HomeLang
FRLunch02 = FRLunch ParentEduLevel02 = ParentEduLevel US1stSchoolDate02 = US1stSchoolDate
CA1stSchoolDate02 = CA1stSchoolDate GradeFirstEnroll02 = GradeFirstEnroll)
/* and since you really want to merge these back in later*/ drop=SchCode02 SchName02 Grade02
)
If your ElIdentDate was a SAS date variable this would get you the year:
year=Year(ElIdentDate);
/* but you need to parse the string to get the year using SUBSTR function since your dates are character (bad idea for SAS datasets by the way) without knowing what format I can't say for sure. If your date is mm/dd/yyyy in appearance then:*/
year = substring(ELIdentDate,7,4); /* 7 is position of first digit of year, 4 is how many characters. if you have dates like 1/2/2012 mixed with 10/10/2012 you will have to do more work*/
Then proc sort; by StdPseudoID year;run;
data wantoneid;
set want;
by StdPseudoID
;
if last.StdPseudoID;
run;
This data set will have one record for each id with the values of the variables that occured with that last record.
Then
Proc Sql;
create table final as
select a.*, b.*
from wantoneid as a left join <name of the dataset using the MERGE code example> as b on
a.StdPseudoID = b.StdPeudoID;
quit;
The above will have a warning about a variable alread in the data, that's the ID variable in both data sets but only one gets into the output. Lazy coding at this point.
I will try to work it out with this code. thanks again for your help.
Suppose library 'WORK' (HAS TO BE ALL CAPS in the following code) contains your nine inputting datasets only, and your output library is 'want', you may adopt the following code to save some typing:
proc sql;
select memname into :memname separated by ' ' from dictionary.tables
where libname='WORK' order by memname;quit;
%macro ren;
%do i=1 %to %sysfunc(countw(&memname));
%let dsn=%scan(&memname,&i);
proc sql;
select cats(name,'=',substr(name,1,length(name)-2)) into :rename separated by ' '
from dictionary.columns where LIBNAME='WORK' AND MEMNAME="%upcase(&dsn)";
proc datasets lib=work;
modify &dsn;
rename &rename;
run;
%end;
%mend;
%ren;
data want.want;
set &memname indsname=ds;
StdPseudoId=input(cats(compress(ds,,'kd'),StdPseudoId),best20.);
run;
Haikuo
As I understand your requirement you want a single table representing all the subjects in your study with the most up to date non null data for each. This is what the UPDATE statement in a datastep is designed to do.
You need to create a skeleton table to hold the wanted data.
Proc SQL ;
Create Table Want Like Demog01_02 ;
Quit ;
It's worth creating a simple macro to do the job.
%Macro NextYear(yr) ;
Data Want ;
Update want
Demog&yr ;
Run ;
%Mend ;
Then for only 10 years copy and paste is efficient
%NextYear(01_02 ) ;
%NextYear(02_03 ) ;
%NextYear(03_04 ) ;
etc
Regards
Richard in Oz
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.