Re: Create one data set with info from multiple data sets

huanaca · Posted 10-23-2012 03:25 PM

I’m hoping someone can help with a data organization task. I have nine data sets, one for each year from 2002 through 2010. Each data set includes the same 13 variables. Each column contains a variable; each row represents a different student. These are very large files with many students.

The 13 variables include one unique ID (StdPseudoId) and 12 fixed effects (e.g. gender, date of birth, home language). I would like to create one data set that includes all students and their respective values across the 13 variables.

What makes it just a little complicated is that many students are represented across multiple files. For example, a student might be in the files for years 2002-2007, but not 2008-2010. Another student might in the files for years 2005 and 2006 only. Also, some students may have missing values due to data entry errors. For example, a student who is present in files 2002-2006 may be missing a gender value in 2002 and 2004, but be identified as female in 2003, 2005, and 2006. So, even though the values of any given variable do not change over the years (e.g. a student who is female in 2003 is also female in 2005), I have to get information from all nine files in order to create one comprehensive file.

So, what I have:

Nine data sets:

1) File name: Demog01_02

Variable list:

1	StdPseudoId	Number
2	LangClsCode02	Character
3	ELIdentDate02	Character
4	StdELRedesigDate02	Character
5	StdDob02	Character
6	Gender02	Character
7	Ethnicity02	Character
8	HomeLang02	Character
9	FRLunch02	Character
10	ParentEduLevel02	Character
11	US1stSchoolDate02	Character
12	CA1stSchoolDate02	Character
13	GradeFirstEnroll02	Number

2) File name: Demog02_03

Variable list:

1	StdPseudoId	Number
2	LangClsCode03	Character
3	ELIdentDate03	Character
4	StdELRedesigDate03	Character
5	StdDob03	Character
6	Gender03	Character
7	Ethnicity03	Character
8	HomeLang03	Character
9	FRLunch03	Character
10	ParentEduLevel03	Character
11	US1stSchoolDate03	Character
12	CA1stSchoolDate03	Character
13	GradeFirstEnroll03	Number

And the same structure for each data set, through 2010:

3) File name: Demog03_04

4) File name: Demog04_05

5) File name: Demog05_06

6) File name: Demog06_07

7) File name: Demog07_08

😎 File name: Demog08_09

9) File name: Demog09_10

What I want:

One data set with the same structure as the original data sets, but created with information from all nine data sets:

StdPseudoId

LangClsCode

ELIdentDate

StdELRedesigDate

StdDob

Gender

Ethnicity

HomeLang

FRLunch

ParentEduLevel

US1stSchoolDate

CA1stSchoolDate

GradeFirstEnroll

100001235

100002227

100008854

100022631

100095453

100506481

148260057

etc.

ballardw · Posted 10-23-2012 03:44 PM

Basically you just need to SET the data sets and rename the variables.

The INDSNAME option creates a variable, here named DataSource, which would contain the name of the input data set for each record. Remove if not needed.

data want;

set

Demog01_02 ( rename=( LangClsCode02 = LangClsCode ELIdentDate02 = ELIdentDate

StdELRedesigDate02 = StdELRedesigDate StdDob02 = StdDob

Gender02 = Gender Ethnicity02 = Ethnicity HomeLang02 = HomeLang

FRLunch02 = FRLunch ParentEduLevel02 = ParentEduLevel US1stSchoolDate02 = US1stSchoolDate

CA1stSchoolDate02 = CA1stSchoolDate GradeFirstEnroll02 = GradeFirstEnroll)

INDSNAME = DataSource)

Demog02_03 ( rename=( LangClsCode03 = LangClsCode ELIdentDate03 = ELIdentDate

StdELRedesigDate03 = StdELRedesigDate StdDob03 = StdDob

Gender03 = Gender Ethnicity03 = Ethnicity HomeLang03 = HomeLang

FRLunch03 = FRLunch ParentEduLevel03 = ParentEduLevel US1stSchoolDate03 = US1stSchoolDate

CA1stSchoolDate03 = CA1stSchoolDate GradeFirstEnroll03 = GradeFirstEnroll)

INDSNAME = DataSource)

/* repeat for each of the data sets, using the 04 05 etc variables with each dataset*/

; /* notice this is the end of the SET statement*/

run;

huanaca · Posted 10-25-2012 03:05 PM

Thanks very much for your response. I am receiving an error:

Invalid option name INDSNAME.

And the resulting file, want, has not variables, no observations. Any thoughts?

thanks again.

LarryWorley · Posted 10-25-2012 03:34 PM

I suspect you need to move the indsname = DataSource option to outside of the dataset option parenthesis and only use it once. indsname is a set statement option. Try something like this:

Set dataset1(rename = (..........))

dataset2(rename = (..........))

.

datasetn(rename = (...........)) indsname = DataSource

;

You might consider using KSharp's sql coding. The only downside to that is the need for a column within the individual datasets which can identify the year. You might have that (I did not do a review of the structure of your data). If that is not available then for each of the select statements in the sql, you would need to add field to which a constant is applied. Using KSharp's code, then something like:

proc sql;

create table want as

select LangClsCode02 as LangClsCode,

ELIdentDate02 as ELIdentDate,

StdELRedesigDate02 as StdELRedesigDate,

StdDob02 as StdDob,

Gender02 as Gender ,

Ethnicity02 as Ethnicity ,

HomeLang02 as HomeLang,

FRLunch02 as FRLunch,

ParentEduLevel02 as ParentEduLevel ,

US1stSchoolDate02 as US1stSchoolDate,

CA1stSchoolDate02 as CA1stSchoolDate,

GradeFirstEnroll02 as GradeFirstEnroll

,2002 as year

from Demog01_02

union

select *, 2003 as year from Demog02_03

union

select *,2004 as year from Demog03_04

union

..........

;

quit;

huanaca · Posted 10-26-2012 12:18 PM

Ballard and Larry,

Thank you very much. When I moved INDSNAME = datasource to outside the dataset option parenthesis, the code ran without error. The resulting dataset had the correct variables. However, the resulting file contained multiple observations for each individual. So if an individual was included in the original datasets for 2001, 2002, and 2003, than that individual was repeated three times in the new dataset. An individual that was included in 8 of the 9 original datasets, was repeated eight times in the new dataset. Also, there was no variable ‘datasource.’

Again, I am trying to represent each individual only one time in the new dataset. Any thoughts?

Thanks again.

huanaca · Posted 10-26-2012 12:23 PM

Ksharp and Larry,

Thank you for your help. The files do not contain individual columns that identify the year (please see my original post that describes the data), so I added the year in the select statement as you suggested. I received an error message. The code and log are below:

proc sql;

create table want2 as

select LangClsCode02 as LangClsCode,

ELIdentDate02 as ELIdentDate,

StdELRedesigDate02 as StdELRedesigDate,

StdDob02 as StdDob,

Gender02 as Gender,

Ethnicity02 as Ethnicity,

HomeLang02 as HomeLang,

FRLunch02 as FRLunch,

ParentEduLevel02 as ParentEduLevel,

US1stSchoolDate02 as US1stSchoolDate,

CA1stSchoolDate02 as CA1stSchoolDate,

GradeFirstEnroll02 as GradeFirstEnroll,

2002 as year

from temp01_02

union

select *, 2003 as year from temp02_03

union

select *, 2004 as year from temp03_04

union

select *, 2005 as year from temp04_05

union

select *, 2006 as year from temp05_06

union

select *, 2007 as year from temp06_07

union

select *, 2008 as year from temp07_08

union

select *, 2009 as year from temp08_09

union

select *, 2010 as year from temp09_10

union

;

quit;

Here is the log:

1668 proc sql;

1669

1670 create table want2 as

1671 select LangClsCode02 as LangClsCode,

1672 ELIdentDate02 as ELIdentDate,

1673 StdELRedesigDate02 as StdELRedesigDate,

1674 StdDob02 as StdDob,

1675 Gender02 as Gender,

1676 Ethnicity02 as Ethnicity,

1677 HomeLang02 as HomeLang,

1678 FRLunch02 as FRLunch,

1679 ParentEduLevel02 as ParentEduLevel,

1680 US1stSchoolDate02 as US1stSchoolDate,

1681 CA1stSchoolDate02 as CA1stSchoolDate,

1682 GradeFirstEnroll02 as GradeFirstEnroll,

1683 2002 as year

1684 from temp01_02

1685 union

1686 select *, 2003 as year from temp02_03

1687 union

1688 select *, 2004 as year from temp03_04

1689 union

1690 select *, 2005 as year from temp04_05

1691 union

1692 select *, 2006 as year from temp05_06

1693 union

1694 select *, 2007 as year from temp06_07

1695 union

1696 select *, 2008 as year from temp07_08

1697 union

1698 select *, 2009 as year from temp08_09

1699 union

1700 select *, 2010 as year from temp09_10

1701 union

1702 ;

-

22

76

ERROR 22-322: Syntax error, expecting one of the following: (, ALL, CORR, CORRESPONDING, SELECT.

ERROR 76-322: Syntax error, statement will be ignored.

1703 quit;

NOTE: The SAS System stopped processing this step because of errors.

huanaca · Posted 10-26-2012 01:37 PM

Hi again Ballard and Larry,

My apologies! I realized I made a mistake that might have caused the problem I described in the response I just sent to you. The original nine datasets each include 3 additional variables that DO change with every year. They are: SchCode, SchName, and Grade. For each year, I want to keep these variables, because they change. I wasn't thinking about how those additional variables would cause the individuals to be represented more than once in the new dataset. Could you please help me modify your code to account for all the variables?

To clarify, the nine datasets look like this:

1) File name: Demog01_02

Variable list:

1	StdPseudoId	Numeric
2	SchCode02	Numeric
3	SchName02	Character
4	Grade02	Character
5	LangClsCode02	Character
6	ELIdentDate02	Character
7	StdELRedesigDate02	Character
8	StdDob02	Character
9	Gender02	Character
10	Ethnicity02	Character
11	HomeLang02	Character
12	FRLunch02	Character
13	ParentEduLevel02	Character
14	US1stSchoolDate02	Character
15	CA1stSchoolDate02	Character
16	GradeFirstEnroll02	Numeric

2) File name: Demog02_03

Variable list:

1	StdPseudoId	Numeric
2	SchCode03	Numeric
3	SchName03	Character
4	Grade03	Character
5	LangClsCode03	Character
6	ELIdentDate03	Character
7	StdELRedesigDate03	Character
8	StdDob03	Character
9	Gender03	Character
10	Ethnicity03	Character
11	HomeLang03	Character
12	FRLunch03	Character
13	ParentEduLevel03	Character
14	US1stSchoolDate03	Character
15	CA1stSchoolDate03	Character
16	GradeFirstEnroll03	Numeric

And the same structure for each data set, through 2010:

3) File name: Demog03_04

4) File name: Demog04_05

5) File name: Demog05_06

6) File name: Demog06_07

7) File name: Demog07_08

😎 File name: Demog08_09

9) File name: Demog09_10

What I want:

One data set with the same structure as the original data sets, but created with information from all nine data sets (keeping ALL SchCode, SchName, and Grade variables):

StdPseudoId

SchCode02

SchName02

Grade02

SchCode03

SchName03

Grade03

SchCode04

SchName04

Grade04

SchCode05

SchName05

Grade05

SchCode06

SchName06

Grade06

SchCode07

SchName07

Grade07

SchCode08

SchName08

Grade08

SchCode09

SchName09

Grade09

SchCode10

SchName10

Grade10

LangClsCode

ELIdentDate

StdELRedesigDate

StdDob

Gender

Ethnicity

HomeLang

FRLunch

ParentEduLevel

US1stSchoolDate

CA1stSchoolDate

GradeFirstEnroll

100001235

100002227

100008854

100022631

100095453

100506481

148260057

etc.

ballardw · Posted 10-26-2012 02:10 PM

This is significantly different than what you described in the original post which implied renaming the variables to the same for each student.

You appear to be in the work of a MERGE, using data step, or JOIN using SQL. What happens with the other variables listed as 5 through 16 above? Do you want them to appear at all in the final data set?

If you just want the SchCode SchName and Grade variables using a data step approach:

First sort all of the data sets by the StdPseudoID variable.

data want

MERGE

Demog01_02 (keep= StdPseudoID SchCode02 SchName02 Grade02)

Demog02_03 (keep= StdPseudoID SchCode03 SchName03 Grade03)

/* repeat for each set changing the year numbers in set name and variables */

;

by StdPseudoID;

run;

huanaca · Posted 10-26-2012 02:49 PM

No, what I want for variables 5 through 16 is the same as I described in the original post -- I do want variables 5 through 16 to be renamed and to appear in the final data set. I only want them to appear in the final data set once. This is because a student might be in the files for some years, but not others. Also, some students may have missing values. The example I gave in the original post is a student who is present in files 2002-2006 may be missing a gender value in 2002 and 2004, but be identified as female in 2003, 2005, and 2006. So, even though the values of these variables do not change over the years (e.g. a student who is female in 2003 is also female in 2005), I have to get information from all nine files in order to create one comprehensive file.

On the other hand, I want the school code, school name, and grade variables for EACH year to be included in the final data set. Variables 5 through 16 do not change year to year. Variables two through four DO change year to year. (If you scroll to the right in the table I just posted, you will see all the variables included as they should be in the final data set.) I did not include these three variables in my original post. That is the only difference. My apologies for the confusion. I was a bit confused myself.

ballardw · Posted 10-26-2012 04:57 PM

A much more complicated problem.

What do you want for values of those things that look like demographics if they change between data sets? First one, last one or some other rule? (verify that they do not change before saying they don't. I have worked with data where Race and Gender change between visits due to data entry issues)

I see three steps.

Step one would be something like the first step I posted to get those variables that you want only one set of renamed. Then decide which set for those that appear in multiple datasets so that you have a single record per ID. How to do this can depend on what your keep rule will be. If your data is clean enough you could use PROC sort with the NODUP options or PROC SQL and a SELECT DISTINCT to do this step.

Then get the multi-year variables merged to one record similar to my second code example.

Last merge or join the two sets together based on the id value.

huanaca · Posted 10-26-2012 05:19 PM

It's possible that there could be some changes from year to year due to data entry mistakes; however, I believe this would be in a small number of cases. Each of these files contains over 600,000 students, so it would be difficult to identify such mistakes. Most of the demographics are collected only one time (when the student enters the school district) and then that information is included in the information system. To try to capture all the data that I can, I think it makes sense to retain the most recent data value that is not missing for any given student. Take ELredesigDate for example -- If in 2002, ELredesigDate is missing, then look to 2003. If there is a value in 2003, use that. That value (in this case, a date) is the value that will be retained. Any subsequent years for which that student is in the district should contain that date in the ELredesigDate column. A student is only redesignated one time. For another student, it may be that ELredesigDate is missing in 2002, missing in 2003, missing in 2004, and then a date is provided in 2005. In this case, that 2005 value would be retained. Even after checking all years, some values will still be missing for some students. Does that make sense?

I certainly appreciate help with the code if you know how to do this. Thanks again.

ballardw · Posted 10-26-2012 06:42 PM

One way to get the latest information is to start with the data set example I showed first where I had:

Demog01_02 ( rename=( LangClsCode02 = LangClsCode ELIdentDate02 = ELIdentDate

StdELRedesigDate02 = StdELRedesigDate StdDob02 = StdDob

Gender02 = Gender Ethnicity02 = Ethnicity HomeLang02 = HomeLang

FRLunch02 = FRLunch ParentEduLevel02 = ParentEduLevel US1stSchoolDate02 = US1stSchoolDate

CA1stSchoolDate02 = CA1stSchoolDate GradeFirstEnroll02 = GradeFirstEnroll)

/* and since you really want to merge these back in later*/ drop=SchCode02 SchName02 Grade02

)

If your ElIdentDate was a SAS date variable this would get you the year:

year=Year(ElIdentDate);

/* but you need to parse the string to get the year using SUBSTR function since your dates are character (bad idea for SAS datasets by the way) without knowing what format I can't say for sure. If your date is mm/dd/yyyy in appearance then:*/

year = substring(ELIdentDate,7,4); /* 7 is position of first digit of year, 4 is how many characters. if you have dates like 1/2/2012 mixed with 10/10/2012 you will have to do more work*/

Then proc sort; by StdPseudoID year;run;

data wantoneid;

set want;

by StdPseudoID

;

if last.StdPseudoID;

run;

This data set will have one record for each id with the values of the variables that occured with that last record.

Then

Proc Sql;

create table final as

select a.*, b.*

from wantoneid as a left join <name of the dataset using the MERGE code example> as b on

a.StdPseudoID = b.StdPeudoID;

quit;

The above will have a warning about a variable alread in the data, that's the ID variable in both data sets but only one gets into the output. Lazy coding at this point.

huanaca · Posted 10-27-2012 11:01 AM

I will try to work it out with this code. thanks again for your help.

Haikuo · Posted 10-23-2012 04:33 PM

Suppose library 'WORK' (HAS TO BE ALL CAPS in the following code) contains your nine inputting datasets only, and your output library is 'want', you may adopt the following code to save some typing:

proc sql;

select memname into :memname separated by ' ' from dictionary.tables

where libname='WORK' order by memname;quit;

%macro ren;

%do i=1 %to %sysfunc(countw(&memname));

%let dsn=%scan(&memname,&i);

proc sql;

select cats(name,'=',substr(name,1,length(name)-2)) into :rename separated by ' '

from dictionary.columns where LIBNAME='WORK' AND MEMNAME="%upcase(&dsn)";

proc datasets lib=work;

modify &dsn;

rename &rename;

run;

%end;

%mend;

%ren;

data want.want;

set &memname indsname=ds;

StdPseudoId=input(cats(compress(ds,,'kd'),StdPseudoId),best20.);

run;

Haikuo

RichardinOz · Posted 10-23-2012 07:51 PM

As I understand your requirement you want a single table representing all the subjects in your study with the most up to date non null data for each. This is what the UPDATE statement in a datastep is designed to do.

You need to create a skeleton table to hold the wanted data.

Proc SQL ;

Create Table Want Like Demog01_02 ;

Quit ;

It's worth creating a simple macro to do the job.

%Macro NextYear(yr) ;

Data Want ;

Update want

Demog&yr ;

Run ;

%Mend ;

Then for only 10 years copy and paste is efficient

%NextYear(01_02 ) ;

%NextYear(02_03 ) ;

%NextYear(03_04 ) ;

etc

Regards

Richard in Oz

Registration is open

SAS Training: Just a Click Away