BookmarkSubscribeRSS Feed
huanaca
Calcite | Level 5

I’m hoping someone can help with a data organization task.  I have nine data sets, one for each year from 2002 through 2010.  Each data set includes the same 13 variables. Each column contains a variable; each row represents a different student. These are very large files with many students.

The 13 variables include one unique ID (StdPseudoId) and 12 fixed effects (e.g. gender, date of birth, home language).  I would like to create one data set that includes all students and their respective values across the 13 variables.

What makes it just a little complicated is that many students are represented across multiple files.  For example, a student might be in the files for years 2002-2007, but not 2008-2010.  Another student might in the files for years 2005 and 2006 only. Also, some students may have missing values due to data entry errors. For example, a student who is present in files 2002-2006 may be missing a gender value in 2002 and 2004, but be identified as female in 2003, 2005, and 2006.  So, even though the values of any given variable do not change over the years (e.g. a student who is female in 2003 is also female in 2005), I have to get information from all nine files in order to create one comprehensive file.

So, what I have:


Nine data sets:


1) File name:  Demog01_02

    Variable list:

1

StdPseudoId

Number

2

LangClsCode02

Character

3

ELIdentDate02

Character

4

StdELRedesigDate02

Character

5

StdDob02

Character

6

Gender02

Character

7

Ethnicity02

Character

8

HomeLang02

Character

9

FRLunch02

Character

10

ParentEduLevel02

Character

11

US1stSchoolDate02

Character

12

CA1stSchoolDate02

Character

13

GradeFirstEnroll02

Number

2) File name:  Demog02_03

    Variable list:

1

StdPseudoId

Number

2

LangClsCode03

Character

3

ELIdentDate03

Character

4

StdELRedesigDate03

Character

5

StdDob03

Character

6

Gender03

Character

7

Ethnicity03

Character

8

HomeLang03

Character

9

FRLunch03

Character

10

ParentEduLevel03

Character

11

US1stSchoolDate03

Character

12

CA1stSchoolDate03

Character

13

GradeFirstEnroll03

Number

And the same structure for each data set, through 2010:

3) File name: Demog03_04

4) File name: Demog04_05

5) File name: Demog05_06

6) File name: Demog06_07

7) File name: Demog07_08

😎 File name: Demog08_09

9) File name: Demog09_10

What I want:

One data set with the same structure as the original data sets, but created with information from all nine data sets:

StdPseudoId

LangClsCode

ELIdentDate

StdELRedesigDate

StdDob

Gender

Ethnicity

HomeLang

FRLunch

ParentEduLevel

US1stSchoolDate

CA1stSchoolDate

GradeFirstEnroll

100001235

100002227

100008854

100022631

100095453

100506481

148260057

etc.

21 REPLIES 21
ballardw
Super User

Basically you just need to SET the data sets and rename the variables.

The INDSNAME option creates a variable, here named DataSource, which would contain the name of the input data set for each record. Remove if not needed.

 

data want;

set

     Demog01_02 ( rename=( LangClsCode02 = LangClsCode ELIdentDate02 = ELIdentDate

          StdELRedesigDate02 = StdELRedesigDate StdDob02 = StdDob

          Gender02 = Gender Ethnicity02 = Ethnicity HomeLang02 = HomeLang

          FRLunch02 = FRLunch ParentEduLevel02 = ParentEduLevel US1stSchoolDate02 = US1stSchoolDate

          CA1stSchoolDate02 = CA1stSchoolDate GradeFirstEnroll02 = GradeFirstEnroll)

          INDSNAME = DataSource)

     Demog02_03 ( rename=( LangClsCode03 = LangClsCode ELIdentDate03 = ELIdentDate

          StdELRedesigDate03 = StdELRedesigDate StdDob03 = StdDob

          Gender03 = Gender Ethnicity03 = Ethnicity HomeLang03 = HomeLang

          FRLunch03 = FRLunch ParentEduLevel03 = ParentEduLevel US1stSchoolDate03 = US1stSchoolDate

          CA1stSchoolDate03 = CA1stSchoolDate GradeFirstEnroll03 = GradeFirstEnroll)

          INDSNAME = DataSource)

/* repeat for each of the data sets, using the 04 05 etc variables with each dataset*/

; /* notice this is the end of the SET statement*/

run;

huanaca
Calcite | Level 5

Thanks very much for your response.  I am receiving an error:

Invalid option name INDSNAME.

And the resulting file, want, has not variables, no observations.  Any thoughts?

thanks again.

LarryWorley
Fluorite | Level 6

I suspect you need to move the indsname = DataSource  option to outside of the dataset option parenthesis and only use it once.  indsname is a set statement  option.  Try something like this:

Set dataset1(rename = (..........))

      dataset2(rename = (..........))

    .

    .

    datasetn(rename = (...........)) indsname = DataSource

   ;

You might consider using KSharp's sql coding.  The only downside to that is the need for a column within the individual datasets which can identify the year.  You might have that (I did not do a review of the structure of your data).  If that is not available then for each of the select statements in the sql, you would need to add field to which a constant is applied.  Using KSharp's code, then something like:

proc sql;

create table want as

select  LangClsCode02 as LangClsCode,

         ELIdentDate02 as ELIdentDate,

          StdELRedesigDate02 as StdELRedesigDate,

          StdDob02 as StdDob,

          Gender02 as Gender ,

         Ethnicity02 as Ethnicity ,

         HomeLang02 as HomeLang,

          FRLunch02 as FRLunch,

         ParentEduLevel02 as ParentEduLevel ,

         US1stSchoolDate02 as US1stSchoolDate,

          CA1stSchoolDate02 as CA1stSchoolDate,

         GradeFirstEnroll02 as GradeFirstEnroll

         ,2002 as year

from  Demog01_02

union

select *, 2003 as year from  Demog02_03

union

select *,2004 as year from  Demog03_04

union

..........

;

quit;

huanaca
Calcite | Level 5

Ballard and Larry,

Thank you very much. When I moved INDSNAME = datasource to outside the dataset option parenthesis, the code ran without error. The resulting dataset had the correct variables. However, the resulting file contained multiple observations for each individual.  So if an individual was included in the original datasets for 2001, 2002, and 2003, than that individual was repeated three times in the new dataset. An individual that was included in 8 of the 9 original datasets, was repeated eight times in the new dataset. Also, there was no variable ‘datasource.’


Again, I am trying to represent each individual only one time in the new dataset. Any thoughts?


Thanks again. 

huanaca
Calcite | Level 5

Ksharp and Larry,

Thank you for your help. The files do not contain individual columns that identify the year (please see my original post that describes the data), so I added the year in the select statement as you suggested. I received an error message. The code and log are below:

proc sql;

create table want2 as

select LangClsCode02 as LangClsCode,

ELIdentDate02 as ELIdentDate,

StdELRedesigDate02 as StdELRedesigDate,

StdDob02 as StdDob,

Gender02 as Gender,

Ethnicity02 as Ethnicity,

         HomeLang02 as HomeLang,

FRLunch02 as FRLunch,

ParentEduLevel02 as ParentEduLevel,

US1stSchoolDate02 as US1stSchoolDate,

CA1stSchoolDate02 as CA1stSchoolDate,

GradeFirstEnroll02 as GradeFirstEnroll,

             2002 as year

from temp01_02

union

select *, 2003 as year from temp02_03

union

select *, 2004 as year from temp03_04

union

select *, 2005 as year from temp04_05

union

select *, 2006 as year from temp05_06

union

select *, 2007 as year from temp06_07

union

select *, 2008 as year from temp07_08

union

select *, 2009 as year from temp08_09

union

select *, 2010 as year from temp09_10

union

;

quit;

Here is the log:

1668  proc sql;

1669

1670  create table want2 as

1671  select  LangClsCode02 as LangClsCode,

1672 ELIdentDate02 as ELIdentDate,

1673 StdELRedesigDate02 as StdELRedesigDate,

1674 StdDob02 as StdDob,

1675 Gender02 as Gender,

1676 Ethnicity02 as Ethnicity,

1677 HomeLang02 as HomeLang,

1678 FRLunch02 as FRLunch,

1679 ParentEduLevel02 as ParentEduLevel,

1680 US1stSchoolDate02 as US1stSchoolDate,

1681 CA1stSchoolDate02 as CA1stSchoolDate,

1682 GradeFirstEnroll02 as GradeFirstEnroll,

1683           2002 as year

1684  from  temp01_02

1685  union

1686  select *, 2003 as year from  temp02_03

1687  union

1688  select *, 2004 as year from  temp03_04

1689  union

1690  select *, 2005 as year from  temp04_05

1691  union

1692  select *, 2006 as year from  temp05_06

1693  union

1694  select *, 2007 as year from  temp06_07

1695  union

1696  select *, 2008 as year from  temp07_08

1697  union

1698  select *, 2009 as year from  temp08_09

1699  union

1700  select *, 2010 as year from  temp09_10

1701  union

1702  ;

      -

      22

      76

ERROR 22-322: Syntax error, expecting one of the following: (, ALL, CORR, CORRESPONDING, SELECT.

ERROR 76-322: Syntax error, statement will be ignored.

1703  quit;

NOTE: The SAS System stopped processing this step because of errors.

huanaca
Calcite | Level 5

Hi again Ballard and Larry,

My apologies! I realized I made a mistake that might have caused the problem I described in the response I just sent to you.  The original nine datasets each include 3 additional variables that DO change with every year. They are: SchCode, SchName, and Grade. For each year, I want to keep these variables, because they change. I wasn't thinking about how those additional variables would cause the individuals to be represented more than once in the new dataset. Could you please help me modify your code to account for all the variables?

To clarify, the nine datasets look like this:

1) File name:  Demog01_02

    Variable list:

1

StdPseudoId

Numeric

2

SchCode02

Numeric

3

SchName02

Character

4

Grade02

Character

5

LangClsCode02

Character

6

ELIdentDate02

Character

7

StdELRedesigDate02

Character

8

StdDob02

Character

9

Gender02

Character

10

Ethnicity02

Character

11

HomeLang02

Character

12

FRLunch02

Character

13

ParentEduLevel02

Character

14

US1stSchoolDate02

Character

15

CA1stSchoolDate02

Character

16

GradeFirstEnroll02

Numeric

2) File name: Demog02_03

    Variable list:

1

StdPseudoId

Numeric

2

SchCode03

Numeric

3

SchName03

Character

4

Grade03

Character

5

LangClsCode03

Character

6

ELIdentDate03

Character

7

StdELRedesigDate03

Character

8

StdDob03

Character

9

Gender03

Character

10

Ethnicity03

Character

11

HomeLang03

Character

12

FRLunch03

Character

13

ParentEduLevel03

Character

14

US1stSchoolDate03

Character

15

CA1stSchoolDate03

Character

16

GradeFirstEnroll03

Numeric

And the same structure for each data set, through 2010:

3) File name: Demog03_04

4) File name: Demog04_05

5) File name: Demog05_06

6) File name: Demog06_07

7) File name: Demog07_08

😎 File name: Demog08_09

9) File name: Demog09_10

What I want:

One data set with the same structure as the original data sets, but created with information from all nine data sets (keeping ALL SchCode, SchName, and Grade variables):

StdPseudoId

SchCode02

SchName02

Grade02

SchCode03

SchName03

Grade03

SchCode04

SchName04

Grade04

SchCode05SchName05Grade05SchCode06SchName06Grade06SchCode07SchName07Grade07SchCode08SchName08Grade08SchCode09SchName09Grade09SchCode10SchName10Grade10

LangClsCode

ELIdentDate

StdELRedesigDate

StdDob

Gender

Ethnicity

HomeLang

FRLunch

ParentEduLevel

US1stSchoolDate

CA1stSchoolDate

GradeFirstEnroll

100001235

100002227

100008854

100022631

100095453

100506481

148260057

etc.

ballardw
Super User

This is significantly different than what you described in the original post which implied renaming the variables to the same for each student.

You appear to be in the work of a MERGE, using data step, or JOIN using SQL. What happens with the other variables listed as 5 through 16 above? Do you want them to appear at all in the final data set?

If you just want the SchCode SchName and Grade variables using a data step approach:

First sort all of the data sets by the StdPseudoID variable.

data want

     MERGE

          Demog01_02 (keep= StdPseudoID SchCode02 SchName02 Grade02)

          Demog02_03 (keep= StdPseudoID SchCode03 SchName03 Grade03)

/* repeat for each set changing the year numbers in set name and variables */

     ;

     by StdPseudoID;

run;

huanaca
Calcite | Level 5

No, what I want for variables 5 through 16 is the same as I described in the original post -- I do want variables 5 through 16 to be renamed and to appear in the final data set. I only want them to appear in the final data set once.  This is because a student might be in the files for some years, but not others.  Also, some students may have missing values. The example I gave in the original post is a student who is present in files 2002-2006 may be missing a gender value in 2002 and 2004, but be identified as female in 2003, 2005, and 2006.  So, even though the values of these variables do not change over the years (e.g. a student who is female in 2003 is also female in 2005), I have to get information from all nine files in order to create one comprehensive file.

On the other hand, I want the school code, school name, and grade variables for EACH year to be included in the final data set. Variables 5 through 16 do not change year to year. Variables two through four DO change year to year.  (If you scroll to the right in the table I just posted, you will see all the variables included as they should be in the final data set.)  I did not include these three variables in my original post.  That is the only difference. My apologies for the confusion. I was a bit confused myself.

ballardw
Super User

A much more complicated problem.

What do you want for values of those things that look like demographics if they change between data sets? First one, last one or some other rule? (verify that they do not change before saying they don't. I have worked with data where Race and Gender change between visits due to data entry issues)

I see three steps.

Step one would be something like the first step I posted to get those variables that you want only one set of renamed. Then decide which set for those that appear in multiple datasets so that you have a single record per ID. How to do this can depend on what your keep rule will be. If your data is clean enough you could use PROC sort with the NODUP options or PROC SQL and a SELECT DISTINCT to do this step.

Then get the multi-year variables merged to one record similar to my second code example.

Last merge or join the two sets together based on the id value.

huanaca
Calcite | Level 5

It's possible that there could be some changes from year to year due to data entry mistakes; however, I believe this would be in a small number of cases.  Each of these files contains over 600,000 students, so it would be difficult to identify such mistakes. Most of the demographics are collected only one time (when the student enters the school district) and then that information is included in the information system.  To try to capture all the data that I can, I think it makes sense to retain the most recent data value that is not missing for any given student.  Take ELredesigDate for example -- If in 2002, ELredesigDate is missing, then look to 2003.  If there is a value in 2003, use that.  That value (in this case, a date) is the value that will be retained. Any subsequent years for which that student is in the district should contain that date in the ELredesigDate column.  A student is only redesignated one time.  For another student, it may be that ELredesigDate is missing in 2002, missing in 2003, missing in 2004, and then a date is provided in 2005.  In this case, that 2005 value would be retained.  Even after checking all years, some values will still be missing for some students.  Does that make sense? 

I certainly appreciate help with the code if you know how to do this.  Thanks again.

ballardw
Super User


One way to get the latest information is to start with the data set example I showed first where I had:

     Demog01_02 ( rename=( LangClsCode02 = LangClsCode ELIdentDate02 = ELIdentDate

          StdELRedesigDate02 = StdELRedesigDate StdDob02 = StdDob

          Gender02 = Gender Ethnicity02 = Ethnicity HomeLang02 = HomeLang

          FRLunch02 = FRLunch ParentEduLevel02 = ParentEduLevel US1stSchoolDate02 = US1stSchoolDate

          CA1stSchoolDate02 = CA1stSchoolDate GradeFirstEnroll02 = GradeFirstEnroll)

/* and since you really want to merge these back in later*/ drop=SchCode02 SchName02 Grade02

            )

If your ElIdentDate was a SAS date variable this would get you the year:

     year=Year(ElIdentDate);

/* but you need to parse the string to get the year using SUBSTR function since your dates are character (bad idea for SAS datasets by the way) without knowing what format I can't say for sure. If your date is mm/dd/yyyy in appearance then:*/

year = substring(ELIdentDate,7,4); /* 7 is position of first digit of year, 4 is how many characters. if you have dates like 1/2/2012 mixed with 10/10/2012 you will have to do more work*/

Then proc sort; by StdPseudoID year;run;

data wantoneid;

     set want;

     by StdPseudoID

;

     if last.StdPseudoID;

run;

This data set will have one record for each id with the values of the variables that occured with that last record.

Then

Proc Sql;

     create table final as

     select a.*, b.*

     from wantoneid as a left join <name of the dataset using the MERGE code example> as b on

          a.StdPseudoID = b.StdPeudoID;

quit;

The above will have a warning about a variable alread in the data, that's the ID variable in both data sets but only one gets into the output. Lazy coding at this point.

huanaca
Calcite | Level 5

I will try to work it out with this code. thanks again for your help. 

Haikuo
Onyx | Level 15

Suppose library 'WORK' (HAS TO BE ALL CAPS in the following code) contains your nine inputting datasets only, and your output library is 'want',  you may adopt the following code to save some typing:

proc sql;

select memname into :memname separated by ' ' from dictionary.tables

where libname='WORK' order by memname;quit;

%macro ren;

%do i=1 %to %sysfunc(countw(&memname));

%let dsn=%scan(&memname,&i);

proc sql;

  select cats(name,'=',substr(name,1,length(name)-2)) into :rename separated by ' '

from dictionary.columns where LIBNAME='WORK' AND MEMNAME="%upcase(&dsn)";

proc datasets lib=work;

modify &dsn;

rename &rename;

run;

%end;

%mend;

%ren;

data want.want;

  set &memname indsname=ds;

  StdPseudoId=input(cats(compress(ds,,'kd'),StdPseudoId),best20.);

run;

Haikuo

RichardinOz
Quartz | Level 8

As I understand your requirement you want a single table representing all the subjects in your study with the most up to date non null data for each.  This is what the UPDATE statement in a datastep is designed to do.

You need to create a skeleton table to hold the wanted data.

Proc SQL ;

    Create Table Want Like Demog01_02 ;

Quit ;

It's worth creating a simple macro to do the job.

%Macro NextYear(yr) ;

  Data Want ;

     Update want

                Demog&yr ;

  Run ;

%Mend ;

Then for only 10 years copy and paste is efficient

%NextYear(01_02 ) ;

%NextYear(02_03 ) ;

%NextYear(03_04 ) ;

etc

Regards

Richard in Oz

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 21 replies
  • 4911 views
  • 0 likes
  • 6 in conversation