Solved: Re: Using info from one dataset to create variable in a second dataset

telc24 · Posted 08-26-2020 07:29 AM

Hi everyone,

I'm hoping someone can provide guidance on this question below. I have two datasets that i'm working with:

Dataset 1: Student-level information- Tells me which students have ever attended schools; and there is an identifier stating which county they currently live in (e.g., student A has attended school 4 and lives in county 2)

data student;

input id $ school1 school2 school3 school4 school5 school6 school7 county;

cards;

A 0 0 0 1 0 0 0 2

B 1 1 0 0 0 0 0 1

C 1 0 1 0 0 1 0 3

D 0 1 0 0 0 0 0 2

E 0 0 0 0 1 1 0 4

F 0 0 0 0 0 0 1 5

;

run;

Dataset 2: Location (county) where each of seven schools are located (Note that schools can only be located in one county, but there can be multiple schools in a single county. Schools (school2) can be located in more than 1 county (county 2 and 3))

data county

input county school1 school2 school3 school4 school5 school6 school7;

cards;

1 1 0 0 1 0 0 0

2 0 1 0 0 0 0 0

3 0 1 0 0 0 0 0

4 0 0 1 0 1 1 0

5 0 0 0 0 0 0 1

;

run;

What I need to determine is: Using the county dataset, which students living in a specific county attended a school that is served by that county. I’ve determined how to code this in SAS, but am looking for a more efficient way to write this code. I’ve done this in a brute force type of method (see code below), but as you can imagine, this starts to become unwieldy as the number of counties and schools increases. I’m not sure if it would be possible to do this with some type of macro that utilizes information from one dataset (county) to create a new variable in another dataset (student).

data student2; set student;

if county=1 and (school1=1 or school4=1) then indicator=1;

if county in (2 3) and (school2=1) then indicator=2;

if county=4 and (school3=1 or school5=1 or school6=1) then indicator=3;

if county=5 and (school7=1) then indicator=4;

run;

Example of the resulting dataset called student2. Note the new indicator variable, and that some indicator variables=. if a student is located in a county but didn’t attend any of the schools served by that county.

id school1 school2 school3 school4 school5 school6 school7 county indicator

A 0 0 0 1 0 0 0 2 .

B 1 1 0 0 0 0 0 1 1

C 1 1 0 0 0 1 0 3 .

D 0 1 0 0 0 0 0 2 2

E 0 0 0 0 1 1 0 4 3

F 0 0 0 1 0 0 0 5 4

Thanks in advance for your insight!

mkeintz · Posted 08-27-2020 01:20 AM

@telc24 wrote:

@mkeintz Apologies, I should have specified more clearly- the 'rules' for my indicator variable are in my "brute force" method. I've pasted them again below.

data student2; set student;

if county=1 and (school1=1 or school4=1) then indicator=1;

if county in (2 3) and (school2=1) then indicator=2;

if county=4 and (school3=1 or school5=1 or school6=1) then indicator=3;

if county=5 and (school7=1) then indicator=4;

run;

OK, I think I get it. You want an indicator for each unique set of schools found in the county dataset. You are apparently assuming that

A county can be served by a unique set of schools, none of which serve other schools.
or
Multiple counties may share a unique set of schools.
counties may not partially share schools (i.e. county A can't be served by schools 1,2, and 3 while county B is served by 1 and 2.

The program below should do:

data student;
input id $ school1 school2 school3 school4 school5 school6 school7 county;
cards;
A 0 0 0 1 0 0 0 2
B 1 1 0 0 0 0 0 1
C 1 0 1 0 0 1 0 3
D 0 1 0 0 0 0 0 2
E 0 0 0 0 1 1 0 4
F 0 0 0 0 0 0 1 5
 ;
run;

data county;
input county school1 school2 school3 school4 school5 school6 school7;
cards;
1 1 0 0 1 0 0 0
2 0 1 0 0 0 0 0
3 0 1 0 0 0 0 0
4 0 0 1 0 1 1 0
5 0 0 0 0 0 0 1
;


data want (keep=id indicator);

  /* Use a hash object to detect and map each unique pattern of schools */
  /* dummies found in the COUNTY data set to a unique INDICATOR value   */
  if _n_=1 then do c=1 by 1 until (end_of_counties);
    set county end=end_of_counties;
	length ones_and_zeroes $50;
    ones_and_zeroes=cats(of sch:);
    if c=1 then do;
       declare hash patterns();
       patterns.definekey('ones_and_zeroes');
       patterns.definedata('indicator');
       patterns.definedone();
    end;

	/* New pattern of zeros and ones?  Then add to hash */
	if patterns.find()^=0 then do;
	  indicator=patterns.num_items+1;
	  patterns.add();
	end;

	/* Map county to indicator (a proxy for school patterns) */
	array cty_ind  {5} _temporary_;
	cty_ind{county}=indicator;

	/* Map each school to indicator also */
    array sch_ind    {7} _temporary_;
    array _sch {7} school1-school7;
    do s=1 to dim(_sch);
      if _sch{s}=1 then sch_ind{s}=indicator;
    end;
  end;

  /** Now read student counties and compare to the map above **/
  set student;
  indicator=0;

  /* Scan over all attended schools until a non-zero indicator */
  /* is found, or no more schools to check                     */
  schoolid=whichn(1,of school1-school7) ;
  do while (schoolid^=0 and indicator=0);
    if cty_ind{county}=sch_ind{schoolid} then indicator=sch_ind{schoolid};
	if indicator=0 then _sch{schoolid}=0;
    schoolid=whichn(1,of school1-school7) ;  /*Go to next attended school */
  end;
run;

The hash object is just used to record all the unique school combinations found in COUNTIES, incrementing INDICATOR each time a new combination is encountered. This also provides a map of COUNTIES to the indicator, as well as a map of its constituent schools to the indicator.

Then read the students. If any the student's schools maps to the same indicator as the student's county does, then you have a non-zero indicator value.

Additional explanatory notes:

The “trick” here is to realize that distinct sets of schools serve distinct counties. So instead of determining whether a student attends any school that serves the same county as the student lives in, you actually want to know if the student attends a school that is a member of a set-of-schools that serves that student's county.

Of course, this works because you assume that each school belongs to only one distinct set-of-schools

process the county dataset first.

First you have to set up indicator variables corresponding to these sets-of-schools. As you discover a new set, assign INDICATOR=1, then 2, etc. You do this by reading through the county dataset. With each incoming record, generate the “pattern” of schools serving that county (i.e. concatenated ones and zeroes representing the school dummies). So if school 1 and 4 serves the incoming county (school1=1 and school4=1), and all the other school vars are zero, then the pattern of zeroes-and-ones is 1001000.

Once the pattern is in hand, see if it has already been encountered via hash find method, which will retrieve the corresponding INDICATOR. But if it is new (find method failed), then set the new INDICATOR value to 1 more than the number of known patterns. Store that value in the hash object with a lookup key of 1001000 (or whatever the new key is). Effectively INDICATOR is nothing more than a id variable for each distinct pattern of schools in the COUNTY dataset, in the order they are encountered.

So much for patterns as lookup for INDICATOR. Now

Put INDICATOR in the corresponding element of the county array CTY_IND.
And then loop over the school dummies in the county dataset. For any dummy=1 put INDICATOR in the corresponding school array (SCH_IND) element.

So much for the county dataset, now for the students:

Finally you’re ready to process the students dataset. For each incoming student, get the position of the leftmost school with dummy=1 (that's the whichn(1,of school1-school7) statement). If that school is a member of the same set of schools as serve the student’s county-of-residence, then the two array elements (sch_ind{schoolid) and cnt_ind{county}) will have the same indicator value, and you don't need to examine any more schools for that student. But if not, then set that school dummy to zero, so the next left-most can be found and tested.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Kurt_Bremser · Posted 08-26-2020 07:54 AM

Student C is from county 3, but did not attend the only school there (school 2). Please review your data and expected results.

Also make sure that steps you post actually work (id needs to be read as character, the second data step misses a semicolon).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 08-26-2020 07:57 AM

And F is from county 5 and has attended the school there, so the indicator should not be missing.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 08-26-2020 08:12 AM

As usual, coding becomes easier with a long dataset structure:

data student;
input id $ school1 school2 school3 school4 school5 school6 school7 county;
cards;
A 0 0 0 1 0 0 0 2
B 1 1 0 0 0 0 0 1
C 1 0 1 0 0 1 0 3
D 0 1 0 0 0 0 0 2
E 0 0 0 0 1 1 0 4
F 0 0 0 0 0 0 1 5
;

proc transpose
  data=student
  out=student_long (
    rename=(_name_=school)
    where=(col1 ne 0)
  )
;
by id county;
var school:;
run;

data county;
input county school1 school2 school3 school4 school5 school6 school7;
cards;
1 1 0 0 1 0 0 0
2 0 1 0 0 0 0 0
3 0 1 0 0 0 0 0
4 0 0 1 0 1 1 0
5 0 0 0 0 0 0 1
;

proc transpose
  data=county
  out=county_long (
    rename=(_name_=school)
    where=(col1 ne 0)
  )
;
by county;
var school:;
run;

which now enables a simple lookup, which can be done with joins, formats or, as demonstrated here, a hash:

data want;
set student_long;
by id;
if _n_ = 1
then do;
  length school $8;
  declare hash s (dataset:"county_long");
  s.definekey('county','school');
  s.definedone();
end;
retain indicator;
if first.id then indicator = .;
if s.check() = 0 then indicator = 1;
if last.id;
keep id indicator;
run;

If you want to have special values for your indicator, you can do that also in this data step with a SELECT block.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

RichardDeVen · Posted 08-26-2020 08:19 AM

You can transpose each table into a tall form that allow an easier SQL join.

The join will use SQL summary functions such as SUM(logical-expression) and COUNT() to compute how many schools in county and out of county a student attended. SUM(logical-expression) is a 'trick' that leverages the logic result false=0 and true=1.

Example:

data student;
input id $ school1 school2 school3 school4 school5 school6 school7 county;
cards;
A 0 0 0 1 0 0 0 2
B 1 1 0 0 0 0 0 1
C 1 0 1 0 0 1 0 3
D 0 1 0 0 0 0 0 2
E 0 0 0 0 1 1 0 4
F 0 0 0 0 0 0 1 5
 ;
run;

data county;
input county school1 school2 school3 school4 school5 school6 school7;
cards;
1 1 0 0 1 0 0 0
2 0 1 0 0 0 0 0
3 0 1 0 0 0 0 0
4 0 0 1 0 1 1 0
5 0 0 0 0 0 0 1
;

proc transpose data=student out=students(where=(col1) rename=_name_=school);
  by id county;
  var school:;
  label school = ' '; * applied to output data set;
run;

proc transpose data=county out=counties(where=(col1) rename=_name_=school);
  by county;
  var school:;
  label school = ' '; * applied to output data set;
run;

proc sql;
  create table student_places as
  select
    students.id
  , sum(students.school eq counties.school) as schools_attended_in_county
  , count(distinct students.school) - calculated schools_attended_in_county as schools_attended_not_in_county
  from 
    students
  join 
     counties
  on 
     students.county = counties.county
  group by
    students.id
  ;

Output

mkeintz · Posted 08-26-2020 10:13 AM

You've gotten excellent advice from excellent advisors on transposing wide data to narrow data - in order to make subsequent linkages between the data sets. Just imagine what it would look like if you had, say, 700 schools.

But in cases where the datasets are large, and the powers-that-be declare that the wide data is the permanent arrangement, there can be substantial resources used in preparing the transposed intermediate data sets.

Here's a single-step solution that can be worth using in such cases:

data want (keep=id indicator);
  ** Make an array mapping schools to their counties **;
  if _n_=1 then do until (end_of_counties);
    set county end=end_of_counties;
    array sch_to_county {7} _temporary_;
    array _sch {7} school1-school7;
    do s=1 to dim(_sch);
      if _sch{s}=1 then sch_to_county{s}=county;
    end;
  end;
  ** Now read student counties and compare to the map above **;	  
  set student;
  schoolid=whichn(1,of school1-school7);
  indicator=  (county=sch_to_county{schoolid});
run;

Of course this depends on the fact that the school variable is an integer, thereby allowing easy use of arrays for mapping school to county.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

telc24 · Posted 08-26-2020 07:18 PM

@Kurt_Bremser , @mkeintz This is close to what I'm looking for, thank you! I tried running both of your codes, and the only issue that I'm having is that the indicator variable needs to have a range of values (i.e., in this case, they are allowed to range from 1-5). This is how I coded it in the "brute force" method.

I couldn't quite figure out how to edit your code to to get these values because I'm unfamiliar with the syntax (e.g., the "check()" in Kurt's code). If there are any resources you could point me to, that would be greatly appreciated!

PS- @Kurt_Bremser Thanks for catching the error in my example resulting dataset. I've modified it in my original post.

mkeintz · Posted 08-26-2020 08:07 PM

If indicator is 1 to 5, you need to tell us the rules. AFAICT, your rule was binary based on whether the student lives in the same county as his/her school (indicator =1) vs not the same county (indicator=0).

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

telc24 · Posted 08-26-2020 08:12 PM

@mkeintz Apologies, I should have specified more clearly- the 'rules' for my indicator variable are in my "brute force" method. I've pasted them again below.

data student2; set student;

if county=1 and (school1=1 or school4=1) then indicator=1;

if county in (2 3) and (school2=1) then indicator=2;

if county=4 and (school3=1 or school5=1 or school6=1) then indicator=3;

if county=5 and (school7=1) then indicator=4;

run;

mkeintz · Posted 08-27-2020 01:20 AM

@telc24 wrote:

@mkeintz Apologies, I should have specified more clearly- the 'rules' for my indicator variable are in my "brute force" method. I've pasted them again below.

data student2; set student;

if county=1 and (school1=1 or school4=1) then indicator=1;

if county in (2 3) and (school2=1) then indicator=2;

if county=4 and (school3=1 or school5=1 or school6=1) then indicator=3;

if county=5 and (school7=1) then indicator=4;

run;

OK, I think I get it. You want an indicator for each unique set of schools found in the county dataset. You are apparently assuming that

A county can be served by a unique set of schools, none of which serve other schools.
or
Multiple counties may share a unique set of schools.
counties may not partially share schools (i.e. county A can't be served by schools 1,2, and 3 while county B is served by 1 and 2.

The program below should do:

data student;
input id $ school1 school2 school3 school4 school5 school6 school7 county;
cards;
A 0 0 0 1 0 0 0 2
B 1 1 0 0 0 0 0 1
C 1 0 1 0 0 1 0 3
D 0 1 0 0 0 0 0 2
E 0 0 0 0 1 1 0 4
F 0 0 0 0 0 0 1 5
 ;
run;

data county;
input county school1 school2 school3 school4 school5 school6 school7;
cards;
1 1 0 0 1 0 0 0
2 0 1 0 0 0 0 0
3 0 1 0 0 0 0 0
4 0 0 1 0 1 1 0
5 0 0 0 0 0 0 1
;


data want (keep=id indicator);

  /* Use a hash object to detect and map each unique pattern of schools */
  /* dummies found in the COUNTY data set to a unique INDICATOR value   */
  if _n_=1 then do c=1 by 1 until (end_of_counties);
    set county end=end_of_counties;
	length ones_and_zeroes $50;
    ones_and_zeroes=cats(of sch:);
    if c=1 then do;
       declare hash patterns();
       patterns.definekey('ones_and_zeroes');
       patterns.definedata('indicator');
       patterns.definedone();
    end;

	/* New pattern of zeros and ones?  Then add to hash */
	if patterns.find()^=0 then do;
	  indicator=patterns.num_items+1;
	  patterns.add();
	end;

	/* Map county to indicator (a proxy for school patterns) */
	array cty_ind  {5} _temporary_;
	cty_ind{county}=indicator;

	/* Map each school to indicator also */
    array sch_ind    {7} _temporary_;
    array _sch {7} school1-school7;
    do s=1 to dim(_sch);
      if _sch{s}=1 then sch_ind{s}=indicator;
    end;
  end;

  /** Now read student counties and compare to the map above **/
  set student;
  indicator=0;

  /* Scan over all attended schools until a non-zero indicator */
  /* is found, or no more schools to check                     */
  schoolid=whichn(1,of school1-school7) ;
  do while (schoolid^=0 and indicator=0);
    if cty_ind{county}=sch_ind{schoolid} then indicator=sch_ind{schoolid};
	if indicator=0 then _sch{schoolid}=0;
    schoolid=whichn(1,of school1-school7) ;  /*Go to next attended school */
  end;
run;

The hash object is just used to record all the unique school combinations found in COUNTIES, incrementing INDICATOR each time a new combination is encountered. This also provides a map of COUNTIES to the indicator, as well as a map of its constituent schools to the indicator.

Then read the students. If any the student's schools maps to the same indicator as the student's county does, then you have a non-zero indicator value.

Additional explanatory notes:

The “trick” here is to realize that distinct sets of schools serve distinct counties. So instead of determining whether a student attends any school that serves the same county as the student lives in, you actually want to know if the student attends a school that is a member of a set-of-schools that serves that student's county.

Of course, this works because you assume that each school belongs to only one distinct set-of-schools

process the county dataset first.

First you have to set up indicator variables corresponding to these sets-of-schools. As you discover a new set, assign INDICATOR=1, then 2, etc. You do this by reading through the county dataset. With each incoming record, generate the “pattern” of schools serving that county (i.e. concatenated ones and zeroes representing the school dummies). So if school 1 and 4 serves the incoming county (school1=1 and school4=1), and all the other school vars are zero, then the pattern of zeroes-and-ones is 1001000.

Once the pattern is in hand, see if it has already been encountered via hash find method, which will retrieve the corresponding INDICATOR. But if it is new (find method failed), then set the new INDICATOR value to 1 more than the number of known patterns. Store that value in the hash object with a lookup key of 1001000 (or whatever the new key is). Effectively INDICATOR is nothing more than a id variable for each distinct pattern of schools in the COUNTY dataset, in the order they are encountered.

So much for patterns as lookup for INDICATOR. Now

Put INDICATOR in the corresponding element of the county array CTY_IND.
And then loop over the school dummies in the county dataset. For any dummy=1 put INDICATOR in the corresponding school array (SCH_IND) element.

So much for the county dataset, now for the students:

Finally you’re ready to process the students dataset. For each incoming student, get the position of the leftmost school with dummy=1 (that's the whichn(1,of school1-school7) statement). If that school is a member of the same set of schools as serve the student’s county-of-residence, then the two array elements (sch_ind{schoolid) and cnt_ind{county}) will have the same indicator value, and you don't need to examine any more schools for that student. But if not, then set that school dummy to zero, so the next left-most can be found and tested.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

telc24 · Posted 08-27-2020 07:47 AM

@mkeintz Yes, this code works perfectly, thank you so much! Now I just need to sit down and digest how you wrote all of this 🙂

Thanks for your brilliant help!

telc24 · Posted 08-27-2020 10:02 PM

@mkeintz thank you for the additional notes!

I have another question. While my real data structure is similar to the example provided, I think there's one difference within my dataset that's causing an issue in the code.

The "county" variable in both the student and county datasets is actually a FIPS code (so it obviously doesn't simply range from 1-5). How would I treat that in the array? (specifically "array cty_ind {5}") What is the number that I should put in the bracket?

I tried a few different numbers within the bracket, including: (1) total number of unique FIPS codes (62); (2) maximum FIPS code (31123).

All options lead to a SAS error of "ERROR: Array subscript out of range at line 2314 column 5."

Happy to provide more details if helpful. Thank you so much for your help!!

mkeintz · Posted 08-28-2020 12:00 AM

If

FIPS code is a numeric variable, with positive integer values.
The max code is 31123,

then array size of 31123 should work.

So look at your SAS log - it's there to help you. What does the log report as the value of the FIPS code causing the problem?

If FIPS ever missing? That would be out of range.

Know thy data.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

telc24 · Posted 08-28-2020 09:04 AM

@mkeintz Yes, thank you. The log tells me exactly what the FIPS code is. And that adjusted the code so it works.

I do have a tweak that I need to make to your code and have been playing around with it, but want to get your input. I'm happy to post this in a separate (new) post.

All of your assumptions are correct. The difference is that a student in a specific county (which has a unique set of schools) can attend ANY one of those schools in the "set" to get the indicator value. (Correct me if I'm wrong, but) currently, the code stipulates that the student in the county needs to attend ALL the schools in that unique set in order to get a non-zero indicator value.

I've been trying to edit the last portion of your code to (particularly the "if cty_ind{county}=sch_ind{schoolid} then" portion), but can't quite get it right.

mkeintz · Posted 08-28-2020 11:35 AM

@telc24 said:

Correct me if I'm wrong, but) currently, the code stipulates that the student in the county needs to attend ALL the schools in that unique set in order to get a non-zero indicator value.

And I responsded:

~~Take a closer look at the UNTIL expression, which stops the loop when a match is found. It only examines other schools when no prior school matches.~~

which provoked the question:

@mkeintz Are you suggesting to edit the "until" expression at the very beginning ("until (end_of_counties)")?

Oops. This is the consequence of me recalling my program by memory instead of looking at it again. I should have said to examine the "while" expression to determine whether the matching algorithm only requires that ONE of the student's school need to satisfy the matching criterion, vs all.

But of course, in addition to examining the code, the best way to answer such a question is to create a sample with 2 students, in which the first student attends multiple schools that all satisfy the match, and the second student has at least one that does not match, and one that does. I recommend this as a go-to technique to confirm or disprove my expectations.

Editted addition: take a look at student B who lives in county 1, and attends schools 1, 2, and 7. County 1 is served by schools 1 and 4.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Registration is open

SAS Training: Just a Click Away