01-05-2017 03:36 PM
So usually I try to solve something first before asking a question, but I really have no idea where to start with this one.
Basically, I am working with an aggregated data set of the number of students who are absent from school on a given day. The data is aggregated by school name, sex, age, grade, absence reason, and day. When a school doesn't have any absenteeism for a given sex, age, grade, absence reason, and day combination, there is simply no entry.
I need to create an entry for each missing combination of school name, sex, age, grade, absence reason, and day, where the number of students absent = 0.
It needs to be automated (this code would be run daily), and thus be able to identify which combinations are not present and then create them. All fields would be populated from existing fields, except for the number of absent students which would always be equal to 0.
I have no idea where to begin.
(Using SAS 9.4, if needed I can supply a sample dataset, it would just take some)
01-05-2017 04:01 PM
Since you are dealing with aggregate data and presumably should only have one record for each combination a possible relatively simple approach would be to make a base data set with all zero values and use either a merge or update to create a new combined data set.
But your "day" variable may or may not make that easy. Is it supposed to be the day of the week or a date? Or possibly the DAY could be provided differently. Example data would help.
Here is a rough example:
data base; do sname = 'Washington','Lincoln','Jefferson'; do grade = 1 to 6; do age = 5 to 10; absentee = 0; output; end; end; end; run; proc sort data=base; by sname grade age; run; data transaction; informat sname $10.; input sname grade age absentee; datalines; Washington 1 5 3 Washington 2 7 4 Washington 5 10 1 Lincoln 2 5 9 Lincoln 2 7 18 Lincoln 3 10 6 Jefferson 3 8 99 ; run; proc sort data=transaction; by sname grade age; run; data daily; update base transaction ; by sname grade age; run;
Note that the sets have to be sorted the same for update to work.
I appreciate that your combinations of age and grade are likely to be school dependent. Ensure your rules are enforced in the base data. One thing to be aware of is that if the school name spelling is not exactly the same in the transaction data set you have new schools in the results. Similarly new combinations of the values of other variables will be added to the result.
You could probably start on your BASE data set with a good enrollment count data set.
01-05-2017 04:22 PM - edited 01-05-2017 05:25 PM
Hmm, that is very helpful and a great place to start! Thank you.
The "day" variable is in DATETIME20. format, so mm/dd/yyyy 00:00:00 AM, but the time portion is always 12:00:00 AM. (Why they use datetime instead of just date, I'm not sure).
I wonder if there is a way to make a reference file for the school names instaed of listing them each out like that, because there are currently over 500 schools and more will be added in the future.
And yes, you are right, I will need to see which schools have which grades because they vary. I'm assuming that the
do sname = 'Washington','Lincoln','Jefferson'; do grade = 1 to 6; do age = 5 to 10; absentee = 0;
portion will make a combination of each school, age, and grade. I can play around with this, because tyically a given grade will have a range of 3 years of ages for example grade 6 should have age 5,6,7 etc.), so this may produce many entries/combinations that wouldn't exist in the actual data.
I would also need to incorporate the date in there. Of course I woulnd't need it for weekends so that may be a pain.
01-05-2017 06:20 PM
If you have a large dataset, which it sounds like you do, you can look at the SPARSE option in PROC FREQ which will generate all possible combinations including your 0's. It may be too many or too little but you can filter it out.
01-05-2017 06:38 PM
I work with my state's public school data a little bit. As far as the School and grade issues I would really hope that there is an ENROLLMENT data set somewhere, even last year's would help as a start. That should let you pull School and grade information with students actually enrolled.
Gender is no probem.
Age is likely grade dependent but I would hesitate to recommend a limits an ages. I would expect that there might be a guideline about that. I also wonder about age as component of this report as children's ages change during the year.
The date time value they send you can be easily reduced to just the date with the DATEPART function. The reason I consider most likely for the datetime is lazy database admin or programmer. If they aren't doing something like absent between 10am and 1Pm then the time component is useless. You don't need it, get rid of it.
If you can get separate datasets of some of these elements, such as your date you can combine them relatively easily. Here's an example:
data school; informat sname $10.; input sname grade; datalines; Washington 1 Washington 2 Washington 3 Jefferson 3 Jefferson 4 Jefferson 5 Jefferson 6 Lincoln 5 Lincoln 6 Lincoln 7 Lincoln 8 Adams 9 Adams 10 Adams 11 Adams 12 ; data sex; input sex $; datalines; Male Female ; run; proc sql; create table base as select sname,sex,grade from school, sex order by sname,grade,sex; quit;
You get a message from the Proc Sql along these lines:
NOTE: The execution of this query involves performing one or more Cartesian product joins that
can not be optimized.
That is basically what you are requesting combine everything and there really isn't a way to optimize that for this example.