03-11-2016 02:14 PM
I am working with personnel records for more than 3 million people over a 30-year period. The records are in long format, one event per row per person, so some people have as few as 3 or 4 rows, while other people have as many as 120+ rows. There are 108 variables.
SSN is the primary BY-group processing variable. Across an individual’s records, event_code is used to distinguish the type of events (accession, pay change, re-enlist, duty station change, and discharge).
I am trying to figure out the best way to work with these records to accomplish a few different tasks, such as:
Each task requires more than 1 logical condition to be met, so it seems like my options are (1) multiple DO loops or (2) a DOW-loop approach.
Is this correct? Or is there another way to accomplish these tasks?
03-11-2016 05:06 PM
Some things will depend on the actual layout of your data. It sounds like you may be attempting to do this in one pass of the data. That might be a whole lot of work when separate tasks may make more sense.
For instance task 1 might start with
Proc sort data=have (where=(event in ("Accession" "Discharge"))) out=temp;
   by ssn date;
run;
This gets the specific events of interest into a dataset, and if there are no problems then you should have alternating accession and discharge dates. So some BY-group processing and comparison of events in sequence (the LAG function) would find problems. There is likely to be some fancy SQL approach, but since I detect military records there are likely to be individuals with multiple accessions and discharges.
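A minimal sketch of that LAG check, assuming TEMP is the sorted output from the PROC SORT above and the event values are spelled as in that WHERE clause (the output dataset name PROBLEMS is made up):

```
/* Sketch: flag rows where two consecutive events within one SSN   */
/* have the same type, i.e. the accession/discharge alternation is */
/* broken. Assumes TEMP is sorted by ssn and date.                 */
data problems;
   set temp;
   by ssn;
   prev_event = lag(event);   /* call LAG unconditionally */
   if first.ssn then prev_event = " ";
   /* a repeated event type within one person is suspect */
   if not first.ssn and event = prev_event then output;
run;
```

Note that LAG is executed on every row; conditioning the LAG call itself inside an IF would queue values inconsistently.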
Task 2 would be: 1) select records with event = "Accession" and date in the desired range (careful of multiples);
2) join or difference, by SSN, against a subset of records with event = "Discharge" and date in the desired range.
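The two steps above could be sketched in PROC SQL as follows; the date bounds are placeholders, not values from your data, and the output name TASK2 is made up:

```
/* Sketch: keep accessions in a date range, but only for SSNs that */
/* also have a discharge in that range. Dates are placeholders.    */
proc sql;
   create table task2 as
   select a.*
   from have as a
   where a.event = "Accession"
     and a.date between '01JAN2000'd and '31DEC2005'd
     and a.ssn in
         (select ssn from have
          where event = "Discharge"
            and date between '01JAN2000'd and '31DEC2005'd);
quit;
```

Flipping the subquery to `not in` would give the difference instead of the join, i.e. accessions with no matching discharge in the range.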
proc sql;
   create table want as
   select *
   from have
   where location in (<list of locations or codes>)
      or position in (<list of locations or codes>);
quit;
Look up the syntax for IN; likely you'd see something like: location in ("Fort Bragg" "Fort Hood")
Better would be a dataset of locations and/or positions, and then do a left join.
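A sketch of that left-join approach, assuming a hypothetical one-column lookup dataset LOCATIONS with a variable named location:

```
/* Sketch: LOCATIONS is an assumed lookup dataset of locations of  */
/* interest. loc_match is 1 when the record's location is in the   */
/* lookup, 0 otherwise; all records of HAVE are kept.              */
proc sql;
   create table want as
   select a.*,
          (b.location is not null) as loc_match
   from have as a
   left join locations as b
     on a.location = b.location;
quit;
```

Maintaining the list as a dataset means the criteria can change without editing code.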
03-11-2016 05:43 PM
These are good suggestions.
It seems like with the DOW loop (or more specifically, a double DOW loop - see Paper PH08-2011 "Using a Double DOW Loop to Compute Progression Free Survival" by Matthew Nizol), it is possible to subject the BY-processing groups to multiple conditional/logic statements and generate new variables (to identify records that meet all of the conditions) at the same time.
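A minimal double-DOW sketch of that idea, with made-up flag names and an example pair of conditions (accession plus discharge present) standing in for your real logic:

```
/* Sketch: first DOW loop scans the whole BY group to set flags;   */
/* second DOW loop re-reads the same group and outputs every row   */
/* with the group-level flags attached. Flag names are assumed.    */
data want;
   /* pass 1: summarize the BY group */
   do until (last.ssn);
      set have;
      by ssn;
      if event_code = "Accession" then has_accession = 1;
      if event_code = "Discharge" then has_discharge = 1;
   end;
   meets_conditions = (has_accession and has_discharge);
   /* pass 2: re-read the group and output each record */
   do until (last.ssn);
      set have;
      by ssn;
      output;
   end;
run;
```

The two SET statements read the same dataset through independent pointers, which is what lets the second loop revisit the group the first loop just summarized; the flags reset to missing automatically at the top of each implicit data step iteration, so no RETAIN bookkeeping is needed.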
If I change the data to really wide records (1 person = 1 row), I can do the same thing with arrays - but given the number of events in many of these records, working with really wide records just seems cumbersome (and chews up so much more space).
But if working with long records (1 event per row) means that I have to do multiple data steps to get to the same point as one data step with really wide records, then it might make more sense to change the data layout.
03-13-2016 03:40 PM