Working with longitudinal data to do by-group processing with multiple logical conditions?

Reply
Contributor SM1
Contributor
Posts: 30

Working with longitudinal data to do by-group processing with multiple logical conditions?

Hi everyone,

 

I am working with personnel records for more than 3 million people over a 30 year period of time. The records are formatted as 1 event per person per row, so some people have as few as 3 or 4 rows, while other people have as many as 120+ rows. There are 108 variables.

 

SSN is the primary BY-group processing variable. Across an individual’s records, event_code is used to distinguish the type of events (accession, pay change, re-enlist, duty station change, and discharge).

 

I am trying to figure out the best way to work with these records to accomplish a few different tasks, such as:

  1. QA checks (e.g., by SSN, examine records to confirm the accession date was before the discharge date)
  2. Identify working cohorts (e.g., by SSN, select those individuals with a valid accession date between 1980 – 1985 and no discharge date until 1986 or later);
  3. Identify individuals who ever worked in a specific location or ever had a specific job position (both location and position can change multiple time across an individual’s career).

Each task requires more than 1 logical condition to be met, so it seems like my options are (1) multiple DO Loops and (2) a DoW Loop approach.

 

Is this correct? Or is there another way to accomplish these tasks?

 

Thanks!

Grand Advisor
Posts: 10,204

Re: Working with longitudinal data to do by-group processing with multiple logical conditions?

Some things will depend on the actual layout of your data. It sounds like you may be attempting to do this in one pass of the data. That might be a whole lot of work when separate tasks may make more sense.

 

For instance task 1 might start with

Proc sort data=have (where=(event  in ("Accession" "Discharge")) out=temp;

by ssn date;

run;

 

This gets the specific events of interest into dataset and if there are no problems then you should have alternating accession and discharge dates. So some by group processing and comparison of events in sequence (lag function ) would find problems. There is like to be some fancy SQL approach but since I detect military records there are likely to be individuals with multiple accession and discharge.

 

Task 2 would be 1) select records with event =accession and date in the desired range (careful of multiples)

                            2) join or difference on a subset by SSN of events=discharge and date in the desired range

                          

Task 3

proc sql;

   create table want as

   select *

   from have

   where location in (<list of locations or code>) or position in (<list of locations or codes>);

quit;

 

Look up the syntax for IN , likely you'd see something like: Location in ("Fort Bragg" "Fort Hood")

 

Better would be a dataset of locations and/or positions and then do a Left join.

 

 

 

Contributor SM1
Contributor
Posts: 30

Re: Working with longitudinal data to do by-group processing with multiple logical conditions?

These are good suggestions.

 

It seems like with the DOW loop (or more specifically, a double DOW loop - see Paper PH08-2011 "Using a Double DOW Loop to Compute Progression Free Survival" by Matthew Nizol), it is possible to subject the BY-processing groups to multiple conditional/logic statements and generate new variables (to identify records that meet all of the conditions) at the same time.

 

If I change the date to really wide records (1 patient = 1 row), I can do the same thing with arrays - but given the number of events in many of these records, working with really wide records just seems cumbersome (and chews up so much more space).

 

But if working with long records (1 event per row) means that I have to do multiple data steps to get to the same point as one data step with really wide records, then it might make more sense to change the data layout.

Grand Advisor
Posts: 17,320

Re: Working with longitudinal data to do by-group processing with multiple logical conditions?

2/3 are straightforward in a single data query in my opinion.

For 1, because you're looking for specific states in comparison it does lend itself to a wide format. Maintaining the data in that format long term isn't the same as using that form for your data verifications though.
Ask a Question
Discussion stats
  • 3 replies
  • 271 views
  • 0 likes
  • 3 in conversation