Desktop productivity for business analysts and programmers

Re: Identifying persons missing and reappearing in longitudinal data

Reply
N/A
Posts: 0

Re: Identifying persons missing and reappearing in longitudinal data

I would like to identify the persons who have discontinued between 4 years of longtidutinal data

ID Wave1 Wave2 Wave3 Wave4
1 Australia Japan US UK
2 Australia Australia NA Australia
3 NA Australia Australia NA
4 NotJoin NotJoin NotJoin Australia
5 Australia NA NA US
6 Australia NA NA Japan

For example, persons with ID=1 Stay in Australia in First survey, Japan in Second Survey, US in 3rd Survey, UK in 4th Survey

However, ID=2 person joins in 1st Survey, but missing in 3rd Survey, reappear in 4th Survey

ID=3 persons started joining by 2nd Survey, ,maybe missing or die in 4th Survey

ID=4 just started to join by 4th Survey

Question:
In this case, I would like to identify the persons with ID=2 , 5 and 6


Any good advice in identifying the persons who miss out in between Surveys ? How to use program language to identify a database of 10000 persons ?
SAS Super FREQ
Posts: 8,820

Re: Identifying persons missing and reappearing in longitudinal data

Hi!
How you approach this problem really depends on whether all you want to do is identify the people or do further analysis with the subgroups that you identify. It also depends on your comfort level with SAS data step programming. Inside EG, you will have to open a code node to write your data step program. But before you write your program, you'll have to figure out the general approach.

For example, one approach might be to create a numeric variable that counts the number of years/waves that each person was missing. So for example, let's say you call the variable "missyr" and initially it is set to 0. If, for example, their observation for the wave values was:
[pre]
Australia Australia Japan Japan
0 + 0 + 0 + 0
[/pre]
then their value for missyr is 0 (because all 4 zeroes add up to 0)
but if their observation was:
[pre]
NotJoin Japan Japan NA
1 + 0 + 0 + 1
[/pre]
then their value for missyr = 2 (because all 4 values add up to 2).

Another approach is to set a flag about their status for every year (call these new variables "stat1", "stat2", "stat3" and "stat4" to match the 4 possibilities wave1 thru wave4):
[pre]
- means they were in a country
A means they were marked NA
J means they were marked NotJoin
[/pre]
Then, make a new variable that is essentially a "visual map" or "visual code" of all their years (call this variable "allstat", so when) the 4 are brought together:
[pre]
---- means they participated all 4 years
J--A means they were NotJoin in wave1, present in waves 2 and 3 and marked NA in wave4
-A-- means they were present in the wave1, NA in wave2, present in waves 3 and 4
[/pre]
There are several ways to create the various status flags and the "map" of their status across all the waves. I'd probably create a user-defined format and concatenate the individual status flags together to make this new variable. Then, once I had a value for the numeric value missyr and/or the status flags, I could run a series of PROC PRINTs to test the values as shown:
[pre]
proc format;
value $statfmt 'NA' = 'A'
'NotJoin' = 'J'
Other = '-';
run;

data wavedata;
drop i;
length allstat $4 stat1-stat4 $1;
** assumes vars from dataset are ID and wave1-wave4;
** define arrays for use;
array warr $ wave1-wave4;
array stat $ stat1-stat4;
/* set lib.origwave; <-- your SET stmt goes here */
missyr=0;
do i = 1 to 4;
** set a count for number of years missing;
if warr(i) in ("NA", "NotJoin") then missyr + 1;

** use the format to make each separate status flag;
stat(i) = put(warr(i),$statfmt.);
end;
** concatenate all 4 values for status together;
allstat=stat1||stat2||stat3||stat4;
run;

proc print data=wavedata;
title 'get any NotJoins';
where allstat contains 'J';
run;

proc print data=wavedata;
title 'get any NA folks';
where allstat contains 'A';
run;

proc print data=wavedata;
title 'get ONLY NA folks';
where allstat contains 'A' AND
allstat not contains 'J';
run;


proc print data=wavedata;
title 'get ONLY NotJoin folks';
where allstat contains 'J' AND
allstat not contains 'A';
run;

proc print data=wavedata;
title 'participated all 4 years';
where missyr = 0;
run;

proc print data=wavedata;
title 'present first year, but missed in other year';
where stat1 = '-' and missyr gt 0 ;
run;

proc print data=wavedata;
title 'present 2 consecutive years or 3 consec years';
where allstat contains '--' or allstat contains '---' ;
run;

proc print data=wavedata;
title 'combo NotJoin and NA';
where allstat contains 'J' and allstat contains 'A' ;
run;
[/pre]
The Proc Prints are pretty easy, but they require that you write a data step program to "massage" your data and create the various flags for testing. And then depending on the other kind of analysis you want to do, you have to decide whether to set the flags every time you need to run some kind of analysis or to make a permanent copy of the original data, but with these new variables added.

These are questions that only you can answer, depending on what further analysis you need to do. If all you need to do is identify the people for a one time report, then this approach may be overkill. But, one reason I use flags like the "allstat" flag or "missyr" or the "stat1" through "stat4" flags is that it simplifies what could become a very complicated set of WHERE conditions or IF conditions.

To understand what this program is doing, you have to read in the documentation about PROC FORMAT, DATA step programming, the ARRAY statement, the PUT function, the CONCATENATE operator, DO Loops, IF statements, PROC PRINT and WHERE statements. There is extensive information about all of these SAS language elements in the SAS documentation and SAS Technical Support is always a big help in figuring out usage issues.

Good luck with your analysis,
cynthia
Ask a Question
Discussion stats
  • 1 reply
  • 107 views
  • 0 likes
  • 2 in conversation