Re: Checking whether there is a new obs within a group

braam · Posted 08-07-2020 04:17 PM

Is there a way to check whether there is a new person in my grouped data?

HospitalID PersonID Year

1 1 2005

1 2 2005

1 3 2005

1 1 2006

1 2 2006

1 3 2006

1 4 2006

2 5 2005

2 6 2005

2 7 2005

2 5 2006

2 6 2006

2 8 2006

3 9 2005

3 10 2005

3 9 2006

3 10 2006

As shown above, there were 1, 2, and 3 in hospital A in year 2005. There is a new person, 4, in year 2006. In this case, I would like to give a value of 1 to observations of hospital 1 and year 2006 (i.e., 1 for the observations of groups where there is a new person).

In hospital 2, there are 5, 6, and 7 in year 2005. But next year, 7 left while 8 joined. In this case, I would like to assign missing to the observations of this group (i.e., hospital 2 - year 2006).

In hospital 3, there are 9 and 10 in years 2005 and 2006. For the observations of this group, I would like to assign 0 to the observations of this group.

The resulting dataset would be either at hospital-year level or at hospital-person-yearl level, whichever is OK for me. Is there a simple way to make this dataset? Thanks in advance!

novinosrin · Posted 08-07-2020 04:30 PM

HI @braam In addition to the sentences that explains the logic, can you please post the expected output for the input sample posted. Thank you!

ballardw · Posted 08-07-2020 05:11 PM

An important question is how do you know which are the "old" observations?

Do you have a data set with the old groups in it somewhere?

braam · Posted 08-07-2020 05:26 PM

I meant, old observations are those in the previous year.
Let's say, observations in year 2005 serve as "old" observations for those in year 2006.

ballardw · Posted 08-07-2020 05:40 PM

@braam wrote:
I meant, old observations are those in the previous year.
Let's say, observations in year 2005 serve as "old" observations for those in year 2006.

Previous to what? What in the data tells what is a "new" obs?

Listing an example does not provide a rule. You know your data, presumably, and so can tell. But we don't and to program a process we need to know the rule(s) involved that tell us what is new.

Since all of the example years you have shown so far are more than 10 years old compared to the current date everything looks "old".

So, are you talking about the latest year in the data for a hospital is "new"? or Hospital PersonID the latest year is the "new"?

Or is a specific year, one that you designate in the code for example, to be considered the "new" year?

braam · Posted 08-07-2020 05:25 PM

Thanks for your reply. I added some blank lines between different hospitals to read it easily.

HospitalID PersonID Year

1 1 2005

1 2 2005

1 3 2005

1 1 2006

1 2 2006

1 3 2006

1 4 2006

2 5 2005

2 6 2005

2 7 2005

2 5 2006

2 6 2006

2 8 2006

3 9 2005

3 10 2005

3 9 2006

3 10 2006

My expected output is:

HopsitalID Year NewPerson

1 2005 .

1 2006 1

2 2005 .

2 2006 .

3 2005 .

3 2006 0

For the first hospital, there is a clearly new person in year 2006. So 1 would be assigned to NewPerson in year 2006.

For the second hospital, there is a new person coming in and another outgoing. So missing would be assigned to NewPerson in year 2006.

For the last hospital, there is no change. So 0 would be assigned to NewPerson in 2006.

Just so you know, I have a much bigger panel data. Thanks again.

novinosrin · Posted 08-07-2020 07:41 PM

So can we assume the following the rules-

1. If there's a change i.e. an addition of 1 or more patients in subsequent years without any drop in existing patients, then NewPerson flag is 1

2. If there's a change whereby the situation reflects drop in some existing patients and concurrently there are some additional new patients, then NewPerson flag is missing

3. If there's NO change, then NewPerson flag is 0

Can you please review and let us know all scenarios to account for?

braam · Posted 08-08-2020 02:50 AM

Thanks for your clarification. Your description is what I'm looking for.

novinosrin · Posted 08-08-2020 03:17 PM

Hi @braam Your clarification helps. Thanks! Please try the below and see if it works-


data have;
input HospitalID PersonID Year;
cards;
1 1 2005

1 2 2005

1 3 2005

1 1 2006

1 2 2006

1 3 2006

1 4 2006

2 5 2005

2 6 2005

2 7 2005

2 5 2006

2 6 2006

2 8 2006

3 9 2005 

3 10 2005

3 9 2006 

3 10 2006
;

data want;
 if _n_=1 then do;
  dcl hash H () ;
  h.definekey  ("PersonID") ;
  h.definedone () ;
 end;
 do _n_=1 by 1 until(last.HospitalID);
  h_items=h.num_items;
  do until(last.year);
   set have;
   by HospitalID year;
   if _n_>1 then if  h.check()=0 then h_items=h_items-1;
   else NewPerson=1;
   h.replace();
  end;
  if _n_>1 then if  h_items then NewPerson=.;
  else if NewPerson=. then NewPerson=0; 
  output;
 end;
 h.clear();
 keep HospitalID year NewPerson;
run;

mkeintz · Posted 08-08-2020 12:40 AM

Let's restate the objective, to: For each hospital create 3 variables N_KEPT (number of persons in both 2005 and 2006), N_ADDED (Number in 2006 and not in 2005), and N_DROPPED (you can guess).

And it just so happens I have suggested code for my suggested redefinition of the task:

data want (keep=hospitalid n_: );
  merge have (where=(year=2005) in=in2005)
        have (where=(year=2006) in=in2006);
  by hospitalid personid;

  if in2005=in2006 then n_kept+1;
  else if in2006=0 then n_dropped+1;
  else if in2005=0 then n_added+1;

  if last.hospitalid;
  output;
  call missing(of n_:);
run;

This program assumes you data are sorted by hospitalid/year/personid.

The reason there is a call missing is to clear out the accumulations of N_dropped, N_added, N_kept at the end of one hospital, before starting the next hospital. And because there is an explicit call missing of these news variables, it has to be preceded by an explicit OUTPUT statement. Otherwise the implicit output action just before the RUN statement would be outputting missing values instead of the calculated totals.

Question: do you really just have the same two years for all hospitals?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Classroom Training Available!