Desktop productivity for business analysts and programmers

Duplicate individuals information added to first individual reference

Reply
Contributor
Posts: 35

Duplicate individuals information added to first individual reference

Folks,

 

I'm fairly new to SAS so not sure if something like this is possible even. 

I'm working with a dataset of over 1,000,000 individuals. I've come to notice that I have duplicate individuals within my dataset, however, the information which is contained within the variables are different. I wonder thus is it possible to take the second reference of an individual and create x number of new variables with that individual.

 

Thus I would no longer have dupilicate individuals but instead extra variables for a certain individual. 

 

I've attached an image of what I'm trying to do which prehaps explains things better than what I've written.

 

Any help would be greatly appreciated. 

Attachment
Esteemed Advisor
Posts: 5,199

Re: Duplicate individuals information added to first individual reference

You can absolutely do this in SAS, but requires som e data step programming. If you haven't acquired that skill yet, I strongly recommend that you take the frees Programming 1 online training.

 

Also, creating data structures like this is typically awkward. it requires that the analysis/data management steps to follow to take those anomalies into account. So i would strive for to clean the data "once and for all" - determine the golden record/fields.

 

That said, use a data step with BY, RETAIN and conditional assignment statements. Also, you need an explicit OUTPUT statement at the end of the BY group (hint: last.)

Data never sleeps
Grand Advisor
Posts: 17,427

Re: Duplicate individuals information added to first individual reference

Are you sure you only have duplicates and not triplicates or other multiple variations? 

 

How useful are the repetive records/variables? 

Otherwise consider marking the first recorded so you can easily filter it out in future processes, but leaving your data structure the same otherwise. 

Super User
Posts: 1,244

Re: Duplicate individuals information added to first individual reference

Try next code assumed data sorted by ID and there are x1-x4 variables all of same type

either numeric or alphanumeric;

 

data want;

 set have(rename=(x1=v1 x2=v2 x3=v3 x4=v4));

   by ID;

       reatain  i x1 - x8;

       array vn v1=v4;

       array xn x1-x8;   /* x5-x8 instead x1a-x4a */

       if first.id then i=1;

       do j=1 to 4;

            xv(i) = vn(j);

            i+1;

      end;

      i+1;   /* prepared for next row of same ID */

     if last.id then output;

     keep id x1-x8;

run;

        

Esteemed Advisor
Esteemed Advisor
Posts: 7,249

Re: Duplicate individuals information added to first individual reference

To add to the great advice above, what I would do in this situation is to normalise the data.  To do this you would go from:
ID    X1    X2   X3   X4

1     100   8     6     66

1     101   112  4     6

 

To this:

ID   PARAMETER    RESULT

1     X1                   100

1     X2                   8

1     X3                   6

1     X4                   6 

1     X1                   101

...

 

Something like:

proc transpose data=have out=want;
  by id;
  var x1--x4;
run;

 

It is then a very simple task to find duplicates within an id:

proc sort data=have out=want dupout=duplicates nodupkey;
  by id parameter result;
run;

This will give you a dataset with unique id/param/result records.  Now how you go about shrinking that down to unique id/param is down to th logic you want to apply, maybe it is the highest value, the min, the average, the first etc.  You haven't provided this info so I can't say.   

Once you have the unique id/param, you can then transpose the data up again to get a final dataset

Respected Advisor
Posts: 4,998

Re: Duplicate individuals information added to first individual reference

I would suggest two steps that are somewhat different than what you are suggesting.  First, find out the extent of the problem.  That can be done using:

 

proc freq data=have;

tables individual / noprint out=counts;

run;

proc freq data=counts;

tables count;

run;

 

The table produced by the second PROC FREQ shows you how many individuals have 1 observation in your data set, how many have 2 observations, etc.

 

Secondly, instead of creating extra variables, dump the duplicates into a second data set until you figure out what you want to do with them.  For example:

 

proc sort data=have;

   by individual;

run;

 

data first_one extras;

   set have;

   by individual;

   if first.individual then output first_one;

   else output extras;

run;

 

You can easily keep the duplicate information (separately) without needing to create any extra variables.  Once you figure out the strategy for dealing with duplicates, you can plan a more detailed program.

Super User
Posts: 1,244

Re: Duplicate individuals information added to first individual reference

to check the dupliactes better do:

 

data dup;

 set have;

  by Id;

      if not (first.id and last.id);

run;

 

then observe dup dataset and dceide how to proceed;

Respected Advisor
Posts: 3,840

Re: Duplicate individuals information added to first individual reference

@Sean_OConnor

That sounds like a data quality issue. Or do you have versions in your table? Is there something like a version_nr or date variable(s) in your data?

 

 

From a data modelling perspective I'd go for a two table approach. Store the best quality data or latest version in your main table, store all other versions in a second table. This way you can have as many versions as you like.

Ask a Question
Discussion stats
  • 7 replies
  • 260 views
  • 1 like
  • 7 in conversation