BookmarkSubscribeRSS Feed
Sean_OConnor
Fluorite | Level 6

Folks,

 

I'm fairly new to SAS so not sure if something like this is possible even. 

I'm working with a dataset of over 1,000,000 individuals. I've come to notice that I have duplicate individuals within my dataset, however, the information which is contained within the variables are different. I wonder thus is it possible to take the second reference of an individual and create x number of new variables with that individual.

 

Thus I would no longer have dupilicate individuals but instead extra variables for a certain individual. 

 

I've attached an image of what I'm trying to do which prehaps explains things better than what I've written.

 

Any help would be greatly appreciated. 


Example.PNG
7 REPLIES 7
LinusH
Tourmaline | Level 20

You can absolutely do this in SAS, but requires som e data step programming. If you haven't acquired that skill yet, I strongly recommend that you take the frees Programming 1 online training.

 

Also, creating data structures like this is typically awkward. it requires that the analysis/data management steps to follow to take those anomalies into account. So i would strive for to clean the data "once and for all" - determine the golden record/fields.

 

That said, use a data step with BY, RETAIN and conditional assignment statements. Also, you need an explicit OUTPUT statement at the end of the BY group (hint: last.)

Data never sleeps
Reeza
Super User

Are you sure you only have duplicates and not triplicates or other multiple variations? 

 

How useful are the repetive records/variables? 

Otherwise consider marking the first recorded so you can easily filter it out in future processes, but leaving your data structure the same otherwise. 

Shmuel
Garnet | Level 18

Try next code assumed data sorted by ID and there are x1-x4 variables all of same type

either numeric or alphanumeric;

 

data want;

 set have(rename=(x1=v1 x2=v2 x3=v3 x4=v4));

   by ID;

       reatain  i x1 - x8;

       array vn v1=v4;

       array xn x1-x8;   /* x5-x8 instead x1a-x4a */

       if first.id then i=1;

       do j=1 to 4;

            xv(i) = vn(j);

            i+1;

      end;

      i+1;   /* prepared for next row of same ID */

     if last.id then output;

     keep id x1-x8;

run;

        

RW9
Diamond | Level 26 RW9
Diamond | Level 26

To add to the great advice above, what I would do in this situation is to normalise the data.  To do this you would go from:
ID    X1    X2   X3   X4

1     100   8     6     66

1     101   112  4     6

 

To this:

ID   PARAMETER    RESULT

1     X1                   100

1     X2                   8

1     X3                   6

1     X4                   6 

1     X1                   101

...

 

Something like:

proc transpose data=have out=want;
  by id;
  var x1--x4;
run;

 

It is then a very simple task to find duplicates within an id:

proc sort data=have out=want dupout=duplicates nodupkey;
  by id parameter result;
run;

This will give you a dataset with unique id/param/result records.  Now how you go about shrinking that down to unique id/param is down to th logic you want to apply, maybe it is the highest value, the min, the average, the first etc.  You haven't provided this info so I can't say.   

Once you have the unique id/param, you can then transpose the data up again to get a final dataset

Astounding
PROC Star

I would suggest two steps that are somewhat different than what you are suggesting.  First, find out the extent of the problem.  That can be done using:

 

proc freq data=have;

tables individual / noprint out=counts;

run;

proc freq data=counts;

tables count;

run;

 

The table produced by the second PROC FREQ shows you how many individuals have 1 observation in your data set, how many have 2 observations, etc.

 

Secondly, instead of creating extra variables, dump the duplicates into a second data set until you figure out what you want to do with them.  For example:

 

proc sort data=have;

   by individual;

run;

 

data first_one extras;

   set have;

   by individual;

   if first.individual then output first_one;

   else output extras;

run;

 

You can easily keep the duplicate information (separately) without needing to create any extra variables.  Once you figure out the strategy for dealing with duplicates, you can plan a more detailed program.

Shmuel
Garnet | Level 18

to check the dupliactes better do:

 

data dup;

 set have;

  by Id;

      if not (first.id and last.id);

run;

 

then observe dup dataset and dceide how to proceed;

Patrick
Opal | Level 21

@Sean_OConnor

That sounds like a data quality issue. Or do you have versions in your table? Is there something like a version_nr or date variable(s) in your data?

 

 

From a data modelling perspective I'd go for a two table approach. Store the best quality data or latest version in your main table, store all other versions in a second table. This way you can have as many versions as you like.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 899 views
  • 1 like
  • 7 in conversation