12-22-2016 06:31 AM
I'm fairly new to SAS so not sure if something like this is possible even.
I'm working with a dataset of over 1,000,000 individuals. I've come to notice that I have duplicate individuals within my dataset, however, the information which is contained within the variables are different. I wonder thus is it possible to take the second reference of an individual and create x number of new variables with that individual.
Thus I would no longer have dupilicate individuals but instead extra variables for a certain individual.
I've attached an image of what I'm trying to do which prehaps explains things better than what I've written.
Any help would be greatly appreciated.
12-22-2016 06:51 AM
You can absolutely do this in SAS, but requires som e data step programming. If you haven't acquired that skill yet, I strongly recommend that you take the frees Programming 1 online training.
Also, creating data structures like this is typically awkward. it requires that the analysis/data management steps to follow to take those anomalies into account. So i would strive for to clean the data "once and for all" - determine the golden record/fields.
That said, use a data step with BY, RETAIN and conditional assignment statements. Also, you need an explicit OUTPUT statement at the end of the BY group (hint: last.)
12-22-2016 07:04 AM
Are you sure you only have duplicates and not triplicates or other multiple variations?
How useful are the repetive records/variables?
Otherwise consider marking the first recorded so you can easily filter it out in future processes, but leaving your data structure the same otherwise.
12-22-2016 07:07 AM
Try next code assumed data sorted by ID and there are x1-x4 variables all of same type
either numeric or alphanumeric;
set have(rename=(x1=v1 x2=v2 x3=v3 x4=v4));
reatain i x1 - x8;
array vn v1=v4;
array xn x1-x8; /* x5-x8 instead x1a-x4a */
if first.id then i=1;
do j=1 to 4;
xv(i) = vn(j);
i+1; /* prepared for next row of same ID */
if last.id then output;
keep id x1-x8;
12-22-2016 07:12 AM
To add to the great advice above, what I would do in this situation is to normalise the data. To do this you would go from:
ID X1 X2 X3 X4
1 100 8 6 66
1 101 112 4 6
ID PARAMETER RESULT
1 X1 100
1 X2 8
1 X3 6
1 X4 6
1 X1 101
proc transpose data=have out=want; by id; var x1--x4; run;
It is then a very simple task to find duplicates within an id:
proc sort data=have out=want dupout=duplicates nodupkey; by id parameter result; run;
This will give you a dataset with unique id/param/result records. Now how you go about shrinking that down to unique id/param is down to th logic you want to apply, maybe it is the highest value, the min, the average, the first etc. You haven't provided this info so I can't say.
Once you have the unique id/param, you can then transpose the data up again to get a final dataset
12-22-2016 08:45 AM
I would suggest two steps that are somewhat different than what you are suggesting. First, find out the extent of the problem. That can be done using:
proc freq data=have;
tables individual / noprint out=counts;
proc freq data=counts;
The table produced by the second PROC FREQ shows you how many individuals have 1 observation in your data set, how many have 2 observations, etc.
Secondly, instead of creating extra variables, dump the duplicates into a second data set until you figure out what you want to do with them. For example:
proc sort data=have;
data first_one extras;
if first.individual then output first_one;
else output extras;
You can easily keep the duplicate information (separately) without needing to create any extra variables. Once you figure out the strategy for dealing with duplicates, you can plan a more detailed program.
12-22-2016 10:09 AM
to check the dupliactes better do:
if not (first.id and last.id);
then observe dup dataset and dceide how to proceed;
12-22-2016 06:31 PM
That sounds like a data quality issue. Or do you have versions in your table? Is there something like a version_nr or date variable(s) in your data?
From a data modelling perspective I'd go for a two table approach. Store the best quality data or latest version in your main table, store all other versions in a second table. This way you can have as many versions as you like.