I have a dataset that is a big merged dataset from several big databases. I am currently "cleaning" the data, and I need to conditionally remove duplicates so that I can create sums to run statistical analyses on.
I have created a table as an example below.
PNR is a personal ID number.
Each price and date are paired. In the real dataset I have another price, let's call it price4, that is also paired with date3. Population is a number from 1-6 that tells me which comparison group each person is in.
So what I want is to make sure that all duplicates that arise from the merging are removed, but without removing the real data. Of course, this is per person (I have more than 18,000 persons in this study).
I imagined that I could do something like:
data all1;
   set all;
   if value(pnr)=value(pnr) and value(date1)=value(date1) and value(price1)=value(price1) then delete;
   if value(pnr)=value(pnr) and value(date2)=value(date2) and value(price2)=value(price2) then delete;
   if value(pnr)=value(pnr) and value(date3)=value(date3) and value(price3)=value(price3) then delete;
   if value(pnr)=value(pnr) and value(date3)=value(date3) and value(price4)=value(price4) then delete;
run;
That obviously does not work. But is there a way I can do this, preferably in a DATA step, as I still need to sum the prices in different ways for my statistical analyses (PROC GLM) afterwards? How do I do this? A sketch of the direction I imagine follows the table below.
Obs | PNR | Price1 | date1 | price2 | date2 | price3 | date3 | population |
1 | 887766 | 1500 | 1012014 | 45000 | 1012014 | 55000 | 1052018 | 1 |
2 | 887766 | 1703 | 3042015 | 45000 | 1012014 | 55000 | 1052018 | 1 |
3 | 887766 | 1703 | 3042015 | 45000 | 2012014 | 55000 | 1052018 | 1 |
4 | 887766 | 1703 | 3042015 | 45000 | 1012014 | 55000 | 1052018 | 1 |
5 | 887766 | 1703 | 3042015 | 45000 | 1012014 | 5701 | 1052018 | 1 |
6 | 887755 | 3200 | 4022018 | 245761 | 3042017 | 7001 | 2032016 | 2 |
7 | 887755 | 8700 | 4022018 | 245761 | 3042017 | 7001 | 2032016 | 2 |
8 | 887755 | 0 | 4022018 | 245761 | 3042017 | 7001 | 2032016 | 2 |
9 | 887755 | 3300 | 4022018 | 245761 | 3042017 | 7001 | 2032016 | 2 |
10 | 887755 | 3200 | 4022018 | 241 | 3052017 | 7001 | 2032016 | 2 |
11 | 887755 | 3200 | 4022018 | 245761 | 3042017 | 0 | 2032017 | 2 |
12 | 887744 | 9763 | 1012014 | 99838 | 1032018 | 5056 | 10252018 | 5 |
13 | 887744 | 9700 | 1012015 | 99838 | 1032018 | 5056 | 10252018 | 5 |
14 | 887744 | 9763 | 1012014 | 99858 | 1032018 | 5056 | 10252018 | 5 |
15 | 887744 | 9763 | 1012014 | 55 | 1072016 | 5056 | 10252018 | 5 |
16 | 887733 | 3387 | 7162015 | 45000 | 2032016 | 30050 | 12012014 | 3 |
17 | 887733 | 3377 | 7162015 | 45000 | 2032016 | 30050 | 12012014 | 3 |
18 | 887733 | 3377 | 7162015 | 789 | 2032017 | 30050 | 12012014 | 3 |
19 | 887733 | 3377 | 7162015 | 45000 | 2032016 | 30050 | 12012014 | 3 |
20 | 887733 | 3377 | 7162015 | 1700 | 12052016 | 30050 | 12012014 | 3 |
21 | 887733 | 3377 | 7162015 | 45000 | 2032016 | 308 | 12011014 | 3 |
22 | 887733 | 1500 | 7162015 | 45000 | 2032016 | 30050 | 12012014 | 3 |
23 | 887733 | 1500 | 7192015 | 45000 | 2032016 | 30050 | 12012014 | 3 |
24 | 887722 | 4352 | 7152017 | 0 | 3042018 | 47531 | 11222014 | 6 |
25 | 887722 | 4372 | 7162017 | 0 | 3042018 | 47531 | 11222014 | 6 |
26 | 887722 | 5678 | 7152017 | 0 | 3042018 | 47531 | 11222014 | 6 |
27 | 887722 | 4352 | 7152017 | 750 | 3052018 | 47531 | 11222014 | 6 |
28 | 887722 | 4352 | 7152017 | 0 | 3042018 | 4700 | 11252014 | 6 |
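Here is that sketch (on my example data, and I am not sure it is the right approach): sort and then keep only the first of each group of completely identical rows. This would catch rows that are exact copies, but not yet the pair-wise copies I describe above.
proc sort data=all;
   by pnr date1 price1 date2 price2 date3 price3;
run;

data all1;
   set all;
   by pnr date1 price1 date2 price2 date3 price3;
   /* keep only the first row of each group of completely identical rows */
   if first.price3;
run;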
In your previous thread, "Making a long merged data set broad", it was suggested several times that you provide the data as a SAS data step, and examples and instructions were given. This would speed up responses to your question tremendously, and you would receive faster and better answers.
I can't really glean from your description what counts as a duplicate. I guess the explanation is not clear enough for me.
Did you try PROC SQL with the DISTINCT keyword?
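For example, something along these lines (a sketch, assuming your merged dataset is called ALL):
proc sql;
   /* SELECT DISTINCT keeps one copy of each completely identical row */
   create table all_nodup as
   select distinct *
   from all;
quit;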
Dear Paige. I did post the data step of my example in the previous thread, and I had a good back-and-forth with somebody, but I ended up with something that I could not do an analysis on. I am really trying my best, but as my server is a secure server, there are loads of restrictions on what I can write here when it comes to my data, and I cannot practically or legally copy anything from my real dataset here, which of course presents challenges. The fact that my dataset is huge is another obstacle.
This was the data step from my example:
data all;
   infile datalines dsd truncover;
   /* dates like 01012014 are read with the MMDDYY informat */
   input PNR:$13. Price1:8. date1:mmddyy10. price2:8. date2:mmddyy10.
         price3:8. date3:mmddyy10. population:8.;
datalines4;
887766,1500, 01012014,45000,01012014,55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 55000,01052018,1
887766, 1703,03042015, 45000, 02012014, 55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 5701,01052018,1
887755,3200, 04022018,245761,03042017,7001,02032016,2
887755,8700, 04022018,245761,03042017,7001,02032016,2
887755,0, 04022018,245761,03042017,7001,02032016,2
887755,3300, 04022018,245761,03042017,7001,02032016,2
887755,3200, 04022018,241,03052017,7001,02032016,2
887755,3200, 04022018,245761,03042017,0,02032017,2
887744,9763,01012014,99838,01032018,5056,10252018,5
887744,9700,01012015,99838,01032018,5056,10252018,5
887744,9763,01012014,99858,01032018,5056,10252018,5
887744,9763,01012014,55,01072016,5056,10252018,5
887733,3387,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,789,02032017,30050,12012014,3
887733,3377,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,1700,12052016,30050,12012014,3
887733,3377,07162015,45000,02032016,308,12011014,3
887733,1500,07162015,45000,02032016,30050,12012014,3
887733,1500,07192015,45000,02032016,30050,12012014,3
887722,4352,07152017,0,03042018,47531,11222014,6
887722,4372,07162017,0,03042018,47531,11222014,6
887722,5678,07152017,0,03042018,47531,11222014,6
887722,4352,07152017,750,03052018,47531,11222014,6
887722,4352,07152017,0,03042018,4700,11252014,6
;;;;
title "All";
proc print;
run;
I have tried the PROC SQL that Mr. Bremmer suggested in the previous post. I think I might not understand it very well when there are loops in it (which is something I have never worked with before).
Maybe my problem is the way I merged the data?
data all;
   merge lmdb sssy drga drgh lpr dtsdccg dod udda;
   by pnr;
run;
The problem is that the merge has created duplicate cost values: one PNR (ID) can have the same date1 and cost1 repeated as many times as there are different values in the other variables, even though in reality the cost was only charged once. So if I create a variable that is the sum of all costs, or of the costs for one specific year (per PNR), it will be too high because of the copies introduced by the merge. That is why I want to make sure that all cost1 values tied to the same PNR and date are kept only once; otherwise they are duplicates.
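What I imagine I need, for each price/date pair separately, is something like this (just a sketch; PAIRS1 and COSTSUM1 are names I made up, and I am not sure this is right):
proc sort data=all(keep=pnr date1 price1) out=pairs1 nodupkey;
   /* one row per distinct pnr/date1/price1 combination */
   by pnr date1 price1;
run;

proc summary data=pairs1 nway;
   class pnr;
   var price1;
   output out=costsum1(drop=_type_ _freq_) sum=price1_total;
run;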
But maybe this problem can be solved already in the MERGE statement?
Hello @Jannie_D
You are the best person to select the algorithm and write the code to remove duplicates according to your business logic. You have access to, and a better understanding of, your environment and data.
I understand it is not possible to upload the full data or code from your workplace; it is the intellectual property of your employer. I can understand the stress you are under.
There is a lot in the literature on removing duplicates. This paper is a good start: https://support.sas.com/resources/papers/proceedings17/0188-2017.pdf.
I always keep my copy of The Little SAS Book around. It is the first place I look for a solution.
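The classic first stop is PROC SORT with the NODUPKEY option, for example (a sketch; HAVE and WANT are placeholder names):
proc sort data=have out=want nodupkey;
   /* BY _ALL_ keeps one row per completely identical record */
   by _all_;
run;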
@Jannie_D wrote:
I have a dataset that is a big merged dataset from several big databases. I am currently "cleaning" the data, and I need to conditionally remove duplicates so that I can create sums to run statistical analyses on.
I have created a table as an example below.
I would suggest working with subsets from all of the "several big databases" to develop the merge process so that it prevents the duplication in the first place.
A cleaned-up version will likely be more efficient in execution as well as free of duplicates. Fixing things afterward means you are wasting resources, time, and disk space.
Break the merge into small steps, examine the result of each step for duplicates (one way to check is sketched below), then fix the code for that step to prevent duplicates. Then add in the next of the "several" databases. Rinse, repeat.