I have a dataset that is a big merged dataset from several big databases. I am currently "cleaning" the data, and I need to conditionally remove duplicates so that I can create sums to run statistical analyses on.
I have created a table as an example below.
PNR is a personal ID number.
Each price and date are paired. In the real dataset I have another price, let's call it price4, that is also paired with date3. Population is a number from 1-6 that tells me which comparison group each person is in.
So what I want is to make sure that all duplicates that arise from the merging are removed, but without removing the real data. Of course, this is per person (I have more than 18,000 persons in this study).
I imagined that I could do something like:
data all1;
   set all;
   if value(pnr)=value(pnr) and value(date1)=value(date1) and value(price1)=value(price1) then delete;
   if value(pnr)=value(pnr) and value(date2)=value(date2) and value(price2)=value(price2) then delete;
   if value(pnr)=value(pnr) and value(date3)=value(date3) and value(price3)=value(price3) then delete;
   if value(pnr)=value(pnr) and value(date3)=value(date3) and value(price4)=value(price4) then delete;
run;
That obviously does not work. But is there a way I can do this, preferably in a DATA step, as I still need to sum the prices in different ways for my statistical analyses (PROC GLM) afterwards? How do I do this? A sketch of the direction I imagine follows the table below.
Obs | PNR | Price1 | date1 | price2 | date2 | price3 | date3 | population |
1 | 887766 | 1500 | 1012014 | 45000 | 1012014 | 55000 | 1052018 | 1 |
2 | 887766 | 1703 | 3042015 | 45000 | 1012014 | 55000 | 1052018 | 1 |
3 | 887766 | 1703 | 3042015 | 45000 | 2012014 | 55000 | 1052018 | 1 |
4 | 887766 | 1703 | 3042015 | 45000 | 1012014 | 55000 | 1052018 | 1 |
5 | 887766 | 1703 | 3042015 | 45000 | 1012014 | 5701 | 1052018 | 1 |
6 | 887755 | 3200 | 4022018 | 245761 | 3042017 | 7001 | 2032016 | 2 |
7 | 887755 | 8700 | 4022018 | 245761 | 3042017 | 7001 | 2032016 | 2 |
8 | 887755 | 0 | 4022018 | 245761 | 3042017 | 7001 | 2032016 | 2 |
9 | 887755 | 3300 | 4022018 | 245761 | 3042017 | 7001 | 2032016 | 2 |
10 | 887755 | 3200 | 4022018 | 241 | 3052017 | 7001 | 2032016 | 2 |
11 | 887755 | 3200 | 4022018 | 245761 | 3042017 | 0 | 2032017 | 2 |
12 | 887744 | 9763 | 1012014 | 99838 | 1032018 | 5056 | 10252018 | 5 |
13 | 887744 | 9700 | 1012015 | 99838 | 1032018 | 5056 | 10252018 | 5 |
14 | 887744 | 9763 | 1012014 | 99858 | 1032018 | 5056 | 10252018 | 5 |
15 | 887744 | 9763 | 1012014 | 55 | 1072016 | 5056 | 10252018 | 5 |
16 | 887733 | 3387 | 7162015 | 45000 | 2032016 | 30050 | 12012014 | 3 |
17 | 887733 | 3377 | 7162015 | 45000 | 2032016 | 30050 | 12012014 | 3 |
18 | 887733 | 3377 | 7162015 | 789 | 2032017 | 30050 | 12012014 | 3 |
19 | 887733 | 3377 | 7162015 | 45000 | 2032016 | 30050 | 12012014 | 3 |
20 | 887733 | 3377 | 7162015 | 1700 | 12052016 | 30050 | 12012014 | 3 |
21 | 887733 | 3377 | 7162015 | 45000 | 2032016 | 308 | 12011014 | 3 |
22 | 887733 | 1500 | 7162015 | 45000 | 2032016 | 30050 | 12012014 | 3 |
23 | 887733 | 1500 | 7192015 | 45000 | 2032016 | 30050 | 12012014 | 3 |
24 | 887722 | 4352 | 7152017 | 0 | 3042018 | 47531 | 11222014 | 6 |
25 | 887722 | 4372 | 7162017 | 0 | 3042018 | 47531 | 11222014 | 6 |
26 | 887722 | 5678 | 7152017 | 0 | 3042018 | 47531 | 11222014 | 6 |
27 | 887722 | 4352 | 7152017 | 750 | 3052018 | 47531 | 11222014 | 6 |
28 | 887722 | 4352 | 7152017 | 0 | 3042018 | 4700 | 11252014 | 6 |
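Here is that sketch (on my example data, and I am not sure it is the right approach): sort and then keep only the first of each group of completely identical rows. This would catch rows that are exact copies, but not yet the pair-wise copies I describe above.
proc sort data=all;
   by pnr date1 price1 date2 price2 date3 price3;
run;

data all1;
   set all;
   by pnr date1 price1 date2 price2 date3 price3;
   /* keep only the first row of each group of completely identical rows */
   if first.price3;
run;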
In your previous thread, "Making a long merged data set broad", it was suggested several times that you provide the data as a SAS data step, and examples and instructions were given. This would speed up responses to your question tremendously, and you would receive faster and better answers.
I can't really glean from your description what counts as a duplicate. I guess the explanation is not clear enough for me.
Did you try PROC SQL with the DISTINCT keyword?
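For example, something along these lines (a sketch, assuming your merged dataset is called ALL):
proc sql;
   /* SELECT DISTINCT keeps one copy of each completely identical row */
   create table all_nodup as
   select distinct *
   from all;
quit;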
Dear Paige. I did post the data step of my example in the previous thread, and I had a good back-and-forth with somebody, but I ended up with something that I could not do an analysis on. I am really trying my best, but as my server is a secure server, there are loads of restrictions on what I can write here when it comes to my data, and I cannot practically or legally copy anything from my real dataset here, which of course presents challenges. The fact that my dataset is huge is another obstacle.
This was the data step from my example:
data all;
   infile datalines dsd truncover;
   /* dates like 01012014 are read with the MMDDYY informat */
   input PNR:$13. Price1:8. date1:mmddyy10. price2:8. date2:mmddyy10.
         price3:8. date3:mmddyy10. population:8.;
datalines4;
887766,1500, 01012014,45000,01012014,55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 55000,01052018,1
887766, 1703,03042015, 45000, 02012014, 55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 5701,01052018,1
887755,3200, 04022018,245761,03042017,7001,02032016,2
887755,8700, 04022018,245761,03042017,7001,02032016,2
887755,0, 04022018,245761,03042017,7001,02032016,2
887755,3300, 04022018,245761,03042017,7001,02032016,2
887755,3200, 04022018,241,03052017,7001,02032016,2
887755,3200, 04022018,245761,03042017,0,02032017,2
887744,9763,01012014,99838,01032018,5056,10252018,5
887744,9700,01012015,99838,01032018,5056,10252018,5
887744,9763,01012014,99858,01032018,5056,10252018,5
887744,9763,01012014,55,01072016,5056,10252018,5
887733,3387,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,789,02032017,30050,12012014,3
887733,3377,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,1700,12052016,30050,12012014,3
887733,3377,07162015,45000,02032016,308,12011014,3
887733,1500,07162015,45000,02032016,30050,12012014,3
887733,1500,07192015,45000,02032016,30050,12012014,3
887722,4352,07152017,0,03042018,47531,11222014,6
887722,4372,07162017,0,03042018,47531,11222014,6
887722,5678,07152017,0,03042018,47531,11222014,6
887722,4352,07152017,750,03052018,47531,11222014,6
887722,4352,07152017,0,03042018,4700,11252014,6
;;;;
title "All";
proc print;
run;
I have tried the PROC SQL that Mr. Bremmer suggested in the previous post. I think I might not understand it very well when there are loops in it (which is something I have never worked with before).
Maybe my problem is the way I merged the data?
data all;
   merge lmdb sssy drga drgh lpr dtsdccg dod udda;
   by pnr;
run;
The problem is that the merge has created duplicate cost values: one PNR (ID) can have the same date1 and cost1 repeated as many times as there are different values in the other variables, even though in reality the cost was only charged once. So if I create a variable that is the sum of all costs, or of the costs for one specific year (per PNR), it will be too high because of the copies introduced by the merge. That is why I want to make sure that all cost1 values tied to the same PNR and date are kept only once; otherwise they are duplicates.
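What I imagine I need, for each price/date pair separately, is something like this (just a sketch; PAIRS1 and COSTSUM1 are names I made up, and I am not sure this is right):
proc sort data=all(keep=pnr date1 price1) out=pairs1 nodupkey;
   /* one row per distinct pnr/date1/price1 combination */
   by pnr date1 price1;
run;

proc summary data=pairs1 nway;
   class pnr;
   var price1;
   output out=costsum1(drop=_type_ _freq_) sum=price1_total;
run;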
But maybe this problem can be solved already in the MERGE statement?
Hello @Jannie_D
You are the best person to select the algorithm and write the code to remove duplicates according to your business logic. You have access to, and a better understanding of, your environment and data.
I understand it is not possible to upload the full data or code from your workplace; it is the intellectual property of your employer. I can understand the stress you are under.
There is a lot in the literature on removing duplicates. This paper is a good start: https://support.sas.com/resources/papers/proceedings17/0188-2017.pdf.
I always keep my copy of The Little SAS Book around. It is the first place I look for a solution.
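The classic first stop is PROC SORT with the NODUPKEY option, for example (a sketch; HAVE and WANT are placeholder names):
proc sort data=have out=want nodupkey;
   /* BY _ALL_ keeps one row per completely identical record */
   by _all_;
run;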
@Jannie_D wrote:
I have a dataset that is a big merged dataset from several big databases. I am currently "cleaning" the data, and I need to conditionally remove duplicates so that I can create sums to run statistical analyses on.
I have created a table as an example below.
I would suggest working with subsets from all of the "several big databases" to develop the merge process so that it prevents the duplication in the first place.
A cleaned-up version will likely be more efficient in execution as well as free of duplicates. Fixing things afterward means you are wasting resources, time, and disk space.
Break the merge into small steps, examine the result of each step for duplicates (one way to check is sketched below), then fix the code for that step to prevent duplicates. Then add in the next of the "several" databases. Rinse, repeat.