Jannie_D
Calcite | Level 5

I have a large dataset merged from several big databases. I am currently "cleaning" the data, and I need to conditionally remove duplicates so that I can create sums to run statistical analyses on.

I have created a table as an example below.

PNR is a personal ID number.

Each price and date are paired. In the real dataset I have another price, let's call it price4, that is also paired with date3. The population is a number from 1-6 that tells me which comparison group that person is in.

 

So what I want is to make sure that all duplicates that arose from the merging are removed, without removing the real data. Of course this is per person (I have more than 18,000 persons in this study).

 

I imagined that I could do something like: 

data all1;
set all;
if value(pnr)=value(pnr) and value(date1)=value(date1) and value(price1)=value(price1) then delete;
if value(pnr)=value(pnr) and value(date2)=value(date2) and value(price2)=value(price2) then delete;
if value(pnr)=value(pnr) and value(date3)=value(date3) and value(price3)=value(price3) then delete;
if value(pnr)=value(pnr) and value(date3)=value(date3) and value(price4)=value(price4) then delete;
run;

 

That obviously does not work. But is there a way I can do this, preferably in a data step, as I still need to sum the prices in different ways so that I can do my statistical analyses (PROC GLM) afterwards?

 

How do I do this?

 

Obs  PNR     Price1   date1     price2    date2     price3    date3     population
  1  887766    1500    1012014   45000     1012014   55000     1052018  1
  2  887766    1703    3042015   45000     1012014   55000     1052018  1
  3  887766    1703    3042015   45000     2012014   55000     1052018  1
  4  887766    1703    3042015   45000     1012014   55000     1052018  1
  5  887766    1703    3042015   45000     1012014    5701     1052018  1
  6  887755    3200    4022018  245761     3042017    7001     2032016  2
  7  887755    8700    4022018  245761     3042017    7001     2032016  2
  8  887755       0    4022018  245761     3042017    7001     2032016  2
  9  887755    3300    4022018  245761     3042017    7001     2032016  2
 10  887755    3200    4022018     241     3052017    7001     2032016  2
 11  887755    3200    4022018  245761     3042017       0     2032017  2
 12  887744    9763    1012014   99838     1032018    5056    10252018  5
 13  887744    9700    1012015   99838     1032018    5056    10252018  5
 14  887744    9763    1012014   99858     1032018    5056    10252018  5
 15  887744    9763    1012014      55     1072016    5056    10252018  5
 16  887733    3387    7162015   45000     2032016   30050    12012014  3
 17  887733    3377    7162015   45000     2032016   30050    12012014  3
 18  887733    3377    7162015     789     2032017   30050    12012014  3
 19  887733    3377    7162015   45000     2032016   30050    12012014  3
 20  887733    3377    7162015    1700    12052016   30050    12012014  3
 21  887733    3377    7162015   45000     2032016     308    12011014  3
 22  887733    1500    7162015   45000     2032016   30050    12012014  3
 23  887733    1500    7192015   45000     2032016   30050    12012014  3
 24  887722    4352    7152017       0     3042018   47531    11222014  6
 25  887722    4372    7162017       0     3042018   47531    11222014  6
 26  887722    5678    7152017       0     3042018   47531    11222014  6
 27  887722    4352    7152017     750     3052018   47531    11222014  6
 28  887722    4352    7152017       0     3042018    4700    11252014  6

4 REPLIES
PaigeMiller
Diamond | Level 26

In your previous thread, "Making a long merged data set broad", it was suggested several times that you provide the data as a SAS data step; examples and instructions were given. This would speed up responses to your question tremendously, and you would receive faster and better answers.

 

I can't really tell from your description what counts as a duplicate. I guess the explanation is not clear enough for me.

 

Did you try PROC SQL with the DISTINCT keyword?
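For example, a minimal sketch (the output name all_nodup is just a placeholder) that keeps one copy of each row where every column matches:

/* Keep one copy of each fully identical row */
proc sql;
   create table all_nodup as
   select distinct *
   from all;
quit;

That only helps if the duplicates are exact copies across all variables, though.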

--
Paige Miller
Jannie_D
Calcite | Level 5

Dear Paige, I did post the data step of my example in the previous thread, and I had a good back and forth with somebody, but I ended up with something that I could not do an analysis on. I am really trying my best, but as my server is a secure server, there are loads of restrictions on what I can write here when it comes to my data, and I cannot practically or legally copy anything from my real dataset here, which of course gives me challenges. The fact that my dataset is huge is a further obstacle.

 

This was the data step from my example:

data all;
infile datalines dsd truncover;
input PNR:$13. Price1:8. date1:mmddyy10. price2:8. date2:mmddyy10. price3:8. date3:mmddyy10. population:8.
;
datalines4;
887766,1500, 01012014,45000,01012014,55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 55000,01052018,1
887766, 1703,03042015, 45000, 02012014, 55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 5701,01052018,1
887755,3200, 04022018,245761,03042017,7001,02032016,2
887755,8700, 04022018,245761,03042017,7001,02032016,2
887755,0, 04022018,245761,03042017,7001,02032016,2
887755,3300, 04022018,245761,03042017,7001,02032016,2
887755,3200, 04022018,241,03052017,7001,02032016,2
887755,3200, 04022018,245761,03042017,0,02032017,2
887744,9763,01012014,99838,01032018,5056,10252018,5
887744,9700,01012015,99838,01032018,5056,10252018,5
887744,9763,01012014,99858,01032018,5056,10252018,5
887744,9763,01012014,55,01072016,5056,10252018,5
887733,3387,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,789,02032017,30050,12012014,3
887733,3377,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,1700,12052016,30050,12012014,3
887733,3377,07162015,45000,02032016,308,12011014,3
887733,1500,07162015,45000,02032016,30050,12012014,3
887733,1500,07192015,45000,02032016,30050,12012014,3
887722,4352,07152017,0,03042018,47531,11222014,6
887722,4372,07162017,0,03042018,47531,11222014,6
887722,5678,07152017,0,03042018,47531,11222014,6
887722,4352,07152017,750,03052018,47531,11222014,6
887722,4352,07152017,0,03042018,4700,11252014,6
;;;;
title "All";
proc print;
run;

 

I have tried the PROC SQL that Mr. Bremmer suggested in the previous post. I think I might not understand it very well when there are loops in it (which is something I have never worked with before).

 

Maybe my problem is the way I merged the data?

 

data all;
merge lmdb sssy drga drgh lpr dtsdccg dod udda;
by pnr;
run;

 

The problem is that the merge created duplicate cost values: one pnr (ID) can have the same date1 and cost1 repeated as many times as there are different values in the other variables, even though in reality it was only charged once. So if I create a variable that is the sum of all costs, or of the costs for one specific year, per pnr (ID), it will be too high because of the copies introduced by the merge. That is why I want to ensure that all cost1 values tied to the same pnr and date are different; otherwise they are duplicates (see the sketch after the next question for what I mean).

 

But maybe this problem can be solved already in the merge statement?
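What I imagine (just a sketch, with made-up output dataset names) is to treat each pnr/date/price pair on its own, drop duplicate pairs, and only then sum:

/* Keep each distinct pnr/date1/price1 combination exactly once */
proc sort data=all out=pairs1 (keep=pnr date1 price1) nodupkey;
   by pnr date1 price1;
run;

/* Sum the deduplicated price1 values per person */
proc summary data=pairs1 nway;
   class pnr;
   var price1;
   output out=sum1 (drop=_type_ _freq_) sum=price1_total;
run;

I would then repeat this for the date2/price2 and date3/price3 (and price4) pairs and merge the sums back together by pnr.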

Sajid01
Meteorite | Level 14

Hello @Jannie_D
You are the best person to select the algorithm and write the code to remove duplicates as per your business logic. You have access to, and a better understanding of, your environment and data.
I understand it is not possible to upload either the full data or the code from your workplace; it is the intellectual property of your employer. I can understand the stress you are under.
There is a lot in the literature on removing duplicates. This paper is a good start: https://support.sas.com/resources/papers/proceedings17/0188-2017.pdf
I always keep my copy of the Little SAS Book around. It is the first place to look for a solution.

ballardw
Super User

@Jannie_D wrote:

I have a dataset that is a big merged dataset from several big databases. I am currently "cleaning" data. And I need to conditionally remove duplicates, so that I can create sums that I can do statistical analyses on.

I have created a tabel as an example below


I would suggest working with subsets from all of the "several big databases" on the process used to merge to prevent the duplication.

A cleaned-up version will likely be more efficient to run as well as free of duplicates. Fixing things afterward wastes resources, time, and disk space.

 

Break the "merge" into small steps, examine the result of each for duplicates then fix the code for that step to prevent duplicates. Then add in the next of the "several" databases. Rinse, Repeat.
