I want to identify the duplicate percentage per month, from January through December.
I was able to find the duplicates and put them in a separate data set with the following code:
proc sort data=pankaj.UM_yield_detail_mm
          dupout=Duplicate_NODUPS   /* duplicate rows land here */
          nodupkey;                 /* keep the first row per BATCH_ID */
  by batch_id;
run;
Basically, I want the duplicate count and the percentage of duplicates based on the original data set, which contains the variable "Batch_ID". Is it possible to do this without creating another data set, or do I have to create one? Sample data are below (a sketch of one approach follows the listing).
Batch_ID
8809510976
8807226488
8809932795
8810161136
8810161136
8810172306
8809510752
8809436613
8809510734
8809510734
8810207111
8809510734
8810160901
8810207111
8809724861
8807221194
8807221200
8807221205
8807221217
8807221217
8807221217
8807221217
8807221200
8807221217
8807221200
8810160901
8810160901
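For reference, here is a minimal sketch of one way to get the duplicate count and percentage directly, without keeping a second sorted data set. It assumes the data set name pankaj.UM_yield_detail_mm from the post; the output name dup_summary is just an illustration:

proc sql;
  create table dup_summary as
  select count(batch_id)                              as total_rows,
         count(distinct batch_id)                     as distinct_ids,
         count(batch_id) - count(distinct batch_id)   as dup_count,
         calculated dup_count / calculated total_rows as dup_pct format=percent8.2
  from pankaj.UM_yield_detail_mm;
quit;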
Can you post an example of what you're expecting, please?
I suspect it's a double PROC FREQ, but I'm not sure:
proc freq data=have noprint;
  tables batch_id / out=count_id;   /* one row per BATCH_ID with its frequency in COUNT */
run;
proc freq data=count_id;
  tables count;                     /* distribution of repeat counts, with percents */
run;
I have a large data set containing, for instance, 200,000 observations in the format below. I need to identify the duplicate BATCH_IDs and the percentage of duplicates.
Suppose I have 10,000 duplicate Batch_IDs out of 200,000; that is 5% of the total Batch_IDs. From that I can calculate the count of duplicates and the percentage.
BATCH_ID | TIME | SIZE | TYPE | CUSTOMER | ARTICLE
For better understanding, I have attached a screenshot.
I have already implemented the code you provided, but it doesn't give me the percentage out of the 200,000 observations.
Add the percent output in the second PROC FREQ; the duplicate percentage is 1 minus the percent of COUNT=1 rows in that second table.
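A minimal sketch of that idea, assuming the have and count_id names from the earlier step (dup_pct is just an illustrative alias); it computes the share of distinct BATCH_IDs that occur more than once, i.e. 1 minus the share of COUNT=1:

proc sql;
  select sum(count > 1) / count(*) as dup_pct format=percent8.2
  from count_id;   /* count_id has one row per BATCH_ID with its frequency in COUNT */
quit;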
I wrote the code for the duplicate percentage and I am getting the output by MONTH and YEAR, which is what I wanted:
proc sql;
  create table duplicates_&p2 as
  select (count(batch_id) - count(distinct batch_id)) / count(batch_id)
           as Duplicate format=percent8.2,
         month(datepart(Ptime_build)) as Month,
         year(datepart(Ptime_build)) as Year
  from TABLENAME
  where "&Date_Range_Min."dt <= PTIME_BUILD <= "&Date_Range_Max."dt
    and PTIME_BUILD is not missing
  group by Month, Year;
quit;
Thanks for your suggestion.
As posted earlier, I am able to find the percentage of duplicates. Now I use SGPLOT to plot a line graph for the analysis:
Duplicate % on the Y axis and Month on the X axis.
I am able to do it for one variable, i.e. Month, and I want to combine Month and Year on the X axis. Here is my attempt:
proc sql;
  create table duplicates_11 as
  select (count(batch_id) - count(distinct batch_id)) / count(batch_id)
           as Duplicate format=percent8.2,
         month(datepart(Ptime_build)) as Month,
         year(datepart(Ptime_build)) as Year,
         MDY(Month, 1, Year) as Final_Date format=monyy5.
  from pankaj.mv_yield_detail_mm
  where PTIME_BUILD is not missing
  group by Month, Year;
quit;
Am I using the wrong format, or is the syntax incorrect?
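For reference, a minimal sketch of one likely fix, on the assumption that the error comes from Month and Year being computed in the same SELECT: PROC SQL needs the CALCULATED keyword to reuse them inside MDY. The SGPLOT step is only an illustration of putting the combined month-year value on the X axis:

proc sql;
  create table duplicates_11 as
  select (count(batch_id) - count(distinct batch_id)) / count(batch_id)
           as Duplicate format=percent8.2,
         month(datepart(Ptime_build)) as Month,
         year(datepart(Ptime_build)) as Year,
         mdy(calculated Month, 1, calculated Year)
           as Final_Date format=monyy5.   /* prints as e.g. JAN23 */
  from pankaj.mv_yield_detail_mm
  where PTIME_BUILD is not missing
  group by Month, Year;
quit;

proc sgplot data=duplicates_11;
  series x=Final_Date y=Duplicate;   /* duplicate % by month and year */
run;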