Re: Filter Duplicate value

SASnewbie2 · Posted 10-09-2018 04:43 AM

Hi All,

How can i filter the duplicate values using query builder ?

As shown below, i would like to filter the duplicate value that appear in Employee_ID column

TomKari · Posted 10-09-2018 10:11 AM

Step 1
======

Build a new query on your dataset. Drag Employee_ID into the "Select Data" box, and then drag it in again. You'll have two columns, probably "Employee_ID" and "Employee_ID1".

On the second one, in the "Summary" column, select "Count". This will cause EG to create a variable which is the count of that value of Employee_ID. It turns it into a "Computed column".

In the "Filter Data" tab, drag this variable into the "Filter the summarized data" area, and set the filter to > 1.

Run the query. You should get a list of all of the Employee_ID values that appear more than once in your input file.

Step 2
======

This depends on what you need to do next. The general principle is to join this new table back to your original table, and get some kind of information from the join. Here's a simple example:

Build a new query on your original dataset. Drag all of the columns into the "Select Data" tab.

Click the "Add Tables" button, and select the table you created in Step 1. Drag the "COUNT" field into Select Data.

Click the "Join Tables" button, and in the join window make sure the tables are joined on Employee_ID. Double click the "Join" picture, and change it to a Left Join. Close the Join window.

Run the query. You should see the count variable added to the results, it will have the count for cases where there are more than one value for the ID, otherwise it will be missing.

This is a quite complex example to take on at the start. I strongly recommend that you get yourself into some Enterprise Guide training, where they'll cover topics like this and you'll have the chance to practice with a knowledgeable instructor!

Tom

SASnewbie2 · Posted 10-10-2018 05:11 AM

Hi Tom,

i have tried the step 1 solution suggested by you, but there's no result.

Attached herewith the project script for your reference.

Kindly advise any steps, i have missed out ?

Thank you.

TomKari · Posted 10-13-2018 02:14 PM

Yes, it's because you have all of your variables in the "Select Data" list. ONLY have the two copies of Employee_ID, otherwise it won't work.

Tom

SASnewbie2 · Posted 10-15-2018 04:54 AM

Hi Tom,

I have tried your suggested solution, it works for numeric data.

What if my data contains numeric and character, how can i count it ? i tried the way advised by you, the error prompt "Query Syntax is not valid".

Thank you.

TomKari · Posted 10-15-2018 09:35 AM

The problem is that somehow the filter became

(CALCULATED COUNT_of_Job_Title) = UPPER('1')

which is comparing a character to a numeric. You need to test for greater than one, so if you use the dialog to build the filter it should come out looking like

(CALCULATED COUNT_of_Job_Title) > 1

SASnewbie2 · Posted 10-16-2018 02:06 AM

Hi Tom,

Thanks for your advice 🙂

It works now.

My final step is to show only one single row for those duplicated values.

Thus, what formula should I use to remove those duplicate values and only show one single value.

For instance, if the word "Jerry" appear more than 1 times, I would like to remove it and only show it once in the row.

Thanks in advance 🙂

TomKari · Posted 10-16-2018 09:52 AM

How exactly do you want to select the one row that remains? If you just need a table with that variable and the value one time, you can just create a query with only that variable, and ask for "distinct" values.

Tom

SASnewbie2 · Posted 10-16-2018 10:05 PM

Office ID	First Name	LAST NAME	POSITION	COUNTRY	Income	Product
120102	Tom	Zhou	Sales Manager	AU	14,971.00	PAPER
120102	Tom	Zhou	Sales Rep. II	UK	14,971.00	PAPER
120102	Tom	Zhou	Sales Manager	AU	14,971.00	TISSUE
120122	Kimiko	Ngan	Sales Rep. II	AU	(523.00)	TISSUE
120123	Kimiko	Hotstone	Sales Rep. I	AU	3,193.00	TISSUE
120124	Lucian	Daymond	Sales Rep. I	UK	1,228.00	PEN
120124	Lucian	Daymond	Sales Rep. IV	UK	1,228.00	PEN
120126	Satyakam	Denny	Sales Rep. II	US	11,951.00	PEN
120127	Sharryn	Clarkson	Sales Rep. II	AU	8,404.00	PAPER

As shown in the table above, i would like to remove the row where the first name, last name & Income field appear the same info more than one times. How can i create a query and ask for "distinct" value as some of the fields consist of identical or distinct values.?

For instance, after i sorted the data, there're three rows of Office IDs (120102) with the same Name & Income, but different values for Product, Country and Position fields. Thus, I would like to remove the rest of two rows although the values are different in Product, Country and Position fields.

P/s Product, Country and Position fields should be remained in the table.

I have tried the solution below advised by you, but this will remove all the duplicate values included the original value after I filter the summarized data

"

Step 1
======

Build a new query on your dataset. Drag Employee_ID into the "Select Data" box, and then drag it in again. You'll have two columns, probably "Employee_ID" and "Employee_ID1".

On the second one, in the "Summary" column, select "Count". This will cause EG to create a variable which is the count of that value of Employee_ID. It turns it into a "Computed column".

In the "Filter Data" tab, drag this variable into the "Filter the summarized data" area, and set the filter to > 1.

Run the query. You should get a list of all of the Employee_ID values that appear more than once in your input file.

Step 2
======

This depends on what you need to do next. The general principle is to join this new table back to your original table, and get some kind of information from the join. Here's a simple example:

Build a new query on your original dataset. Drag all of the columns into the "Select Data" tab.

Click the "Add Tables" button, and select the table you created in Step 1. Drag the "COUNT" field into Select Data.

Click the "Join Tables" button, and in the join window make sure the tables are joined on Employee_ID. Double click the "Join" picture, and change it to a Left Join. Close the Join window.

Run the query. You should see the count variable added to the results, it will have the count for cases where there are more than one value for the ID, otherwise it will be missing."

TomKari · Posted 10-18-2018 07:25 PM

SQL isn't well suited to that kind of a requirement. To do this, I would recommend a DATA step with a BY option, as below.

Tom

data Want;
	set Have;
	by FirstName LASTNAME Income notsorted;
	if last.Income then
		output;
run;

SASnewbie2 · Posted 10-22-2018 05:43 AM

Hi Tom,

Thanks for your advice. I got the outcome which i want.

However, can you explain the meaning of the coding.

	by FirstName LASTNAME Income notsorted;
	if last.Income

Thank you.

TomKari · Posted 10-22-2018 02:37 PM

The BY statement tells SAS to set up special "first" and "last" variables for each group of observations that have the same value of the BY variable(s).

Then when I say "if last.Income", it means if this is the last observation that has this particular set of first name, last name, and income.

So I'm saving only one observation with that combination of values, the last one.

Here's a technical description:

https://documentation.sas.com/?docsetId=lestmtsref&docsetVersion=9.4&docsetTarget=p0yeyftk8ftuckn1o5...

Registration is open

SAS Training: Just a Click Away