Hi All,
How can i filter the duplicate values using query builder ?
As shown below, i would like to filter the duplicate value that appear in Employee_ID column
Step 1
======
Build a new query on your dataset. Drag Employee_ID into the "Select Data" box, and then drag it in again. You'll have two columns, probably "Employee_ID" and "Employee_ID1".
On the second one, in the "Summary" column, select "Count". This will cause EG to create a variable which is the count of that value of Employee_ID. It turns it into a "Computed column".
In the "Filter Data" tab, drag this variable into the "Filter the summarized data" area, and set the filter to > 1.
Run the query. You should get a list of all of the Employee_ID values that appear more than once in your input file.
Step 2
======
This depends on what you need to do next. The general principle is to join this new table back to your original table, and get some kind of information from the join. Here's a simple example:
Build a new query on your original dataset. Drag all of the columns into the "Select Data" tab.
Click the "Add Tables" button, and select the table you created in Step 1. Drag the "COUNT" field into Select Data.
Click the "Join Tables" button, and in the join window make sure the tables are joined on Employee_ID. Double click the "Join" picture, and change it to a Left Join. Close the Join window.
Run the query. You should see the count variable added to the results, it will have the count for cases where there are more than one value for the ID, otherwise it will be missing.
This is a quite complex example to take on at the start. I strongly recommend that you get yourself into some Enterprise Guide training, where they'll cover topics like this and you'll have the chance to practice with a knowledgeable instructor!
Tom
Hi Tom,
i have tried the step 1 solution suggested by you, but there's no result.
Attached herewith the project script for your reference.
Kindly advise any steps, i have missed out ?
Thank you.
Yes, it's because you have all of your variables in the "Select Data" list. ONLY have the two copies of Employee_ID, otherwise it won't work.
Tom
Hi Tom,
I have tried your suggested solution, it works for numeric data.
What if my data contains numeric and character, how can i count it ? i tried the way advised by you, the error prompt "Query Syntax is not valid".
Thank you.
The problem is that somehow the filter became
(CALCULATED COUNT_of_Job_Title) = UPPER('1')
which is comparing a character to a numeric. You need to test for greater than one, so if you use the dialog to build the filter it should come out looking like
(CALCULATED COUNT_of_Job_Title) > 1
Hi Tom,
Thanks for your advice 🙂
It works now.
My final step is to show only one single row for those duplicated values.
Thus, what formula should I use to remove those duplicate values and only show one single value.
For instance, if the word "Jerry" appear more than 1 times, I would like to remove it and only show it once in the row.
Thanks in advance 🙂
How exactly do you want to select the one row that remains? If you just need a table with that variable and the value one time, you can just create a query with only that variable, and ask for "distinct" values.
Tom
Office ID | First Name | LAST NAME | POSITION | COUNTRY | Income | Product |
120102 | Tom | Zhou | Sales Manager | AU | 14,971.00 | PAPER |
120102 | Tom | Zhou | Sales Rep. II | UK | 14,971.00 | PAPER |
120102 | Tom | Zhou | Sales Manager | AU | 14,971.00 | TISSUE |
120122 | Kimiko | Ngan | Sales Rep. II | AU | (523.00) | TISSUE |
120123 | Kimiko | Hotstone | Sales Rep. I | AU | 3,193.00 | TISSUE |
120124 | Lucian | Daymond | Sales Rep. I | UK | 1,228.00 | PEN |
120124 | Lucian | Daymond | Sales Rep. IV | UK | 1,228.00 | PEN |
120126 | Satyakam | Denny | Sales Rep. II | US | 11,951.00 | PEN |
120127 | Sharryn | Clarkson | Sales Rep. II | AU | 8,404.00 | PAPER |
As shown in the table above, i would like to remove the row where the first name, last name & Income field appear the same info more than one times. How can i create a query and ask for "distinct" value as some of the fields consist of identical or distinct values.?
For instance, after i sorted the data, there're three rows of Office IDs (120102) with the same Name & Income, but different values for Product, Country and Position fields. Thus, I would like to remove the rest of two rows although the values are different in Product, Country and Position fields.
P/s Product, Country and Position fields should be remained in the table.
I have tried the solution below advised by you, but this will remove all the duplicate values included the original value after I filter the summarized data
"
Step 1
======
Build a new query on your dataset. Drag Employee_ID into the "Select Data" box, and then drag it in again. You'll have two columns, probably "Employee_ID" and "Employee_ID1".
On the second one, in the "Summary" column, select "Count". This will cause EG to create a variable which is the count of that value of Employee_ID. It turns it into a "Computed column".
In the "Filter Data" tab, drag this variable into the "Filter the summarized data" area, and set the filter to > 1.
Run the query. You should get a list of all of the Employee_ID values that appear more than once in your input file.
Step 2
======
This depends on what you need to do next. The general principle is to join this new table back to your original table, and get some kind of information from the join. Here's a simple example:
Build a new query on your original dataset. Drag all of the columns into the "Select Data" tab.
Click the "Add Tables" button, and select the table you created in Step 1. Drag the "COUNT" field into Select Data.
Click the "Join Tables" button, and in the join window make sure the tables are joined on Employee_ID. Double click the "Join" picture, and change it to a Left Join. Close the Join window.
Run the query. You should see the count variable added to the results, it will have the count for cases where there are more than one value for the ID, otherwise it will be missing."
SQL isn't well suited to that kind of a requirement. To do this, I would recommend a DATA step with a BY option, as below.
Tom
data Want;
set Have;
by FirstName LASTNAME Income notsorted;
if last.Income then
output;
run;
Hi Tom,
Thanks for your advice. I got the outcome which i want.
However, can you explain the meaning of the coding.
by FirstName LASTNAME Income notsorted;
if last.Income
Thank you.
The BY statement tells SAS to set up special "first" and "last" variables for each group of observations that have the same value of the BY variable(s).
Then when I say "if last.Income", it means if this is the last observation that has this particular set of first name, last name, and income.
So I'm saving only one observation with that combination of values, the last one.
Here's a technical description:
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Check out this tutorial series to learn how to build your own steps in SAS Studio.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.