Desktop productivity for business analysts and programmers

Take my 19K records . . . please

Reply
Occasional Contributor
Posts: 18

Take my 19K records . . . please

Yo anybody out there . . . I am using SAS EG (new user) and have had Base SAS courses as well as EG but alas, I am back here at my workstation and have an issue. I have got a data set with 2 million records. My pal in another department has given me a seperate data set that lists 19 thousand of those very same records that should not be in my data set. So I have to remove the 19K that all have unique reciept numbers. Should I create a SAS program to do this? Or is there something in EG to do this. If I write a SAS program do I subset the data and then delete? Any ideas to give . . ???
Super Contributor
Super Contributor
Posts: 3,174

Re: Take my 19K records . . . please

It's unclear whether you have 'external' files or SAS files? If you have SAS files and the two are identical, you should be able to combine the two files and remove any duplicates. If, however, the two SAS files are not truly identical with all columns, you will need to consider identifying a set of "key" variables to combine the two files and then remove the duplicates.

If you have 'externa' files, you will need to import the two files, and then perform the process listed above.

Scott Barry
SBBWorks, Inc.
Occasional Contributor
Posts: 18

Re: Take my 19K records . . . please

Thank you. They are both in SAS files now but I also want to delete the record and thus sanitize the data base of these records.
Super User
Posts: 5,386

Re: Take my 19K records . . . please

On way of doing this (depends on your data) is to use a subquery in a SQL where expression:

...
where your_table.key_col not in(select other_table.keycol from other_table)
;
quit;

Can't see anyway of doing this the point-and-click way, at least in EG 4.1. Bot you could code in the advanced expression builder in the Filter and query object.

/Linus
Data never sleeps
Community Manager
Posts: 2,889

Re: Take my 19K records . . . please

You can do it in the query builder, I think, if I understand the question correctly.

1. Add the two tables into the query, T1 (full set) and T2 (to exclude). Join on the ID (receipt number).

2. Change the Join type to Left Join (assuming the Big table is first, and "exclude" table is second). This considers ALL rows from T1.

3. Add filter where T2.ID IS MISSING.

Resulting table will have all of the records from T1 that don't have a match in T2.

Chris
Occasional Contributor
Posts: 18

Re: Take my 19K records . . . please

Best one I have heard!! Thank you!!
Occasional Contributor
Posts: 18

Re: Take my 19K records . . . please

You know when I put in T2.ID is missing in the actual case its where t2.Receipt# is missing it does not like the syntax. Is this an SQL problem where I need parens or brackets? Thanx
Community Manager
Posts: 2,889

Re: Take my 19K records . . . please

Ooh, the column name is "Receipt#" ? With the hash symbol?

The query builder is *supposed* to build that as something like

T1."Receipt#"n

The quote-quote-n is the literal syntax to allow SAS to process column names that don't follow the traditional naming rules.

As long as OPTIONS VALIDVARNAME=ANY is in effect, which EG does set by default.

Chris
SAS Employee
Posts: 149

Re: Take my 19K records . . . please

Chris' solution is definitely the best point-and-click one I know of. FYI, if you wanted to write some code instead, there is a set operator EXCEPT that's part of SQL and is designed to do exactly what you describe: rows in one data set that are not in the other.

proc sql;
create table work.modified as
select id, name
from work.large
EXCEPT
select id, name
from work.small;
quit;

The new table would contain the columns ID and NAME, and only unique rows from the first data set (work.large) not found in the second data set (work.small).
Ask a Question
Discussion stats
  • 8 replies
  • 129 views
  • 0 likes
  • 5 in conversation