Solved: Re: Unique rows between two files

jen123 · Posted 01-17-2013 07:13 PM

I am trying to get the rows that are NOT in both datasets.

**************************************

proc sql;

create table libname.file_discrepancies as select *

from libname.file_a t1, libname.file_b t2

where GEID.t1 NE GEID.t2;

run;

************************************

However, I get this error which I cannot figure out. GEID exists as numeric in both datasets.

ERROR: Unresolved reference to table/correlation name GEID.

Any advice is appreciated.

Thanks!

Linlin · Posted 01-17-2013 07:45 PM

or by merging:

proc sort data=file_a; by geid;

proc sort data=file_b; by geid;

run;

data want;

merge file_a(in=a) file_b(in=b);

by geid;

if a and b then delete;

run;

View solution in original post

art297 · Posted 01-17-2013 07:33 PM

: I don't think your code will work or do what you want, but the error you are getting is because you have the roles reversed between file reference and variable. They should be something like t1.geid, not the way you have them (reversed).

Conversely, something like the following might achieve what you want:

data file_a;

input geid;

cards;

1

2

3

4

5

;

data file_b;

input geid;

cards;

1

2

7

8

9

;

proc sql;

create table work.file_discrepancies as

(select geid from work.file_a

except

select geid from work.file_b)

union

(select geid from work.file_b

except

select geid from work.file_a)

;

quit;

jen123 · Posted 01-18-2013 09:24 AM

Hi Arthur - what does the "cards" mean?

input geid;

cards;

I tried your code. No errors, but the resutls is only the GEID variable. I want all the columns/variables.

Let's say if the GEID matches between the two files but other variables do not match, would changing the "select GEID" in your proc sql to "select *" give me only the rows where all the variables do not match?

Linlin · Posted 01-17-2013 07:45 PM

or by merging:

proc sort data=file_a; by geid;

proc sort data=file_b; by geid;

run;

data want;

merge file_a(in=a) file_b(in=b);

by geid;

if a and b then delete;

run;

jen123 · Posted 01-18-2013 09:22 AM

Linlin,

I used your code, which was easier for me to understand considering my SAS knowledge level. Anyway, I got this error: ERROR: BY variables are not properly sorted on data set JEN2.COMBINED_GDW_PRIOR_A. (which is file_b in your code above).

The error pointed out to one specific employee. However, when I look at the data row, it looks right. What does a=1 b=1 mean in the error below?

a=1 b=1 GEID=1010353890 SOEID=SK53890 Full_Name=K,Suresh......(there are 135 columns/variables).

NOTE: The SAS System stopped processing this step because of errors.

NOTE: There were 104183 observations read from the data set JEN1.COMBINED_GDW_CURRENT_A.

NOTE: There were 35472 observations read from the data set JEN2.COMBINED_GDW_PRIOR_A.

WARNING: The data set JEN1.COMBINED_GDW_DISCREPANCIES may be incomplete. When this step was stopped there were 71599 observations

and 135 variables.

WARNING: Data set JEN1.COMBINED_GDW_DISCREPANCIES was not replaced because this step was stopped.

art297 · Posted 01-18-2013 09:41 AM

: Linlin's suggestion included first sorting the two files by geid. Did you do that?

a and b, in Linlin's code are boolean variables, indicating if a record came from one, the other, or both files. They were set with the in= option.

A value of 1 indicates that the record existed in that file.

As for your question about my code, cards (otherwise known as datalines) is simply a way of entering data into SAS.

jen123 · Posted 01-18-2013 09:49 AM

The first column is GEID. So I don't understand what the a= and b= mean? Is it saying that GEID is the 3 variable for this particular employee?

a=1 b=1 GEID=1010353890

Linlin · Posted 01-18-2013 09:43 AM

Would you please post your code?

jen123 · Posted 01-18-2013 09:47 AM

Sorry about that. Here is the code:

proc sort data=jen1.combined_gdw_current_a; by geid;

proc sort data=jen2.combined_gdw_prior; by geid;

run;

data jen1.combined_gdw_discrepancies;

merge jen1.combined_gdw_current_a(in=a) jen2.combined_gdw_prior_a(in=b);

by geid;

if a and b then delete;

run;

art297 · Posted 01-18-2013 09:55 AM

The second file is not the one that you sorted. You sorted jen2.combined_gdw_prior, but used jen2.combined_gdw_prior_a.

jen123 · Posted 01-18-2013 10:00 AM

The second file is not the one that you sorted. You sorted jen2.combined_gdw_prior, but used jen2.combined_gdw_prior_a.

Thanks for catching that Arthur!

jen123 · Posted 01-18-2013 09:59 AM

Sorry for being all over the place. I tried Arthur's advice and it worked. Now, what if I want for two fields (called GEID and GOC) to match. So, I am looking for a dataset where both fields do not match.

Here is my code:

proc sql;

create table jen3.combined_gdw_discrepancies_test as

(select GEID from jen1.combined_gdw_current_a

except

select GEID from jen2.combined_gdw_prior_a)

union

(select GEID from jen2.combined_gdw_prior_a

except

select GEID from jen1.combined_gdw_current_a)

;

quit;

art297 · Posted 01-18-2013 10:06 AM

As long as the files don't contain duplicate records in both files (i.e., resulting in a many-to-many match), I'd just use a variant of Linlin's code. i.e.:

proc sort data=jen1.combined_gdw_current_a; by geid goc; run;

proc sort data=jen2.combined_gdw_prior; by geid goc; run;

data jen1.combined_gdw_discrepancies;

merge jen1.combined_gdw_current_a(in=a) jen2.combined_gdw_prior(in=b);

by geid goc;

if a and b then delete;

run;

Re: Unique rows between two files

jen123 · Posted 01-18-2013 10:10 AM

That worked! How do I mark both you and Linlin with the correct answer for this question?

Thank you Linlin and Arthur!

art297 · Posted 01-18-2013 10:16 AM

You can mark Linlin as "correct" and, if desired, I'd be quite happy with a "helpful" thank you. More importantly, glad to see that you achieved what you needed to accomplish.

Registration is open

SAS Training: Just a Click Away