SAS Procedures

SergioSanchez · Posted 06-17-2014 05:44 AM

Hi all

I want to compare two large dasets (over 3 millions of rows) and I am introduced myselft to proc compare.

I tell you what are I Doing to run the code.

First I run a proc sort on both datasets to order the variables. After this I change the name and format of the compare dataset for having the same structure.

and the last step

proc compare base = a compare=b /* MAXPRINT=total*/

out= result outnoequal outbase outcomp outdif outpercent transpose;

var var1................varn;

run;

This is a sample of the result I have

_OBS_1=76637 _OBS_2=76637:

Variable Valor base Compare Diff. % Diff

var1 3039.07 3039.07 -4.54747E-13 -1.49634E-14

var2 3039.07 3039.07 -4.54747E-13 -1.49634E-14

and as this sample over 50,000 more

what do I doing wrong?, how can I fix this problem?

Thanks

ballardw · Posted 06-17-2014 10:41 AM

If the fact that your variables for some reason differ by a minute amount but will not affect your other uses you can use the FUZZ option on the Proc Compare statement to only report inequalities greater than the value you set.
For example:

proc compare base = a compare=b fuzz= 0.00001

out= result outnoequal outbase outcomp outdif outpercent transpose;

will only report an inequaly greater than 0.00001 (or 1 E-5). Given your example a fuzz of 0.00000001 would probably remove most of the inequality messages.

View solution in original post

RW9 · Posted 06-17-2014 05:51 AM

Sorry, what is the question? The output from proc compare is not always easy to read and so may require some post-processing. Also, if you are comparing data of that size you may want to consider a bit more what you want to compare, for instance can you group the data, maybe summarise, of do by group joins to identify certain fall outs. There are other methodologies also, some of the stats procedures will give you counts, outliers. Distinct lists for codes, defined lists. Maybe if you have a visit schedule, you could pull that out as a tabulate etc.

SergioSanchez · Posted 06-17-2014 06:20 AM

I know both variables are the same so I'd like to ajust the proc to obtain equalities in variable like the sample. also I have tested with proc means but I have the same problem plus the outpercent option don't work for me because I don't how to change de format.

What are the other methods I can use to compare two datasets?

base 25451159.00

compare 25451275.42

diff 116.42

percent 0.00

thanks in advanced

RW9 · Posted 06-17-2014 06:29 AM

It looks like you are running into the display versus storage issue. You may see in your datasets:

3039.07 and 3039.07

Which look the same. However the data is really (an example):

3039.0700000000000000001 and 3039.07

Hence a fractional difference. I have come up against this several times and it can happen in several different ways, most notable if the base programmer uses SAS functions to create a calculated variable, and I use SQL. Its quite annoying. What I would suggest is that you identify what the precision needs to be for your compare values and then explicitly round to that precision. I.e. round(variable_from_base,8.2) = round(variable_from_comp,8.2).

So round them in both datasets then do a compare again. You could also put both variables into a character field.

Haikuo · Posted 06-17-2014 01:28 PM

In regarding to precision issue, it could get messier than just that.

data _NULL_;

A=5.1;

B=3.5;

C1=1.6;

C2=A-B;

IF C1=C2 THEN

PUT C1 30.28 " EQUAL " C2 30.28;

ELSE PUT C1 30.28 " NOT EQUAL " C2 30.28;

PUT C1 HEX16./C2 HEX16. ;

RUN;

SAS (like other computer software) is using Hex. format to store the value, Decimal has always been a approximate approach. In this case, only Hex format can tell whether they are equal or not.

Regards,

Haikuo

Edit:

of course, Binary will also do:

PUT C1 BINARY64. /C2 BINARY64.;

ballardw · Posted 06-17-2014 10:41 AM

If the fact that your variables for some reason differ by a minute amount but will not affect your other uses you can use the FUZZ option on the Proc Compare statement to only report inequalities greater than the value you set.
For example:

proc compare base = a compare=b fuzz= 0.00001

out= result outnoequal outbase outcomp outdif outpercent transpose;

will only report an inequaly greater than 0.00001 (or 1 E-5). Given your example a fuzz of 0.00000001 would probably remove most of the inequality messages.

JasonAllen · Posted 06-17-2014 02:55 PM

Hello SergioSanchez,

You can specify the criterion for judging the equality of numeric values. See:

PROC COMPARE :: Base SAS(R) 9.4 Procedures Guide, Second Edition

and:

Concepts: COMPARE Procedure :: Base SAS(R) 9.4 Procedures Guide, Second Edition

You might try:

method=absolute

criterion=0.0000001

to suppress differences smaller than 10e-7, for example, unless you're not sure if they're significant.

Peter_C · Posted 06-17-2014 03:21 PM

Think you need BY and or ID statements to make key-based compare

good luck

peterC

SergioSanchez · Posted 06-18-2014 06:27 AM

I finally resolved with the options;

fuzz= 0.00001 criterion=-1000

At the moment it´s working fine

Thanks all for your time and posts

Regards

SAS Procedures

Need help with proc compare

Re: Need help with proc compare

Re: Need help with proc compare

Re: Need help with proc compare

Re: Need help with proc compare

Re: Need help with proc compare

Re: Need help with proc compare

Re: Need help with proc compare

Re: Need help with proc compare

Re: Need help with proc compare

Follow Us

What is...

SAS Procedures

Our biggest data and AI event of the year.

SAS Training: Just a Click Away

Follow Us

What is...