turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- SAS Procedures
- /
- Need help with proc compare

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

06-17-2014 05:44 AM

Hi all

I want to compare two large dasets (over 3 millions of rows) and I am introduced myselft to proc compare.

I tell you what are I Doing to run the code.

First I run a proc sort on both datasets to order the variables. After this I change the name and format of the compare dataset for having the same structure.

and the last step

proc compare base = a compare=b /* MAXPRINT=total*/

out= result outnoequal outbase outcomp outdif outpercent transpose;

var var1................varn;

run;

This is a sample of the result I have

_OBS_1=76637 _OBS_2=76637:

Variable Valor base Compare Diff. % Diff

var1 3039.07 3039.07 -4.54747E-13 -1.49634E-14

var2 3039.07 3039.07 -4.54747E-13 -1.49634E-14

and as this sample over 50,000 more

what do I doing wrong?, how can I fix this problem?

Thanks

Accepted Solutions

Solution

06-17-2014
10:41 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

06-17-2014 10:41 AM

If the fact that your variables for some reason differ by a minute amount but will not affect your other uses you can use the FUZZ option on the Proc Compare statement to only report inequalities greater than the value you set.

For example:

proc compare base = a compare=b fuzz= 0.00001

out= result outnoequal outbase outcomp outdif outpercent transpose;

<remainder of compare code here>

will only report an inequaly greater than 0.00001 (or 1 E-5). Given your example a fuzz of 0.00000001 would probably remove most of the inequality messages.

All Replies

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

06-17-2014 05:51 AM

Sorry, what is the question? The output from proc compare is not always easy to read and so may require some post-processing. Also, if you are comparing data of that size you may want to consider a bit more what you want to compare, for instance can you group the data, maybe summarise, of do by group joins to identify certain fall outs. There are other methodologies also, some of the stats procedures will give you counts, outliers. Distinct lists for codes, defined lists. Maybe if you have a visit schedule, you could pull that out as a tabulate etc.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

06-17-2014 06:20 AM

I know both variables are the same so I'd like to ajust the proc to obtain equalities in variable like the sample. also I have tested with proc means but I have the same problem plus the outpercent option don't work for me because I don't how to change de format.

What are the other methods I can use to compare two datasets?

base 25451159.00

compare 25451275.42

diff 116.42

percent 0.00

thanks in advanced

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

06-17-2014 06:29 AM

It looks like you are running into the display versus storage issue. You may see in your datasets:

3039.07 and 3039.07

Which look the same. However the data is really (an example):

3039.0700000000000000001 and 3039.07

Hence a fractional difference. I have come up against this several times and it can happen in several different ways, most notable if the base programmer uses SAS functions to create a calculated variable, and I use SQL. Its quite annoying. What I would suggest is that you identify what the precision needs to be for your compare values and then explicitly round to that precision. I.e. round(variable_from_base,8.2) = round(variable_from_comp,8.2).

So round them in both datasets then do a compare again. You could also put both variables into a character field.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

06-17-2014 01:28 PM

In regarding to precision issue, it could get messier than just that.

**data** _NULL_;

A=**5.1**;

B=**3.5**;

C1=**1.6**;

C2=A-B;

IF C1=C2 THEN

PUT C1 **30.28** " EQUAL " C2 **30.28**;

ELSE PUT C1 **30.28** " NOT EQUAL " C2 **30.28**;

PUT C1 HEX16./C2 HEX16. ;

**RUN**;

SAS (like other computer software) is using Hex. format to store the value, Decimal has always been a approximate approach. In this case, only Hex format can tell whether they are equal or not.

Regards,

Haikuo

Edit:

of course, Binary will also do:

PUT C1 BINARY64. /C2 BINARY64.;

Solution

06-17-2014
10:41 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

06-17-2014 10:41 AM

If the fact that your variables for some reason differ by a minute amount but will not affect your other uses you can use the FUZZ option on the Proc Compare statement to only report inequalities greater than the value you set.

For example:

proc compare base = a compare=b fuzz= 0.00001

out= result outnoequal outbase outcomp outdif outpercent transpose;

<remainder of compare code here>

will only report an inequaly greater than 0.00001 (or 1 E-5). Given your example a fuzz of 0.00000001 would probably remove most of the inequality messages.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

06-17-2014 02:55 PM

Hello SergioSanchez,

You can specify the criterion for judging the equality of numeric values. See:

PROC COMPARE :: Base SAS(R) 9.4 Procedures Guide, Second Edition

and:

Concepts: COMPARE Procedure :: Base SAS(R) 9.4 Procedures Guide, Second Edition

You might try:

method=absolute

criterion=0.0000001

to suppress differences smaller than 10e-7, for example, unless you're not sure if they're significant.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

06-17-2014 03:21 PM

Think you need BY and or ID statements to make key-based compare

good luck

peterC

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

06-18-2014 06:27 AM

I finally resolved with the options;

fuzz= 0.00001 criterion=-1000

At the moment it´s working fine

Thanks all for your time and posts

Regards