- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi all
I want to compare two large dasets (over 3 millions of rows) and I am introduced myselft to proc compare.
I tell you what are I Doing to run the code.
First I run a proc sort on both datasets to order the variables. After this I change the name and format of the compare dataset for having the same structure.
and the last step
proc compare base = a compare=b /* MAXPRINT=total*/
out= result outnoequal outbase outcomp outdif outpercent transpose;
var var1................varn;
run;
This is a sample of the result I have
_OBS_1=76637 _OBS_2=76637:
Variable Valor base Compare Diff. % Diff
var1 3039.07 3039.07 -4.54747E-13 -1.49634E-14
var2 3039.07 3039.07 -4.54747E-13 -1.49634E-14
and as this sample over 50,000 more
what do I doing wrong?, how can I fix this problem?
Thanks
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If the fact that your variables for some reason differ by a minute amount but will not affect your other uses you can use the FUZZ option on the Proc Compare statement to only report inequalities greater than the value you set.
For example:
proc compare base = a compare=b fuzz= 0.00001
out= result outnoequal outbase outcomp outdif outpercent transpose;
<remainder of compare code here>
will only report an inequaly greater than 0.00001 (or 1 E-5). Given your example a fuzz of 0.00000001 would probably remove most of the inequality messages.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Sorry, what is the question? The output from proc compare is not always easy to read and so may require some post-processing. Also, if you are comparing data of that size you may want to consider a bit more what you want to compare, for instance can you group the data, maybe summarise, of do by group joins to identify certain fall outs. There are other methodologies also, some of the stats procedures will give you counts, outliers. Distinct lists for codes, defined lists. Maybe if you have a visit schedule, you could pull that out as a tabulate etc.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I know both variables are the same so I'd like to ajust the proc to obtain equalities in variable like the sample. also I have tested with proc means but I have the same problem plus the outpercent option don't work for me because I don't how to change de format.
What are the other methods I can use to compare two datasets?
base 25451159.00
compare 25451275.42
diff 116.42
percent 0.00
thanks in advanced
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It looks like you are running into the display versus storage issue. You may see in your datasets:
3039.07 and 3039.07
Which look the same. However the data is really (an example):
3039.0700000000000000001 and 3039.07
Hence a fractional difference. I have come up against this several times and it can happen in several different ways, most notable if the base programmer uses SAS functions to create a calculated variable, and I use SQL. Its quite annoying. What I would suggest is that you identify what the precision needs to be for your compare values and then explicitly round to that precision. I.e. round(variable_from_base,8.2) = round(variable_from_comp,8.2).
So round them in both datasets then do a compare again. You could also put both variables into a character field.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
In regarding to precision issue, it could get messier than just that.
data _NULL_;
A=5.1;
B=3.5;
C1=1.6;
C2=A-B;
IF C1=C2 THEN
PUT C1 30.28 " EQUAL " C2 30.28;
ELSE PUT C1 30.28 " NOT EQUAL " C2 30.28;
PUT C1 HEX16./C2 HEX16. ;
RUN;
SAS (like other computer software) is using Hex. format to store the value, Decimal has always been a approximate approach. In this case, only Hex format can tell whether they are equal or not.
Regards,
Haikuo
Edit:
of course, Binary will also do:
PUT C1 BINARY64. /C2 BINARY64.;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If the fact that your variables for some reason differ by a minute amount but will not affect your other uses you can use the FUZZ option on the Proc Compare statement to only report inequalities greater than the value you set.
For example:
proc compare base = a compare=b fuzz= 0.00001
out= result outnoequal outbase outcomp outdif outpercent transpose;
<remainder of compare code here>
will only report an inequaly greater than 0.00001 (or 1 E-5). Given your example a fuzz of 0.00000001 would probably remove most of the inequality messages.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello SergioSanchez,
You can specify the criterion for judging the equality of numeric values. See:
PROC COMPARE :: Base SAS(R) 9.4 Procedures Guide, Second Edition
and:
Concepts: COMPARE Procedure :: Base SAS(R) 9.4 Procedures Guide, Second Edition
You might try:
method=absolute
criterion=0.0000001
to suppress differences smaller than 10e-7, for example, unless you're not sure if they're significant.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Think you need BY and or ID statements to make key-based compare
good luck
peterC
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I finally resolved with the options;
fuzz= 0.00001 criterion=-1000
At the moment it´s working fine
Thanks all for your time and posts
Regards