I am using a simple proc sql with a data set which contains millions of records.
The code is like
PROC SQL NOPRINT;
CREATE TABLE TABLE1
SELECT * FROM DM.TABLE
WHERE COMPANY="xxx"
;
Now I ran the code again with
PROC SQL NOPRINT;
CREATE TABLE TABLE2
SELECT * FROM DM.TABLE
WHERE COMPANY="xxx"
;
Now I am doing a proc compare without any by variable.
PROC COMPARE BASE=TABLE1 COMPARE=TABLE2;
Now proc compare compares the two data sets observation by observation or with a ID variable. ID variable is absent. What I think proc compare shold give exact matching. But though the rowcount is same it is giving some difference. When applied by variable on the primary key of the source data set,it is giving the exact match. Why it is happening can anybody help?
This is because how the proc compare works. If you given ID variable it will compare on the base of that ID. else each row will be compared with each in second table and difference is given.For more information you can read about proc compare in detail.
Even there is no ID variable then proc compare will compare each observation from the tables. So ideally as I am fetching the same value with the same filter condition, proc compare should return ideal match.
Another possibility is that the results of PROC SQL are in different order with different executions. SQL does not guarantee the output order to be the same each time; add the ORDER BY clause to get them in some sort of ID order. If DM.TABLE is on a database (rather than being a native SAS dataset), it may be doing the processing and you have no idea of the resulting data order.
Doc Muhlbaier
Duke
Even if DM.TABLE were a SAS data set, the structure might have changed between runs. Here is a simple change:
data DM.TABLE;
set DM.TABLE;
run;
This step could cause SQL to retrieve records in a different order. How? Well, it is possible that the original data set had an index where COMPANY was the primary indexed variable. Running through the DATA step destroys the reliability of the index. So the first SQL uses the index and the second one doesn't. We need to know more about where the data comes from, and what changed in between the first and second SQL runs.
This would happen if the source data (DM.TABLE) changed in between the two runs. It could be as simple as having run a PROC SORT in between.
Proc compare compares each column from table 1 to table 2 in a row. For example if there is a rundate in table1 and rundate in table2 need not be same even you started to run the code same time to get table1 and table2.
For comparing two tables I am sure this is going to be usefulcode.
/* first outer join table1 and table2*/
Proc SQL;
Create table table1andtable2outerjoin as
Select a.id as id_1,
a.name as name_1,
b.id as id_2,
b.name as name_2
From table1 as a
Full Outer Join
table2 as b
on a.id = b.id and
a.name=b.name;
Quit;
/*to see rows only in table2 not in table1 you can add more columns for comparison*/
Data tablenew;
Source = 'two';
Length Source $5;
Set table1andtable2outerjoin ;
if missing(id_1) and missing(name_1)
then output;
Format Source $5.;
Informat Source $5.;
run;
/*to see rows only in table1not in table2 you can add more columns for comparison*/
Data tableold;
Source = 'one';
Length Source $5;
Set table1andtable2outerjoin;
if missing(id_2) and missing(name_2) then output;
Format Source $5.;
Informat Source $5.;
run;
/*to see rows side by side*/
Data tablenewold;
Set tablenew tableold;
run;
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.