compare two similar dataset

cho16 · Posted 02-15-2017 06:22 PM

Hi,

I have requirement to compare two monthly extracts which has same varibales, but values may be changed between two datasets. I need to extract observations that has different value in nay varibale compare to last month dataset. For example below:

DATA SasConf;

INFILE DATALINES;

INPUT ConfName $

ConfYear

ConfCity $

ConfST $ ;

DATALINES4;

SUGI 2006 San Francisco CA

PHARMASUG 2006 Bonita Springs FL

NESUG 2006 newyork NY

WUSS 2006 Irvine CA

SESUG 2006 Atlanta GA

SCSUG 2006 Irving TX

MWSUG 2006 Dearborn MI

SUGI 2007 Orlando FL

;;;;

DATA SasConf2;

INFILE DATALINES;

INPUT ConfName $

ConfYear

ConfCity $

ConfST $ ;

DATALINES4;

SUGI 2006 San Francisco CA

PHARMASUG 2006 Bonita Springs FL

NESUG 2006 Philadelphia PA

WUSS 2006 Irvine CA

SESUG 2006 Atlanta GA

SCSUG 2006 Irving TX

PNWSUG 2006 Seaside OR

NESUGI 2007 Orlando FL

;;;;

Compare these two datasets and the output datset should have following obervations:

NESUG 2006 Philadelphia PA

NESUGI 2007 Orlando FL

Note: there is no primary key in the dataset, so any varibale value can be different, so need to select the observation which is changed from last month and also ignore any new observation in the current month dataset.

Thank you for all your help.

Peter_C · Posted 02-15-2017 06:29 PM

Have you looked at PROC COMPARE?

Quentin · Posted 02-15-2017 08:00 PM

I don't see how this will be possible, if there is no key in the datasets. You say you want to select observations with changed values, and ignore new observations. If there is no key, how will you tell the difference between a changed observation and a new observation?

BASUG is hosting free webinars Next up: Mike Sale presenting Data Warehousing with SAS April 10 at noon ET. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

mkeintz · Posted 02-15-2017 08:56 PM

If what you want are records that are not completely matched in the two datasets, you could do the following program.

It only assumes that both dataset A and B have exactly the same variables.

It will create two datasets: a_only and b_only. But it does NOT tell you which b_only record is most likely to be associated with an a_only record.

data a_only (drop=rc);

set a end=end_of_a;

if _n_=1 then do;

declare hash b (dataset:'b');

b.definekey(all:'Y');

b.definedata(all:'Y');

b.definedone();

end;

rc=b.find();

if rc=0 then rc=b.remove(); /* It's a match: don't output A and remove from hash b*/

else output a_only;

if end_of_a then rc=b.output(dataset:'bonly');

run;

Edit note: Also I think there is an "except" operator in proc sql that could provide the unmatched records.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ksharp · Posted 02-15-2017 09:59 PM

That is very complicated problem.
You said there are not key variables, how can you measure the similarity  between these two string.

Why would you select
NESUGI 2007 Orlando FL

I think it is a new obs ,since it isn't included in the first table.

mkeintz · Posted 02-15-2017 10:08 PM

Even so, it probably makes a lot of sense to first eliminate exactly matched records.

Then, with the (hopefully) small set of unmatched records, do some sort of spelling distance or general editting distance between them. Or one could count the number/proportion of common words in every paired comparison.

Of course, this presumes that the op only expects 1 to 1 matches.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ballardw · Posted 02-16-2017 12:35 PM

I would have started with

proc sql;
   create table want as
   select * 
   from sasconf2
   except
   select * from sasconf 
   ;
quit;

to show me everything in the second table not in the first. Which shows new records as well as changes.

compare two similar dataset

Re: compare two similar dataset

Re: compare two similar dataset

Re: compare two similar dataset

Re: compare two similar dataset

Re: compare two similar dataset

Re: compare two similar dataset

Classroom Training Available!