topic compare two similar dataset in SAS Programming

compare two similar dataset

cho16 — Wed, 15 Feb 2017 23:22:25 GMT

Hi,

I have requirement to compare two monthly extracts which has same varibales, but values may be changed between two datasets. I need to extract observations that has different value in nay varibale compare to last month dataset. For example below:

DATA SasConf;

INFILE DATALINES;

INPUT ConfName $

ConfYear

ConfCity $

ConfST $ ;

DATALINES4;

SUGI 2006 San Francisco CA

PHARMASUG 2006 Bonita Springs FL

NESUG 2006 newyork NY

WUSS 2006 Irvine CA

SESUG 2006 Atlanta GA

SCSUG 2006 Irving TX

MWSUG 2006 Dearborn MI

SUGI 2007 Orlando FL

;;;;

DATA SasConf2;

INFILE DATALINES;

INPUT ConfName $

ConfYear

ConfCity $

ConfST $ ;

DATALINES4;

SUGI 2006 San Francisco CA

PHARMASUG 2006 Bonita Springs FL

NESUG 2006 Philadelphia PA

WUSS 2006 Irvine CA

SESUG 2006 Atlanta GA

SCSUG 2006 Irving TX

PNWSUG 2006 Seaside OR

NESUGI 2007 Orlando FL

;;;;

Compare these two datasets and the output datset should have following obervations:

NESUG 2006 Philadelphia PA

NESUGI 2007 Orlando FL

Note: there is no primary key in the dataset, so any varibale value can be different, so need to select the observation which is changed from last month and also ignore any new observation in the current month dataset.

Thank you for all your help.

Re: compare two similar dataset

Peter_C — Wed, 15 Feb 2017 23:29:54 GMT

Have you looked at PROC COMPARE?

Re: compare two similar dataset

Quentin — Thu, 16 Feb 2017 01:00:31 GMT

I don't see how this will be possible, if there is no key in the datasets. You say you want to select observations with changed values, and ignore new observations. If there is no key, how will you tell the difference between a changed observation and a new observation?

Re: compare two similar dataset

mkeintz — Thu, 16 Feb 2017 01:58:36 GMT

If what you want are records that are not completely matched in the two datasets, you could do the following program.

It only assumes that both dataset A and B have exactly the same variables.

It will create two datasets: a_only and b_only. But it does NOT tell you which b_only record is most likely to be associated with an a_only record.

data a_only (drop=rc);

set a end=end_of_a;

if _n_=1 then do;

declare hash b (dataset:'b');

b.definekey(all:'Y');

b.definedata(all:'Y');

b.definedone();

end;

rc=b.find();

if rc=0 then rc=b.remove(); /* It's a match: don't output A and remove from hash b*/

else output a_only;

if end_of_a then rc=b.output(dataset:'bonly');

run;

Edit note: Also I think there is an "except" operator in proc sql that could provide the unmatched records.

Re: compare two similar dataset

Ksharp — Thu, 16 Feb 2017 02:59:48 GMT

That is very complicated problem.
You said there are not key variables, how can you measure the similarity  between these two string.

Why would you select
NESUGI 2007 Orlando FL

I think it is a new obs ,since it isn't included in the first table.

Re: compare two similar dataset

mkeintz — Thu, 16 Feb 2017 03:08:15 GMT

Even so, it probably makes a lot of sense to first eliminate exactly matched records.

Then, with the (hopefully) small set of unmatched records, do some sort of spelling distance or general editting distance between them. Or one could count the number/proportion of common words in every paired comparison.

Of course, this presumes that the op only expects 1 to 1 matches.

Re: compare two similar dataset

ballardw — Thu, 16 Feb 2017 17:35:58 GMT

I would have started with

proc sql;
   create table want as
   select * 
   from sasconf2
   except
   select * from sasconf 
   ;
quit;

to show me everything in the second table not in the first. Which shows new records as well as changes.