DATA Step, Macro, Functions and more

compare two similar dataset

Reply
Contributor
Posts: 33

compare two similar dataset

Hi,

I have requirement to compare two monthly extracts which has same varibales, but values may be changed between two datasets. I need to extract observations that has different value in nay varibale compare to last month dataset. For example below:

 

DATA SasConf;

INFILE DATALINES;

INPUT ConfName $

ConfYear

ConfCity $

ConfST $ ;

DATALINES4;

SUGI 2006 San Francisco CA

PHARMASUG 2006 Bonita Springs FL

NESUG 2006 newyork NY

WUSS 2006 Irvine CA

SESUG 2006 Atlanta GA

SCSUG 2006 Irving TX

MWSUG 2006 Dearborn MI

SUGI 2007 Orlando FL

;;;;

DATA SasConf2;

INFILE DATALINES;

INPUT ConfName $

ConfYear

ConfCity $

ConfST $ ;

DATALINES4;

SUGI 2006 San Francisco CA

PHARMASUG 2006 Bonita Springs FL

NESUG 2006 Philadelphia PA

WUSS 2006 Irvine CA

SESUG 2006 Atlanta GA

SCSUG 2006 Irving TX

PNWSUG 2006 Seaside OR

NESUGI 2007 Orlando FL

;;;;

 

Compare these two datasets and the output datset should have following obervations:

NESUG 2006 Philadelphia PA

NESUGI 2007 Orlando FL

 

Note: there is no primary key in the dataset, so any varibale value can be different, so need to select the observation which is changed from last month and also ignore any new observation in the current month dataset.

 

Thank you for all your help.

Valued Guide
Posts: 2,177

Re: compare two similar dataset

Have you looked at PROC COMPARE?
PROC Star
Posts: 1,324

Re: compare two similar dataset

I don't see how this will be possible, if there is no key in the datasets.  You say you want to select observations with changed values, and ignore new observations.  If there is no key, how will you tell the difference between a changed observation and a new observation?

Trusted Advisor
Posts: 1,022

Re: compare two similar dataset

[ Edited ]

 

If what you want are records that are not completely matched in the two datasets, you could do the following program.

 

It only assumes that both dataset A and B have exactly the same variables. 

 

It will create two datasets: a_only and b_only.   But it does NOT tell you which b_only record is most likely to be associated with an a_only record.

 

 

data a_only (drop=rc);

  set a end=end_of_a;

 

  if _n_=1 then do;

    declare hash b (dataset:'b');

      b.definekey(all:'Y');

      b.definedata(all:'Y');

      b.definedone();

  end;

 

  rc=b.find();

  if rc=0 then rc=b.remove();  /* It's a match: don't output A and remove from hash b*/

  else output a_only;

  if end_of_a then rc=b.output(dataset:'bonly');

run;

 

Edit note:  Also I think there is an "except" operator in proc sql that could provide the unmatched records.

Super User
Posts: 10,044

Re: compare two similar dataset

That is very complicated problem.
You said there are not key variables, how can you measure the similarity  between these two string.

Why would you select
NESUGI 2007 Orlando FL

I think it is a new obs ,since it isn't included in the first table.
Trusted Advisor
Posts: 1,022

Re: compare two similar dataset

Even so, it probably makes a lot of sense to first eliminate exactly matched records. 

 

Then, with the (hopefully) small set of unmatched records, do some sort of spelling distance or general editting distance between them.  Or one could count the number/proportion of common words in every paired comparison.

 

Of course, this presumes that the op only expects 1 to 1 matches. 

Super User
Posts: 11,343

Re: compare two similar dataset

I would have started with

 

proc sql;
   create table want as
   select * 
   from sasconf2
   except
   select * from sasconf 
   ;
quit;

to show me everything in the second table not in the first. Which shows new records as well as changes.

 

Ask a Question
Discussion stats
  • 6 replies
  • 185 views
  • 0 likes
  • 6 in conversation