BookmarkSubscribeRSS Feed
cho16
Obsidian | Level 7

Hi,

I have requirement to compare two monthly extracts which has same varibales, but values may be changed between two datasets. I need to extract observations that has different value in nay varibale compare to last month dataset. For example below:

 

DATA SasConf;

INFILE DATALINES;

INPUT ConfName $

ConfYear

ConfCity $

ConfST $ ;

DATALINES4;

SUGI 2006 San Francisco CA

PHARMASUG 2006 Bonita Springs FL

NESUG 2006 newyork NY

WUSS 2006 Irvine CA

SESUG 2006 Atlanta GA

SCSUG 2006 Irving TX

MWSUG 2006 Dearborn MI

SUGI 2007 Orlando FL

;;;;

DATA SasConf2;

INFILE DATALINES;

INPUT ConfName $

ConfYear

ConfCity $

ConfST $ ;

DATALINES4;

SUGI 2006 San Francisco CA

PHARMASUG 2006 Bonita Springs FL

NESUG 2006 Philadelphia PA

WUSS 2006 Irvine CA

SESUG 2006 Atlanta GA

SCSUG 2006 Irving TX

PNWSUG 2006 Seaside OR

NESUGI 2007 Orlando FL

;;;;

 

Compare these two datasets and the output datset should have following obervations:

NESUG 2006 Philadelphia PA

NESUGI 2007 Orlando FL

 

Note: there is no primary key in the dataset, so any varibale value can be different, so need to select the observation which is changed from last month and also ignore any new observation in the current month dataset.

 

Thank you for all your help.

6 REPLIES 6
Peter_C
Rhodochrosite | Level 12
Have you looked at PROC COMPARE?
Quentin
Super User

I don't see how this will be possible, if there is no key in the datasets.  You say you want to select observations with changed values, and ignore new observations.  If there is no key, how will you tell the difference between a changed observation and a new observation?

BASUG is hosting free webinars Next up: Mike Sale presenting Data Warehousing with SAS April 10 at noon ET. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
mkeintz
PROC Star

 

If what you want are records that are not completely matched in the two datasets, you could do the following program.

 

It only assumes that both dataset A and B have exactly the same variables. 

 

It will create two datasets: a_only and b_only.   But it does NOT tell you which b_only record is most likely to be associated with an a_only record.

 

 

data a_only (drop=rc);

  set a end=end_of_a;

 

  if _n_=1 then do;

    declare hash b (dataset:'b');

      b.definekey(all:'Y');

      b.definedata(all:'Y');

      b.definedone();

  end;

 

  rc=b.find();

  if rc=0 then rc=b.remove();  /* It's a match: don't output A and remove from hash b*/

  else output a_only;

  if end_of_a then rc=b.output(dataset:'bonly');

run;

 

Edit note:  Also I think there is an "except" operator in proc sql that could provide the unmatched records.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ksharp
Super User
That is very complicated problem.
You said there are not key variables, how can you measure the similarity  between these two string.

Why would you select
NESUGI 2007 Orlando FL

I think it is a new obs ,since it isn't included in the first table.
mkeintz
PROC Star

Even so, it probably makes a lot of sense to first eliminate exactly matched records. 

 

Then, with the (hopefully) small set of unmatched records, do some sort of spelling distance or general editting distance between them.  Or one could count the number/proportion of common words in every paired comparison.

 

Of course, this presumes that the op only expects 1 to 1 matches. 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ballardw
Super User

I would have started with

 

proc sql;
   create table want as
   select * 
   from sasconf2
   except
   select * from sasconf 
   ;
quit;

to show me everything in the second table not in the first. Which shows new records as well as changes.

 

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1144 views
  • 0 likes
  • 6 in conversation