DATA Step, Macro, Functions and more

Selecting for One Observation Among Two of the Same

Reply
Regular Contributor
Posts: 194

Selecting for One Observation Among Two of the Same

Hello!  Any help you can give is appreciated!

I've DataFile1:

ID                  Type       Address             (other fields, etc...)

12345678      B             1 Aurora Ln                            

12345676      A             1 Aurora Ln                              

12345677      A             30 Chester Way                 

12345667      B             30 Chester Wy                  

12345654      A             45 Picadilly Ln

12345432      B             45 Picadilly Ln           

I've another data file, Data File 2,  that has the ID listed for the Type B's from DataFile1 but for which I need to replace the ID listed from DataFile1 with the Type A ID.  The address for DataFile 2 may not be the same as the address for data file 1.

Data File 2 also has other IDs that are correct and therefore don't need updating.

Data File 2

ID                  Type       Address             (other fields, etc...)

---observations with the same ID listed in DataFile1 but may have updated information

12345678      B             1 Aurora Ln  #B                                                                     

12345667      B             305 Chester Wy                  

12345432      B             45 Picadilly Ln 

.....more observations that don't need the ID number changed

13456787      A             300 Westinghouse

14556764      B             134 Hemingway

Need:  Final Table

ID                  Type       Address             (other fields, etc...)                       

12345676      A             1 Aurora Ln #B                               

12345677      A             305 Chester Way                               

12345654      A             45 Picadilly Ln

13456787      A             300 Westinghouse

14556764      B             134 Hemingway

Thank you!

Super Contributor
Posts: 307

Re: Selecting for One Observation Among Two of the Same

It's easy to match the ID-TYPE in DataFile1 to ID-TYPE in DataFile2. But, I can see that you have a challenge on your hands trying to then figure out which IDs in DataFile1 are related to the same address so that you can output the ID associated with TYPE = A. There is no obvious key between TYPE = A and TYPE = B. It looks like you will have to use some kind of fuzzy merge matching to figure out which addresses in DataFile1 are related . . .

You could try something like this (adapted from http://www.nesug.org/proceedings/nesug07/ap/ap23.pdf). Note: I'm not sure how well COMPGED deals with digits or punctuation. Also, you might want to strip out most punctuation and control characters before doing the comparison.

%let maxscore=999;

proc sql;

  create table matches as

  select a.id as a_id , a.address as addressA, b.id as b_id , b.address as addressB, 

         compged(a.address,b.address,&maxscore,'iL' ) as gedscore

    from DataFile1 a, DataFile1 b

   where a.id < b.id  

     and calculated gedscore < &maxscore

   order by calculated gedscore;

quit;

Super Contributor
Posts: 339

Re: Selecting for One Observation Among Two of the Same

I think you need to develop a little more on your data and intended results concept. When I look at your data example, it reads as though your datafile1 has 2 records per case, one with old (as matched with datafile2) and one with updated info and our aim is to more or less update datafile2?

If that's the case and you have additional information about your files structure, you might be able to use clever lag functions and the like to achieve your desired results. For instance, if datafile1 is automated system generated and always produces exactly 2 records in a row of the same case there can be fairly easy processing done between array _temporary_, hash tables and a custom new build index on datafile1 using like ceil(_N_/2) to create the index.

Anyway point being, if your data has a specific structure, there are probably far more reliable ways to process your data than eventually having to rely on string distance metrics to match addresses as the update to your "key" id is making processing very painful.

Nonetheless, if this is something that might be repeated over time and cases often change ID, you should request for a 3rd dataset with a date or datetime stamp and 2 more variable - old_id new_id as to document any such change in your primary keys.

Vince

Super Contributor
Posts: 307

Re: Selecting for One Observation Among Two of the Same

Posted in reply to Vince28_Statcan

Vince has a point - if DataFile1 is always arranged in some kind of inherently logical structure (e.g. if there are always two sequential records (one with Type=A, one with Type=B, in either order) for each "address"), then it is feasible to link the ID for the Type A record to the ID for the Type B record. This would eliminate the need for fuzzy logic.

Regular Contributor
Posts: 194

Re: Selecting for One Observation Among Two of the Same

I can only think of merging the first two observations, retaining each of the IDs (i.e, rename ID_A  and ID_B). Then, matching or link (how would you do that)  the ID associated with Type B  in datafile1 with  ID_B from Datafile 1, keeping all observations.  Then replacing the ID in Datafile 1 with ID_A.  How would you do this?  Linking?  Matching? 

Super Contributor
Posts: 307

Re: Selecting for One Observation Among Two of the Same

If Datafile1 is always logically arranged so that two sequential records always relate to the same addess (one having TYPE=A and one having TYPE=B), then try something like this:


/* assumes records in DataFile1 are sorted so that every two sequential records related to the same "address" */
data matchid (rename=(id_typeb=matchid));
set DataFile1 (keep = id type);
keep id_typea id_typeb ;
retain id_typea id_typeb ;
mod = mod ( _N_ , 2 ) ;
if _N_ = 1 then do ;
  id_typea = . ;
  id_typeb = . ;
end ;
if mod <> 0 then do ;
  id_typea = . ;
  id_typeb = . ;
end ;
if type = 'A' then id_typea = id ;
else if type = 'B' then id_typeb = id ;
/* only output every 2nd record */
if mod = 0 then output ;
run ;

proc sort data=datafile2 ;
by id ;
proc sort data=matchid ;
by matchid ;

data want (rename = ( new_id=id new_type=type ));
merge datafile2 ( in=a ) matchid ( in=b rename=(matchid = id )) ;
drop id type id_typea ;
by id;
if id_typea ne . then do ;
  new_id = id_typea ;
  new_type = 'A';
end;
else do ;
  new_id = id ;
  new_type = type ;
end ;
if a;/* keep all records in datafile2 */
run ;

Regular Contributor
Posts: 194

Re: Selecting for One Observation Among Two of the Same

Thanks! Am giving this a try!!

Regular Contributor
Posts: 194

Re: Selecting for One Observation Among Two of the Same


Thank you both. This is a one-time event.  Type A may be a higher number than type B or it may not be.  How do you link the ID?  Can you explain this or refer me to a resource that explains this:  'For instance, if datafile1 is automated system generated and always produces exactly 2 records in a row of the same case there can be fairly easy processing done between array _temporary_, hash tables and a custom new build index on datafile1 using like ceil(_N_/2) to create the index'

Ask a Question
Discussion stats
  • 7 replies
  • 262 views
  • 0 likes
  • 3 in conversation