Selecting for One Observation Among Two of the Same

jcis7 · Posted 08-02-2013 07:09 PM

Hello! Any help you can give is appreciated!

I've DataFile1:

ID Type Address (other fields, etc...)

12345678 B 1 Aurora Ln

12345676 A 1 Aurora Ln

12345677 A 30 Chester Way

12345667 B 30 Chester Wy

12345654 A 45 Picadilly Ln

12345432 B 45 Picadilly Ln

I've another data file, Data File 2, that has the ID listed for the Type B's from DataFile1 but for which I need to replace the ID listed from DataFile1 with the Type A ID. The address for DataFile 2 may not be the same as the address for data file 1.

Data File 2 also has other IDs that are correct and therefore don't need updating.

Data File 2

ID Type Address (other fields, etc...)

---observations with the same ID listed in DataFile1 but may have updated information

12345678 B 1 Aurora Ln #B

12345667 B 305 Chester Wy

12345432 B 45 Picadilly Ln

.....more observations that don't need the ID number changed

13456787 A 300 Westinghouse

14556764 B 134 Hemingway

Need: Final Table

ID Type Address (other fields, etc...)

12345676 A 1 Aurora Ln #B

12345677 A 305 Chester Way

12345654 A 45 Picadilly Ln

13456787 A 300 Westinghouse

14556764 B 134 Hemingway

Thank you!

Fugue · Posted 08-02-2013 07:54 PM

It's easy to match the ID-TYPE in DataFile1 to ID-TYPE in DataFile2. But, I can see that you have a challenge on your hands trying to then figure out which IDs in DataFile1 are related to the same address so that you can output the ID associated with TYPE = A. There is no obvious key between TYPE = A and TYPE = B. It looks like you will have to use some kind of fuzzy merge matching to figure out which addresses in DataFile1 are related . . .

You could try something like this (adapted from http://www.nesug.org/proceedings/nesug07/ap/ap23.pdf). Note: I'm not sure how well COMPGED deals with digits or punctuation. Also, you might want to strip out most punctuation and control characters before doing the comparison.

%let maxscore=999;

proc sql;

create table matches as

select a.id as a_id , a.address as addressA, b.id as b_id , b.address as addressB,

compged(a.address,b.address,&maxscore,'iL' ) as gedscore

from DataFile1 a, DataFile1 b

where a.id < b.id

and calculated gedscore < &maxscore

order by calculated gedscore;

quit;

Vince28_Statcan · Posted 08-02-2013 09:22 PM

I think you need to develop a little more on your data and intended results concept. When I look at your data example, it reads as though your datafile1 has 2 records per case, one with old (as matched with datafile2) and one with updated info and our aim is to more or less update datafile2?

If that's the case and you have additional information about your files structure, you might be able to use clever lag functions and the like to achieve your desired results. For instance, if datafile1 is automated system generated and always produces exactly 2 records in a row of the same case there can be fairly easy processing done between array _temporary_, hash tables and a custom new build index on datafile1 using like ceil(_N_/2) to create the index.

Anyway point being, if your data has a specific structure, there are probably far more reliable ways to process your data than eventually having to rely on string distance metrics to match addresses as the update to your "key" id is making processing very painful.

Nonetheless, if this is something that might be repeated over time and cases often change ID, you should request for a 3rd dataset with a date or datetime stamp and 2 more variable - old_id new_id as to document any such change in your primary keys.

Vince

Fugue · Posted 08-02-2013 09:41 PM

Vince has a point - if DataFile1 is always arranged in some kind of inherently logical structure (e.g. if there are always two sequential records (one with Type=A, one with Type=B, in either order) for each "address"), then it is feasible to link the ID for the Type A record to the ID for the Type B record. This would eliminate the need for fuzzy logic.

jcis7 · Posted 08-05-2013 12:43 PM

I can only think of merging the first two observations, retaining each of the IDs (i.e, rename ID_A and ID_B). Then, matching or link (how would you do that) the ID associated with Type B in datafile1 with ID_B from Datafile 1, keeping all observations. Then replacing the ID in Datafile 1 with ID_A. How would you do this? Linking? Matching?

Fugue · Posted 08-05-2013 03:39 PM

If Datafile1 is always logically arranged so that two sequential records always relate to the same addess (one having TYPE=A and one having TYPE=B), then try something like this:

/* assumes records in DataFile1 are sorted so that every two sequential records related to the same "address" */
data matchid (rename=(id_typeb=matchid));
set DataFile1 (keep = id type);
keep id_typea id_typeb ;
retain id_typea id_typeb ;
mod = mod ( _N_ , 2 ) ;
if _N_ = 1 then do ;
id_typea = . ;
id_typeb = . ;
end ;
if mod <> 0 then do ;
id_typea = . ;
id_typeb = . ;
end ;
if type = 'A' then id_typea = id ;
else if type = 'B' then id_typeb = id ;
/* only output every 2nd record */
if mod = 0 then output ;
run ;

proc sort data=datafile2 ;
by id ;
proc sort data=matchid ;
by matchid ;

data want (rename = ( new_id=id new_type=type ));
merge datafile2 ( in=a ) matchid ( in=b rename=(matchid = id )) ;
drop id type id_typea ;
by id;
if id_typea ne . then do ;
new_id = id_typea ;
new_type = 'A';
end;
else do ;
new_id = id ;
new_type = type ;
end ;
if a;/* keep all records in datafile2 */
run ;

jcis7 · Posted 08-05-2013 05:05 PM

Thanks! Am giving this a try!!

jcis7 · Posted 08-05-2013 12:37 PM

Thank you both. This is a one-time event. Type A may be a higher number than type B or it may not be. How do you link the ID? Can you explain this or refer me to a resource that explains this: 'For instance, if datafile1 is automated system generated and always produces exactly 2 records in a row of the same case there can be fairly easy processing done between array _temporary_, hash tables and a custom new build index on datafile1 using like ceil(_N_/2) to create the index'

Selecting for One Observation Among Two of the Same

Re: Selecting for One Observation Among Two of the Same

Re: Selecting for One Observation Among Two of the Same

Re: Selecting for One Observation Among Two of the Same

Re: Selecting for One Observation Among Two of the Same

Re: Selecting for One Observation Among Two of the Same

Re: Selecting for One Observation Among Two of the Same

Re: Selecting for One Observation Among Two of the Same

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away