Hello! Any help you can give is appreciated!
I've DataFile1:
ID Type Address (other fields, etc...)
12345678 B 1 Aurora Ln
12345676 A 1 Aurora Ln
12345677 A 30 Chester Way
12345667 B 30 Chester Wy
12345654 A 45 Picadilly Ln
12345432 B 45 Picadilly Ln
I've another data file, Data File 2, that has the ID listed for the Type B's from DataFile1 but for which I need to replace the ID listed from DataFile1 with the Type A ID. The address for DataFile 2 may not be the same as the address for data file 1.
Data File 2 also has other IDs that are correct and therefore don't need updating.
Data File 2
ID Type Address (other fields, etc...)
---observations with the same ID listed in DataFile1 but may have updated information
12345678 B 1 Aurora Ln #B
12345667 B 305 Chester Wy
12345432 B 45 Picadilly Ln
.....more observations that don't need the ID number changed
13456787 A 300 Westinghouse
14556764 B 134 Hemingway
Need: Final Table
ID Type Address (other fields, etc...)
12345676 A 1 Aurora Ln #B
12345677 A 305 Chester Way
12345654 A 45 Picadilly Ln
13456787 A 300 Westinghouse
14556764 B 134 Hemingway
Thank you!
It's easy to match the ID-TYPE in DataFile1 to ID-TYPE in DataFile2. But, I can see that you have a challenge on your hands trying to then figure out which IDs in DataFile1 are related to the same address so that you can output the ID associated with TYPE = A. There is no obvious key between TYPE = A and TYPE = B. It looks like you will have to use some kind of fuzzy merge matching to figure out which addresses in DataFile1 are related . . .
You could try something like this (adapted from http://www.nesug.org/proceedings/nesug07/ap/ap23.pdf). Note: I'm not sure how well COMPGED deals with digits or punctuation. Also, you might want to strip out most punctuation and control characters before doing the comparison.
%let maxscore=999;
proc sql;
create table matches as
select a.id as a_id , a.address as addressA, b.id as b_id , b.address as addressB,
compged(a.address,b.address,&maxscore,'iL' ) as gedscore
from DataFile1 a, DataFile1 b
where a.id < b.id
and calculated gedscore < &maxscore
order by calculated gedscore;
quit;
I think you need to develop a little more on your data and intended results concept. When I look at your data example, it reads as though your datafile1 has 2 records per case, one with old (as matched with datafile2) and one with updated info and our aim is to more or less update datafile2?
If that's the case and you have additional information about your files structure, you might be able to use clever lag functions and the like to achieve your desired results. For instance, if datafile1 is automated system generated and always produces exactly 2 records in a row of the same case there can be fairly easy processing done between array _temporary_, hash tables and a custom new build index on datafile1 using like ceil(_N_/2) to create the index.
Anyway point being, if your data has a specific structure, there are probably far more reliable ways to process your data than eventually having to rely on string distance metrics to match addresses as the update to your "key" id is making processing very painful.
Nonetheless, if this is something that might be repeated over time and cases often change ID, you should request for a 3rd dataset with a date or datetime stamp and 2 more variable - old_id new_id as to document any such change in your primary keys.
Vince
Vince has a point - if DataFile1 is always arranged in some kind of inherently logical structure (e.g. if there are always two sequential records (one with Type=A, one with Type=B, in either order) for each "address"), then it is feasible to link the ID for the Type A record to the ID for the Type B record. This would eliminate the need for fuzzy logic.
I can only think of merging the first two observations, retaining each of the IDs (i.e, rename ID_A and ID_B). Then, matching or link (how would you do that) the ID associated with Type B in datafile1 with ID_B from Datafile 1, keeping all observations. Then replacing the ID in Datafile 1 with ID_A. How would you do this? Linking? Matching?
If Datafile1 is always logically arranged so that two sequential records always relate to the same addess (one having TYPE=A and one having TYPE=B), then try something like this:
/* assumes records in DataFile1 are sorted so that every two sequential records related to the same "address" */
data matchid (rename=(id_typeb=matchid));
set DataFile1 (keep = id type);
keep id_typea id_typeb ;
retain id_typea id_typeb ;
mod = mod ( _N_ , 2 ) ;
if _N_ = 1 then do ;
id_typea = . ;
id_typeb = . ;
end ;
if mod <> 0 then do ;
id_typea = . ;
id_typeb = . ;
end ;
if type = 'A' then id_typea = id ;
else if type = 'B' then id_typeb = id ;
/* only output every 2nd record */
if mod = 0 then output ;
run ;
proc sort data=datafile2 ;
by id ;
proc sort data=matchid ;
by matchid ;
data want (rename = ( new_id=id new_type=type ));
merge datafile2 ( in=a ) matchid ( in=b rename=(matchid = id )) ;
drop id type id_typea ;
by id;
if id_typea ne . then do ;
new_id = id_typea ;
new_type = 'A';
end;
else do ;
new_id = id ;
new_type = type ;
end ;
if a;/* keep all records in datafile2 */
run ;
Thanks! Am giving this a try!!
Thank you both. This is a one-time event. Type A may be a higher number than type B or it may not be. How do you link the ID? Can you explain this or refer me to a resource that explains this: 'For instance, if datafile1 is automated system generated and always produces exactly 2 records in a row of the same case there can be fairly easy processing done between array _temporary_, hash tables and a custom new build index on datafile1 using like ceil(_N_/2) to create the index'
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.