I'm trying to match a file with each person's contract number with another file that has information about the contracts. The problem is that about 3% of records in the person file are not matching with the contract file, even though their contract ID is present in both files (verified by manual review).
These are alphanumeric codes, and both variables are character. The person file variable is char9, while the contract file variable is char5. I thought there might be padding blanks that were preventing a match, but I added COMPRESS commands and it didn't make any difference. I ran frequencies of the results produced by the COMPRESS commands and they look right to me. All letters are capitalized in both files. Any other ideas about what might be preventing a match? Examples of the contracts that are not matching include 90091, H9585, and R5329.
Unfortunately, the data is confidential and I can't share any of it. I thought about creating example data, but this is only happening in about 3% of cases and I don't know how to make it happen with example data.
PROC SQL;
Create table want
as Select a.*, b.*
From person_file. a JOIN contract_file b
On COMPRESS(a.person_contract) = COMPRESS(b.contract_id);
QUIT;
Well I think I found the problem... the version of the file I was manually looking at was not the most recent version. The most recent version had been re-created and that re-introduced an earlier error I had already corrected. In my defense, I'm working with someone else's code -- I always create a new dataset name when I add or remove cases, whereas this code re-uses the same name. So it's a lot harder to tell if I'm working with a version of the dataset from before or after a critical step in the code.
Oops, made a typo... here is the corrected code (which still doesn't work!)
PROC SQL;
Create table want
as Select a.*, b.*
From person_file a JOIN contract_file b
On COMPRESS(a.person_contract) = COMPRESS(b.contract_id);
QUIT;
Use the "kad" modifier in the COMPRESS function, so it keeps letters and digits; this will filter out stuff like carriage return and linefeed characters which are not visible.
I'm still getting the same results. Here is my updated code:
PROC SQL;
Create table want
as Select a.*, b.*
From person_file a JOIN contract_file b
On COMPRESS(a.person_contract,,'kad') = COMPRESS(b.contract_id,,'kad');
QUIT;
One simple check is to see if a WHERE statement will find records in both datasets.
If you run:
proc print data=person_file;
where person_contract='90091';
run;
proc print data=contract_file;
where contract_id='90091';
run;
Do both PROC PRINT steps return rows? Hopefully one of them does not, meaning the value '90091' is not in one of the datasets.
Well I think I found the problem... the version of the file I was manually looking at was not the most recent version. The most recent version had been re-created and that re-introduced an earlier error I had already corrected. In my defense, I'm working with someone else's code -- I always create a new dataset name when I add or remove cases, whereas this code re-uses the same name. So it's a lot harder to tell if I'm working with a version of the dataset from before or after a critical step in the code.
Make sense. Overwriting datasets:
data foo;
set foo;
run;
can cause all kinds of confusion. Glad you got it sorted.
Glad you identified the problem. Mark your own explanation as the solution, so that this topic no longer appears as unsolved.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.