Solved: PROC SQL match missing about 3% of cases even though they are in both ...

Wolverine · Posted 08-26-2025 01:12 PM

I'm trying to match a file with each person's contract number with another file that has information about the contracts. The problem is that about 3% of records in the person file are not matching with the contract file, even though their contract ID is present in both files (verified by manual review).

These are alphanumeric codes, and both variables are character. The person file variable is char9, while the contract file variable is char5. I thought there might be padding blanks that were preventing a match, but I added COMPRESS commands and it didn't make any difference. I ran frequencies of the results produced by the COMPRESS commands and they look right to me. All letters are capitalized in both files. Any other ideas about what might be preventing a match? Examples of the contracts that are not matching include 90091, H9585, and R5329.

Unfortunately, the data is confidential and I can't share any of it. I thought about creating example data, but this is only happening in about 3% of cases and I don't know how to make it happen with example data.

PROC SQL;
	Create table want
		as Select a.*, b.*
	From person_file. a JOIN contract_file b
	On COMPRESS(a.person_contract) = COMPRESS(b.contract_id);
QUIT;

Wolverine · Posted 08-26-2025 04:15 PM

Well I think I found the problem... the version of the file I was manually looking at was not the most recent version. The most recent version had been re-created and that re-introduced an earlier error I had already corrected. In my defense, I'm working with someone else's code -- I always create a new dataset name when I add or remove cases, whereas this code re-uses the same name. So it's a lot harder to tell if I'm working with a version of the dataset from before or after a critical step in the code.

View solution in original post

Wolverine · Posted 08-26-2025 01:16 PM

Oops, made a typo... here is the corrected code (which still doesn't work!)

PROC SQL;
	Create table want
		as Select a.*, b.*
	From person_file a JOIN contract_file b
	On COMPRESS(a.person_contract) = COMPRESS(b.contract_id);
QUIT;

Kurt_Bremser · Posted 08-26-2025 01:48 PM

Use the "kad" modifier in the COMPRESS function, so it keeps letters and digits; this will filter out stuff like carriage return and linefeed characters which are not visible.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Wolverine · Posted 08-26-2025 02:00 PM

I'm still getting the same results. Here is my updated code:

PROC SQL;
	Create table want
		as Select a.*, b.*
	From person_file a JOIN contract_file b
	On COMPRESS(a.person_contract,,'kad') = COMPRESS(b.contract_id,,'kad');
QUIT;

Quentin · Posted 08-26-2025 03:34 PM

One simple check is to see if a WHERE statement will find records in both datasets.

If you run:

proc print data=person_file;
  where person_contract='90091';
run;

proc print data=contract_file;
  where contract_id='90091';
run;

Do both PROC PRINT steps return rows? Hopefully one of them does not, meaning the value '90091' is not in one of the datasets.

Wolverine · Posted 08-26-2025 04:15 PM

Well I think I found the problem... the version of the file I was manually looking at was not the most recent version. The most recent version had been re-created and that re-introduced an earlier error I had already corrected. In my defense, I'm working with someone else's code -- I always create a new dataset name when I add or remove cases, whereas this code re-uses the same name. So it's a lot harder to tell if I'm working with a version of the dataset from before or after a critical step in the code.

Quentin · Posted 08-26-2025 04:57 PM

Make sense. Overwriting datasets:

data foo;
  set foo;
run;

can cause all kinds of confusion. Glad you got it sorted.

mkeintz · Posted 08-27-2025 08:52 PM

@Wolverine

Glad you identified the problem. Mark your own explanation as the solution, so that this topic no longer appears as unsolved.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

PROC SQL match missing about 3% of cases even though they are in both files

Re: PROC SQL match missing about 3% of cases even though they are in both files

Re: PROC SQL match missing about 3% of cases even though they are in both files

Re: PROC SQL match missing about 3% of cases even though they are in both files

Re: PROC SQL match missing about 3% of cases even though they are in both files

Re: PROC SQL match missing about 3% of cases even though they are in both files

Re: PROC SQL match missing about 3% of cases even though they are in both files

Re: PROC SQL match missing about 3% of cases even though they are in both files

Re: PROC SQL match missing about 3% of cases even though they are in both files

Registration is open

Registration is open

SAS Training: Just a Click Away