Solved: PROC SQL match missing about 3% of cases even though they are in both ...

Wolverine

I'm trying to match a file with each person's contract number with another file that has information about the contracts. The problem is that about 3% of records in the person file are not matching with the contract file, even though their contract ID is present in both files (verified by manual review).

These are alphanumeric codes, and both variables are character. The person file variable is char9, while the contract file variable is char5. I thought there might be padding blanks that were preventing a match, but I added COMPRESS commands and it didn't make any difference. I ran frequencies of the results produced by the COMPRESS commands and they look right to me. All letters are capitalized in both files. Any other ideas about what might be preventing a match? Examples of the contracts that are not matching include 90091, H9585, and R5329.

Unfortunately, the data is confidential and I can't share any of it. I thought about creating example data, but this is only happening in about 3% of cases and I don't know how to make it happen with example data.

PROC SQL;
	Create table want
		as Select a.*, b.*
	From person_file. a JOIN contract_file b
	On COMPRESS(a.person_contract) = COMPRESS(b.contract_id);
QUIT;

Wolverine

Well I think I found the problem... the version of the file I was manually looking at was not the most recent version. The most recent version had been re-created and that re-introduced an earlier error I had already corrected. In my defense, I'm working with someone else's code -- I always create a new dataset name when I add or remove cases, whereas this code re-uses the same name. So it's a lot harder to tell if I'm working with a version of the dataset from before or after a critical step in the code.

View solution in original post

Wolverine

Oops, made a typo... here is the corrected code (which still doesn't work!)

PROC SQL;
	Create table want
		as Select a.*, b.*
	From person_file a JOIN contract_file b
	On COMPRESS(a.person_contract) = COMPRESS(b.contract_id);
QUIT;

Kurt_Bremser

Use the "kad" modifier in the COMPRESS function, so it keeps letters and digits; this will filter out stuff like carriage return and linefeed characters which are not visible.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Wolverine

I'm still getting the same results. Here is my updated code:

PROC SQL;
	Create table want
		as Select a.*, b.*
	From person_file a JOIN contract_file b
	On COMPRESS(a.person_contract,,'kad') = COMPRESS(b.contract_id,,'kad');
QUIT;

Quentin

One simple check is to see if a WHERE statement will find records in both datasets.

If you run:

proc print data=person_file;
  where person_contract='90091';
run;

proc print data=contract_file;
  where contract_id='90091';
run;

Do both PROC PRINT steps return rows? Hopefully one of them does not, meaning the value '90091' is not in one of the datasets.

Wolverine

Well I think I found the problem... the version of the file I was manually looking at was not the most recent version. The most recent version had been re-created and that re-introduced an earlier error I had already corrected. In my defense, I'm working with someone else's code -- I always create a new dataset name when I add or remove cases, whereas this code re-uses the same name. So it's a lot harder to tell if I'm working with a version of the dataset from before or after a critical step in the code.

Quentin

Make sense. Overwriting datasets:

data foo;
  set foo;
run;

can cause all kinds of confusion. Glad you got it sorted.

mkeintz

@Wolverine

Glad you identified the problem. Mark your own explanation as the solution, so that this topic no longer appears as unsolved.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

PROC SQL match missing about 3% of cases even though they are in both files

Re: PROC SQL match missing about 3% of cases even though they are in both files

Re: PROC SQL match missing about 3% of cases even though they are in both files

Re: PROC SQL match missing about 3% of cases even though they are in both files

Re: PROC SQL match missing about 3% of cases even though they are in both files

Re: PROC SQL match missing about 3% of cases even though they are in both files

Re: PROC SQL match missing about 3% of cases even though they are in both files

Re: PROC SQL match missing about 3% of cases even though they are in both files

Re: PROC SQL match missing about 3% of cases even though they are in both files

2025 SAS Hackathon: There is still time!

SAS Training: Just a Click Away