BookmarkSubscribeRSS Feed
Wken1122
Obsidian | Level 7

Hi,
For some reason, my codes keep on removing the duplicate records which I intended to keep.
The scenario is, I have 2 sets of tables, by using a column(POLICYNUMBER) I would join them together. Both table would contain POLICYNUMBERs and there are some records which are duplicates. After processed by SAS, all the records that fits the requirements of the coding are displayed BUT records that fits the requirements but with duplicate POLICYNUMBER are removed from the final result.

The following is a part of the coding;

PROC SQL;
CREATE TABLE CLAIMS_DETAILS AS
(
SELECT ID AS CLAIMID,ODS_CLAIM.POLICYNUMBER,PRODUCTC,CLAIMNUMBER,CLAIMNOTIFIEDDATE,STATUS,OCCURENCEDATE,LOSSTYPE,
TOTALPAIDAMOUNT,INTIMATEDDATE,DATEADMITTED,DATEDISCHARGED,
DATEAPPROVED,LOSSDESCRIPTION,TRANSACTIONNUMBER,APPROVEDAMOUNT,
LOSSDESCRIPTIONCODE,CLAIMENTRYDATE,CAUSEOFEVENTCODE,CLOSEDATE,LOSSDATE,ILLNESSCODE,ILLNESS

FROM ODS_CLAIM, (SELECT DISTINCT(POLICYNUMBER) FROM CH_P02_SOURCE_5) TEMP
WHERE ODS_CLAIM.POLICYNUMBER=TEMP.POLICYNUMBER
AND STATUS IN ('Fully Paid')
);
QUIT;


Sorry if my explanation is a bit confusing cause I am not really sure how to put it into words.
Thank you.

22 REPLIES 22
Reeza
Super User

Note the word DISTINCT. That will select distinct records. And the WHERE clause may remove them as well. 

 

Those are where I'd start checking. 

Wken1122
Obsidian | Level 7

Oh I am sorry, I made a mistake in explaining the scenario.
POLICYNUMBER that are duplicates in one table, SOURCE should be removed.
ALL POLICYNUMBER in the other table should be retained.
That's why DISTINCT is used.

ballardw
Super User

Maybe what you are looking for is a LEFT (or Right) Join on a value.

 

data one;
   input policynumber;
datalines;
1
1
1
2
2
2
3
;
run;

data two;
   input policynumber value;
datalines;
1  345
1  456
1  678
1  2
1  222
2  123
2  444
3  999
3  888
3  777
4  202
;
run;

proc sql;
   create table want as
   select a.*, b.value
   from (select distinct policynumber from one) as a
        left join
        two as b
        on a.policynumber=b.policynumber
   ;
quit;

Notice that values in two for policynumber that don't appear in set one are not in the result and the ones selected from set two only appear once

 

Kurt_Bremser
Super User

In which of the datasets are duplicates found? If only in CH_P02_SOURCE_5, then the distinct clause will remove them.

Duplicates in ODS_CLAIM will remain, see this short example:

data claims;
input pnr $ clid $;
cards;
12345 111
12345 222
56789 333
76477 444
;
run;

data source;
input pnr $;
cards;
12345
12345
56789
56789
;
run;

proc sql;
create table want as
select claims.*
from claims, (select distinct pnr from source) temp
where claims.pnr = temp.pnr
;
quit;

And stop coding in capitals, that's an eyesore. Shouting at the SAS interpreter isn't necessary 😉

Wken1122
Obsidian | Level 7

the intent was to remove the duplicates from source 5 first then compare with ods_claims.
So it is ok for ods_claims' data to retain the duplicates but the final result that i got is where ALL the duplicate POLICYNUMBER (from both tables) are removed.

Patrick
Opal | Level 21

@Wken1122

I can't see anything in your SQL that would explain what you observe unless your other part in the join condition filters out the records.

... AND STATUS IN ('Fully Paid')

Could this explain the "lost" records? Have you checked if there is more than one record per policynumber with a status of "Fully Paid"?

Wken1122
Obsidian | Level 7

There are 1000+ records(after DISTINCT) that are under the status of FULLY PAID yet only 603 of them are generated at SAS.

Patrick
Opal | Level 21

@Wken1122

You're using an inner join so the only remaining explanation I'm having is that there are POLICYNUMBER values in the one table which don't exist in the other table.

 

If variable POLICYNUMBER is character then use upcase(strip(POLICYNUMBER)) before you join.

 

What you observe must have an explanation in your data. Investigate one of the dropped POLICYNUMBER and check if it really exists in both tables. And if it exists and looks exactly the same in both source tables then also consider non-printable characters - you can remove such characters using the compress() function with the right switches.

 

...and last but not least: Also "standardize" fully paid:

AND upcase(compress(STATUS)) IN ('FULLYPAID')
Wken1122
Obsidian | Level 7

@Patrick
I'm sorry, i mistakenly explained the scenario.
For the SOURCE_5, any duplicates value will be removed while duplicate values on the other table, CLAIMS MUST be retained.

Also I had already done the checking, from the SOURCE_5 table, exist a one to many relation to the CLAIMS table, where for example, one insurance policy could have several claims. Therefore, all the duplicate POLICYNUMBER on the SOURCE_5 are removed thru the DISTINCT function and the remaining POLICYNUMBER would and could match with more than one POLICYNUMBER on CLAIMS table.

variable POLICYNUMBER is alphanumeric so does it affect anything?

I had try to standardize STATUS but don't seems to change anything.

Patrick
Opal | Level 21

@Wken1122 wrote:

@Patrick
I'm sorry, i mistakenly explained the scenario.
For the SOURCE_5, any duplicates value will be removed while duplicate values on the other table, CLAIMS MUST be retained.

Also I had already done the checking, from the SOURCE_5 table, exist a one to many relation to the CLAIMS table, where for example, one insurance policy could have several claims. Therefore, all the duplicate POLICYNUMBER on the SOURCE_5 are removed thru the DISTINCT function and the remaining POLICYNUMBER would and could match with more than one POLICYNUMBER on CLAIMS table.

variable POLICYNUMBER is alphanumeric so does it affect anything?

I had try to standardize STATUS but don't seems to change anything.


 

@Wken1122

Yes, I understand, you're joining (inner join) two tables with a 1:M relationship. If you "loose" records then it's because there are values for POLICENUMBER in CLAIMS which don't exist in SOURCE_5. 

 

variable POLICYNUMBER is alphanumeric so does it affect anything?

Then to be on the safe side make sure you strip() or compress() the values as well as uppercase() them BEFORE you use them in the join condition.

Wken1122
Obsidian | Level 7
DATA CH_P02_SOURCE_5;
SET CH_P02_SOURCE_5;
WHERE COMPRESS(strip(POLICYNUMBER));
RUN;

Am I using it correctly? If yes then it didn't change a thing...
Patrick
Opal | Level 21

@Wken1122

If below doesn't return what you expect then I won't be able to help you any further without actually seeing the data.

WHERE 
  upcase(compress(compress(CLAIM.POLICYNUMBER,,'kw')))=upcase(compress(compress(TEMP.POLICYNUMBER,,'kw')))
  and upcase(compress(compress(STATUS,,'kw')))= 'FULLYPAID'
Tom
Super User Tom
Super User

It does not look like you need to do a join.

proc sql;
create table claims_details as
select id as claimid
     , policynumber
     , productc
     , claimnumber
     , claimnotifieddate
     , status
     , occurencedate
     , losstype
     , totalpaidamount
     , intimateddate
     , dateadmitted
     , datedischarged
     , dateapproved
     , lossdescription
     , transactionnumber
     , approvedamount
     , lossdescriptioncode
     , claimentrydate
     , causeofeventcode
     , closedate
     , lossdate
     , illnesscode
     , illness
from ods_claim
where status in ('Fully Paid')
  and policynumber in (select distinct policynumber from ch_p02_source_5)
;
quit;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 22 replies
  • 1865 views
  • 2 likes
  • 6 in conversation