Babloo
Rhodochrosite | Level 12

Is there any way to manually validate the output produced by a join query? Each of the datasets below has around 2 lakhs (~200,000) observations, yet the output contains only ~1,000 observations. A snippet of my query is given below.

 

Please note that I tried reading only 1,000 observations (for testing) from each dataset, and when I then ran the query I ended up with 0 observations and no warnings or errors. Is there a better way to validate my results?

 

from
      work.D_CLM_CLAIM, 
      dextcom.F_CLM_CLAIM, 
      work.W701UAWK, 
      dextcom.D_POL_PRODUCT, 
      dextcom.D_CST_CUSTOMER, 
      dextcom.D_POL_POLICY, 
      dextcom.D_CLM_CLAIM_UNIT, 
      dextcom.D_COV_PERIL, 
      dextcom.D_CLM_CLAIM_AMT_TYPE
   where
      F_CLM_CLAIM.CLAIM_SK = D_CLM_CLAIM.CLAIM_SK
      and F_CLM_CLAIM.INTERMEDIARY_SK = W701UAWK.INTERMEDIARY_SK
      and F_CLM_CLAIM.PRODUCT_SK = D_POL_PRODUCT.PRODUCT_SK
      and F_CLM_CLAIM.CUSTOMER_SK = D_CST_CUSTOMER.CUSTOMER_SK
      and F_CLM_CLAIM.POLICY_SK = D_POL_POLICY.POLICY_SK
      and F_CLM_CLAIM.CLAIM_UNIT_SK = D_CLM_CLAIM_UNIT.CLAIM_UNIT_SK
      and F_CLM_CLAIM.PERIL_SK = D_COV_PERIL.PERIL_SK
      and F_CLM_CLAIM.CLAIM_AMT_TYPE_SK = D_CLM_CLAIM_AMT_TYPE.CLAIM_AMT_TYPE_SK
   order by
      D_CLM_CLAIM.SDI,
      D_COV_PERIL.PERIL_CD,
      D_CLM_CLAIM_AMT_TYPE.PART_CD,
      D_CLM_CLAIM_AMT_TYPE.CLAIM_AMT_TYPE_CD,
      D_CLM_CLAIM_AMT_TYPE.SUBTYPE_CD
        ;
  
4 REPLIES
RW9
Diamond | Level 26

There is no magic tool to make your code work; knowledge of your data and building code from a well-documented process are the two main drivers of success.  What does the documentation say about how these tables interact, and what the linking variables are?  If the documentation hasn't been done, well, there's your problem.

 

Also, unless there is a specific need to do everything in one block, simplify it and do it in several steps; you can then check each part of the merge.  You may also want to be more specific about the joins, i.e. specifying left/right joins and the like.  To me it reads more clearly if you state the base dataset and then how the other data is merged on, or do it with data steps:

data step1;
  /* both inputs must be sorted by claim_sk before a BY merge */
  merge dextcom.f_clm_claim work.d_clm_claim;
  by claim_sk;
run;

It's much easier to see the outcome of this join now that it is in a dataset.
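As a minimal sketch (assuming both inputs are already sorted by claim_sk), IN= dataset options make the match outcome of each record explicit, so you can count exactly what matched:

```sas
data step1;
  merge dextcom.f_clm_claim (in=in_fact)
        work.d_clm_claim    (in=in_dim);
  by claim_sk;
  /* record where each observation came from */
  length match_flag $9;
  if in_fact and in_dim then match_flag = 'BOTH';
  else if in_fact       then match_flag = 'FACT_ONLY';
  else                       match_flag = 'DIM_ONLY';
run;

/* one-line summary of how well the keys overlap */
proc freq data=step1;
  tables match_flag;
run;
```

If FACT_ONLY or DIM_ONLY dominates, the keys don't line up the way you expect, and that is where to look first.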

At the end of the day, the code needs to be simple for future programmers to read and understand.

Babloo
Rhodochrosite | Level 12

Thank you for the explanation. If I want to test with three datasets, how should I proceed? I tried the following code and ended up with the error below.

 


24         data step1;
25           merge dextcom.f_clm_claim d_clm_claim dextcom.D_POL_PRODUCT;
26           by claim_sk PRODUCT_SK;
27         run;

ERROR: BY variable PRODUCT_SK is not on input data set WORK.D_CLM_CLAIM.
ERROR: BY variable CLAIM_SK is not on input data set DEXTCOM.D_POL_PRODUCT.

If I continue testing with two datasets as you mentioned (see the example below), how can I combine the queries/datasets at the end to satisfy the join conditions in my initial post?

 

data step2;
  /* both inputs must be sorted by peril_sk before a BY merge */
  merge dextcom.f_clm_claim dextcom.d_cov_peril;
  by peril_sk;
run;

 

RW9
Diamond | Level 26

No, you don't seem to be following.  The point is not to have one big long chain of merges, but to split it up into a series of one-to-one table merges.  In my code I took the first part of your query:

from
      work.D_CLM_CLAIM, 
      dextcom.F_CLM_CLAIM
   where
      F_CLM_CLAIM.CLAIM_SK = D_CLM_CLAIM.CLAIM_SK

And put this into one step:

data step1;
  merge dextcom.f_clm_claim d_clm_claim;
  by claim_sk;
run;
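Note that MERGE with a BY statement requires both inputs to be sorted (or indexed) by the BY variable, which the SQL output is not guaranteed to be. A sketch of the sort steps that would typically precede that merge:

```sas
proc sort data=dextcom.f_clm_claim out=f_clm_claim_srt;
  by claim_sk;
run;

proc sort data=work.d_clm_claim out=d_clm_claim_srt;
  by claim_sk;
run;

data step1;
  merge f_clm_claim_srt d_clm_claim_srt;
  by claim_sk;
run;
```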

With this approach you would then take the step1 dataset and merge on the next dataset by the given variable, building up the data step by step.  In this way you can see exactly which records are merged in at each step, and hence debug the code.   So the next one would be:

data step2;
  merge step1 work.w701uawk;
  by intermediary_sk;
run;

Etc.
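A sketch of how the chain might continue: each time the BY variable changes, the accumulated dataset has to be re-sorted by the new key before the next merge (step names here just follow the pattern above):

```sas
/* step2 was created by intermediary_sk; re-sort by the next key */
proc sort data=step2;
  by product_sk;
run;

data step3;
  merge step2 dextcom.d_pol_product;
  by product_sk;
run;
```

Checking the observation count after each step shows you exactly which merge drops the records.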

jklaverstijn
Rhodochrosite | Level 12

My approach would be to run the query on input datasets with test data and a known-good reference result set. Make sure that your test data contains all scenarios of good and bad data that can possibly exist, so that your code provably covers them all.

 

If you are uncertain about the result with "just" live data, this approach may give you some clues. Using real data may actually make you miss certain scenarios that simply are not present in the data at that time (aka incomplete test coverage) but may be in the future. Also, you have no reference to compare your results with other than your intuition, and with many thousands of rows that intuition will rapidly fail you.
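A minimal sketch of that idea (dataset and variable names here are hypothetical): hand-build the expected result for your crafted test input, run the join on that input, and let PROC COMPARE report any differences:

```sas
/* hand-built expected result for the crafted test input */
data expected;
  input claim_sk peril_cd $;
  datalines;
1 FIRE
2 THEFT
;
run;

/* stand-in for the join output; in practice this would be
   the result of running your query on the test input */
data actual;
  set expected;
run;

/* any mismatch in values or observation counts is listed */
proc compare base=expected compare=actual listall;
run;
```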

 

I have seen many of these situations, and they have convinced me that proper testing is a science in its own right. A science, by the way, that I have not graduated in 😉

 

Hope this helps,

- Jan.


Discussion stats
  • 4 replies
  • 923 views
  • 4 likes
  • 3 in conversation