topic Re: Sql join / key variables already exists on file in SAS Programming

Sql join / key variables already exists on file

Cruise — Thu, 29 Jun 2017 20:20:05 GMT

Shortcut with alias: A.*, B.* instigates below warnings. Is it dangerous to ignore this warning? In this simple datasets of demo I see no harm checking in the resulting data. But it will be harder to check in my actual where my datasets have hundreds of variables and 4 key vars for full outer join.

WARNING: Variable patient already exists on file WORK.FULL_COAL.
WARNING: Variable medcode already exists on file WORK.FULL_COAL.
WARNING: Variable ID already exists on file WORK.FULL_COAL.
WARNING: Variable PATIENT already exists on file WORK.FULL_COAL.
WARNING: Variable medcode already exists on file WORK.FULL_COAL.
WARNING: Variable ID already exists on file WORK.FULL_COAL.

Ideally,

PROC SQL;
CREATE TABLE FULL_COAL AS
SELECT COALESCE(A.PATIENT, B.PATIENT) AS PATIENT, 
 COALESCE(A.MEDCODE, B.MEDCODE)       AS MEDCODE,
 COALESCE(A.ID, B.ID)                 AS ID, A.*, B.*
FROM DOSING A FULL JOIN EFFICACY B
ON A.PATIENT = B.PATIENT and 
   A.MEDCODE = B.MEDCODE and
   A.ID = B.ID;
QUIT;

Thanks.

Reference: used demo code and data appeared in SUGI paper: http://www2.sas.com/proceedings/sugi25/25/cc/25p109.pdf

Re: Sql join / key variables already exists on file

Reeza — Thu, 29 Jun 2017 00:33:42 GMT

IMO you should not have these warnings. If the variable is in both datasets are you 100% sure that the data is the same for every record you've joined on? If not how do you know which is correct. It's best to select your data directly and if you have variables with the same name you need to correct for that.

Re: Sql join / key variables already exists on file

Astounding — Thu, 29 Jun 2017 00:49:55 GMT

If you are rock solid sure of what SQL will do (and I'm not, I would have to test it which I can't do right now), you might be able to ignore the warnings. First, consider where the warnings come from.

You are creating three MEDCODE variables. One is calculated using COALESCE, one comes from A, and one comes from B. Which one is supposed to be included in your output table? Could the value for MEDCODE actually come from B, when A contains a different value? Could that happen when B contains a missing value?

Most likely, you just have to remove A.* and B.* from the SELECT statement. COALESCE can still work on the values from A and B, even if those values are not being selected. Are there any variables in A or B that you need to read in, that are not going to be used in a COALESCE function?

Re: Sql join / key variables already exists on file

Kurt_Bremser — Thu, 29 Jun 2017 06:45:25 GMT

Don't make indiscriminate use of the asterisk in SQL select lists, especially not when you're joining

- it causes the problems you just experience, and the result might be unpredictable (which value ends up in the output).

- in > 90% of cases you do not need all columns in your further analysis, so you waste disk space and therefore processing time.

- an unexpected change in the input datasets might go undetected in this step, causing havoc later on. Makes code less maintainable.

Maxim 25: Have a clean log.

Re: Sql join / key variables already exists on file

Cruise — Thu, 29 Jun 2017 20:32:37 GMT

@Kurt_Bremser

Hi Kurt,

So you suggest complete iteration of all variables I want as shown in below code which worked out without warning. I like the way you put it. However, in my actual data I have group variables such as for instance one of them is "other_diagnosis1-other_diagnosis_25" and also other is "procedure_diagnosis1-procedure15". I need both group variables which i will use in array to look up for a select cases of diseases for the next step. They're both character variables because ICD10 uses letter initials.

Do you know how to list these two group variables in the sql together?

I tried simply putting "other_diagnosis1-other_diagnosis_25, procedure_diagnosis1-procedure15" which had failed.

PROC SQL;
CREATE TABLE FULL_COAL AS
SELECT COALESCE(A.PATIENT, B.PATIENT) AS PATIENT, 
COALESCE(A.MEDCODE, B.MEDCODE) AS MEDCODE,
COALESCE(A.ID, B.ID) AS ID, A.DOSE_ID, A.DOSEAMT, A.DOSEFRQ, B.EFFIC_ID, B.VISIT, B.SCORE
FROM DOSING A FULL JOIN EFFICACY B
ON A.PATIENT = B.PATIENT and 
A.MEDCODE = B.MEDCODE and
A.ID = B.ID;
QUIT;

Re: Sql join / key variables already exists on file

Reeza — Thu, 29 Jun 2017 20:50:07 GMT

Here's a quick trick that will work but is a bit tedious, use the FEEDBACK option. Note how the alias works. Anyways, look at the log and you'll see the full code and can copy and paste that into your program and remove the variables you don't need.

proc sql FEEDBACK;
create table want as
select *
from sashelp.class as SC;
quit;

Re: Sql join / key variables already exists on file

Cruise — Thu, 29 Jun 2017 20:59:45 GMT

@Reeza

Thanks Reeza,

How would happen to COALESCE option then? I'd like to coalesce because I have to track the origin of missing in the final output by key variables (patient, medcode and id in this demo)?

data dosing;
input patient medcode $ dose_id doseamt dosefrq ID;
datalines;
1001    A      1     2     2 1
1003    A      2     1     2 2
1004    A      3     1     2 3
1004    B      4     4     2 4
1006    B      5     2     2 5
1007    A      6     2     1 6
1008    A      7     1     2 1
1009    A      8     2     2 1
;
run;
data efficacy;
input PATIENT medcode $ EFFIC_ID VISIT SCORE ID;
datalines;
1001 A 1 1 4 1
1002 A 2 1 5 2
1004 A 3 1 2 3
1004 A 4 2 1 4
1005 A 5 1 2 1
1009 A 6 1 5 0
;
run;

PROC SQL FEEDBACK;
CREATE TABLE FULL_COAL AS
SELECT COALESCE(A.PATIENT, B.PATIENT) AS PATIENT, 
       COALESCE(A.MEDCODE, B.MEDCODE) AS MEDCODE,
       COALESCE(A.ID,      B.ID)      AS ID............????
FROM DOSING A FULL JOIN EFFICACY B
 ON A.PATIENT = B.PATIENT and 
    A.MEDCODE = B.MEDCODE and
    A.ID = B.ID;
QUIT;

Re: Sql join / key variables already exists on file

Reeza — Thu, 29 Jun 2017 21:51:40 GMT

@SUNY_Maggie wrote:

@Reeza

Thanks Reeza,

How would happen to COALESCE option then? I'd like to coalesce because I have to track the origin of missing in the final output by key variables (patient, medcode and id in this demo)?

data dosing;
input patient medcode $ dose_id doseamt dosefrq ID;
datalines;
1001    A      1     2     2 1
1003    A      2     1     2 2
1004    A      3     1     2 3
1004    B      4     4     2 4
1006    B      5     2     2 5
1007    A      6     2     1 6
1008    A      7     1     2 1
1009    A      8     2     2 1
;
run;
data efficacy;
input PATIENT medcode $ EFFIC_ID VISIT SCORE ID;
datalines;
1001 A 1 1 4 1
1002 A 2 1 5 2
1004 A 3 1 2 3
1004 A 4 2 1 4
1005 A 5 1 2 1
1009 A 6 1 5 0
;
run;

PROC SQL FEEDBACK;
CREATE TABLE FULL_COAL AS
SELECT COALESCE(A.PATIENT, B.PATIENT) AS PATIENT, 
       COALESCE(A.MEDCODE, B.MEDCODE) AS MEDCODE,
       COALESCE(A.ID,      B.ID)      AS ID............????
FROM DOSING A FULL JOIN EFFICACY B
 ON A.PATIENT = B.PATIENT and 
    A.MEDCODE = B.MEDCODE and
    A.ID = B.ID;
QUIT;

Run it first with your original code that generated the Warning.

Copy the code from the log and remove the duplicate fields. The reason you're having issues is because you create a variable called PATIENT and also selected both variables called PATIENT from the datasets A/B.

However, Since your code joins on the same fields you're using in COALESCE you have a major flaw in your logic. The COALESCE will do nothing because the IDs already match given your join condition.

Re: Sql join / key variables already exists on file

Tom — Fri, 30 Jun 2017 03:31:02 GMT

You can spell out the variables you want to keep instead of using *.

CREATE TABLE FULL_COAL AS
  SELECT COALESCE(A.PATIENT, B.PATIENT) AS PATIENT
       , COALESCE(A.MEDCODE, B.MEDCODE) AS MEDCODE
       , COALESCE(A.ID, B.ID) AS ID
       , A.other1
       , B.other2
  FROM DOSING A
  FULL JOIN EFFICACY B
    ON A.PATIENT = B.PATIENT
   AND A.MEDCODE = B.MEDCODE
   AND A.ID = B.ID
;

You can use RENAME= dataset option on the input datasets and DROP= option on the output dataset.

CREATE TABLE FULL_COAL(drop=a1 a2 a3 b1 b2 b3) AS
  SELECT COALESCE(A.a1, B.b1) AS PATIENT
       , COALESCE(A.a2, B.b2) AS MEDCODE
       , COALESCE(A.a3, B.b3) AS ID
       , *
  FROM DOSING(rename=(patient=a1 medcode=a2 id=a3)) A
  FULL JOIN EFFICACY(rename=(patient=b1 medcode=b2 id=b3)) B
    ON A.A1 = B.B1
   AND A.A2 = B.B2
   AND A.A3 = B.B3
;

If you know that only the key variables are common you can use NATURAL join.

CREATE TABLE FULL_COAL AS
  SELECT *
  FROM DOSING A
  NATURAL FULL JOIN EFFICACY B
;

Re: Sql join / key variables already exists on file

Kurt_Bremser — Fri, 30 Jun 2017 06:27:10 GMT

If you do not have a many-to-many relationship between the tables, then sorting and using a data step merge would be the better solution:

proc sort data=dosing;
by patient medcode id;
run;

proc sort data=efficacy;
by patient medcode id;
run;

data full;
merge
  dosing
  efficacy
;
by patient medcode id;
run;

Since the WARNINGs from your original code show that only the key variables patient, mecode and id are on both datasets, this will work without unintended side effects and WARNINGs.

One really needs SQL only when there is a many-to-many relationship and a cartesian product has to be built.