DATA Step, Macro, Functions and more

More efficient way of proc sql joining ON a disjunction

Accepted Solution Solved
Reply
Contributor
Posts: 46
Accepted Solution

More efficient way of proc sql joining ON a disjunction

[ Edited ]

SAS Enterprise Guide 5.1

 

I have two similar data sets, say Table1 and Table2 (basically they structurally are the same data set similar variables (with different cell values) regarding different periods), with 64 columns and approximately 550K rows.

 

I'd like to do the following.

 

                  PROC SQL;
                  CREATE TABLE JOINED TABLE AS
                  SELECT Some variables
                  FROM Table1 T1
                  LEFT JOIN Table2 T2 ON ((T1.Var1=T2.Var1 AND T1.Var2=T2.Var2) OR T1.Var1=T2.Var2);
                  QUIT;

 

But it's just too slow.

 

Is there a (much) more efficient way of achieving the same result?

 

Thanks.

 

 

Here's an example.

 

DATA Data201606;
    LENGTH
        Var1             $ 14
        Var2             $ 14
        WholeBunchOfVars_201606 $ 24 ;
    FORMAT
        Var1             $CHAR14.
        Var2             $CHAR14.
        WholeBunchOfVars_201606 $CHAR24. ;
    INFORMAT
        Var1             $CHAR14.
        Var2             $CHAR14.
        WholeBunchOfVars_201606 $CHAR24. ;
    INFILE DATALINES4
        DLM='7F'x
        MISSOVER
        DSD ;
    INPUT
        Var1             : $CHAR14.
        Var2             : $CHAR14.
        WholeBunchOfVars_201606 : $CHAR24. ;
DATALINES4;
42001223829005420012238211001913672811803603960
42001223829005420012238291001913672811803603960
42001223829005420012238221001913672811803603960
42005537849003420055378411001728228741505881940
42000000420200 149282958163677760150000
42000001251100 137224761123272858132500
42000002571100 115006036153103523280000
42000002992100 176705643184314704200000
42000003721100 134284160134284151225000
;;;;

DATA Data201607;
    LENGTH
        Var1             $ 14
        Var2             $ 14
        WholeBunchOfVars_201607 $ 24 ;
    FORMAT
        Var1             $CHAR14.
        Var2             $CHAR14.
        WholeBunchOfVars_201607 $CHAR24. ;
    INFORMAT
        Var1             $CHAR14.
        Var2             $CHAR14.
        WholeBunchOfVars_201607 $CHAR24. ;
    INFILE DATALINES4
        DLM='7F'x
        MISSOVER
        DSD ;
    INPUT
        Var1             : $CHAR14.
        Var2             : $CHAR14.
        WholeBunchOfVars_201607 : $CHAR24. ;
DATALINES4;
42001223829005420012238221001913672811803603960
42001223829005420012238211001913672811803603960
42001223829005420012238291001913672811803603960
48073274008100...50000
48077274928100...150000
48120011318100...200000
48097297288100...200000
48001314298100...250000
48035295388100...575000
48089429268100...600000
48001012248100...600000
48075642348100...700000
42026909980100.000000000.25000
42001115759100.000000000145868575112500
42014081290100.00000000015346509325000
48000000048500.001241385.50000
48000000048400.001241385.150000
;;;;

PROC SQL;
CREATE TABLE JOINED_TABLE AS 
SELECT *
FROM Data201606 T1
LEFT JOIN Data201607 T2 ON ((T1.Var1=T2.Var1 AND T1.Var2=T2.Var2) OR T1.Var1=T2.Var2);
QUIT;

Accepted Solutions
Solution
‎07-29-2016 05:05 AM
Super User
Posts: 9,687

Re: More efficient way of proc sql joining ON a disjunction

You didn't post the output when T1.Var1=T2.Var2  yet ?




DATA Data201606;
    LENGTH
        Var1             $ 14
        Var2             $ 14
        WholeBunchOfVars_201606 $ 24 ;
    FORMAT
        Var1             $CHAR14.
        Var2             $CHAR14.
        WholeBunchOfVars_201606 $CHAR24. ;
    INFORMAT
        Var1             $CHAR14.
        Var2             $CHAR14.
        WholeBunchOfVars_201606 $CHAR24. ;
    INFILE DATALINES4
        DLM='7F'x
        MISSOVER
        DSD ;
    INPUT
        Var1             : $CHAR14.
        Var2             : $CHAR14.
        WholeBunchOfVars_201606 : $CHAR24. ;
DATALINES4;
42001223829005420012238211001913672811803603960
42001223829005420012238291001913672811803603960
42001223829005420012238221001913672811803603960
42005537849003420055378411001728228741505881940
42000000420200 149282958163677760150000
42000001251100 137224761123272858132500
42000002571100 115006036153103523280000
42000002992100 176705643184314704200000
42000003721100 134284160134284151225000
;;;;

DATA Data201607;
    LENGTH
        Var1             $ 14
        Var2             $ 14
        WholeBunchOfVars_201607 $ 24 ;
    FORMAT
        Var1             $CHAR14.
        Var2             $CHAR14.
        WholeBunchOfVars_201607 $CHAR24. ;
    INFORMAT
        Var1             $CHAR14.
        Var2             $CHAR14.
        WholeBunchOfVars_201607 $CHAR24. ;
    INFILE DATALINES4
        DLM='7F'x
        MISSOVER
        DSD ;
    INPUT
        Var1             : $CHAR14.
        Var2             : $CHAR14.
        WholeBunchOfVars_201607 : $CHAR24. ;
DATALINES4;
42001223829005420012238221001913672811803603960
42001223829005420012238211001913672811803603960
42001223829005420012238291001913672811803603960
48073274008100...50000
48077274928100...150000
48120011318100...200000
48097297288100...200000
48001314298100...250000
48035295388100...575000
48089429268100...600000
48001012248100...600000
48075642348100...700000
42026909980100.000000000.25000
42001115759100.000000000145868575112500
42014081290100.00000000015346509325000
48000000048500.001241385.50000
48000000048400.001241385.150000
;;;;


data want;
 if _n_=1 then do;
  if 0 then set Data201607;
  declare hash h1(dataset:'Data201607',multidata:'y');
  h1.definekey('Var1','Var2');
  h1.definedata('WholeBunchOfVars_201607 ');
  h1.definedone();
  
  declare hash h2(dataset:'Data201607',multidata:'y');
  h2.definekey('Var2');
  h2.definedata('WholeBunchOfVars_201607 ');
  h2.definedone();
 end;
call missing(of _all_);
set Data201606;
 rc=h1.find();
 if rc=0 then do;
  do while(rc=0);
   output;
   rc=h1.find_next();
  end;
 end;
  else do;
         rx=h2.find(key:var1);
         if rx=0 then do;
           do while(rx=0);
            output;
            rx=h1.find_next();
           end;
         end;
         else output;
        end;
drop rc rx;
run;



View solution in original post


All Replies
Super User
Posts: 9,687

Re: More efficient way of proc sql joining ON a disjunction


Hash Table.
Post your data and the output you want to see .

Contributor
Posts: 46

Re: More efficient way of proc sql joining ON a disjunction

Hi, Xia.
I'm not sure I understand what you mean with the output I want to see.
The output is huge, the inputs are equally large. I can't possibly do it by hand and SAS takes too long to do it.
Super User
Posts: 9,687

Re: More efficient way of proc sql joining ON a disjunction

I mean make a little example to show what you are looking for . Like :

data a;
input v1 v2 x;
cards;
1 1 1
1 2 2
2 1 2
;
run;
data b;
input v1 v2 x;
cards;
1 1 1
1 2 2
2 1 2
;
run;

output:
........
Contributor
Posts: 46

Re: More efficient way of proc sql joining ON a disjunction

I posted an example. Thank you for your time.

Super User
Super User
Posts: 7,413

Re: More efficient way of proc sql joining ON a disjunction

As @Ksharp has said, show an example of your data (as a datastep).  If the data is the same, why do you have it in two datasets?  Identify what needs to be separate, and then put the data into one dataset with a parameter/result setup in a normalised dataset;
PARAM                     RES

PERIOD 1 RESULT  123

PERIOD 2 RESULT   345

 

You can then do many things with that one dataset, by grouping on param for instance, or transposing using param.  Its rarely of any benefit to split data up.

Contributor
Posts: 46

Re: More efficient way of proc sql joining ON a disjunction

It's not actually the same data set. I thought what I said, even though technically false, would be understandable.
What I meant is that the data sets have the same variables, but the number of rows and cell values are different.
Super User
Super User
Posts: 7,413

Re: More efficient way of proc sql joining ON a disjunction

Yes, but if the datasets are structurally the same, with the only difference being a value Period, why not put the two datasets together, and have a variable called period.  That way you can transpose by period, sort and do datastep processing etc.

Contributor
Posts: 46

Re: More efficient way of proc sql joining ON a disjunction

[ Edited ]

I don't think I've made my self clear. The data sets are different on possibly a LOT of values. Furthermore, some variables have the period in their name, which makes them different variables though I was referring to them as being the same variable. Anyway, I posted an example.

Super User
Super User
Posts: 7,413

Re: More efficient way of proc sql joining ON a disjunction

Well, I don't know your data so I am only going on what I see posted.  It is something that comes up a lot, and it is due to the desire to put "data" in datasetnames (and variables).  If you have ever done databasing you will know about normalised fixed data structures which don't change.  These are far simpler to work with in SAS also, and SAS is built to do this kind of functioning in built.  Lets take your example test data, I create a single table with a fixed structure, putting the "data" from the dataset name into a proper variable in the data so that I can work with it like data.  I then present a simple example of transposing the data based on this data using the "data" items created - i.e. no merge, just a transpose with by group.

data data201606;
  length var1 var2 $14 wholebunchofvars_201606 $24;
  format var1 var2 $char14. wholebunchofvars_201606 $char24.;
  informat var1 var2 $char14. wholebunchofvars_201606 $char24.;
  infile datalines4 dlm='7f'x missover dsd;
  input var1 : $char14. var2 : $char14. wholebunchofvars_201606 : $char24.;
datalines4;
42001223829005420012238211001913672811803603960
42001223829005420012238291001913672811803603960
42001223829005420012238221001913672811803603960
42005537849003420055378411001728228741505881940
42000000420200 149282958163677760150000
42000001251100 137224761123272858132500
42000002571100 115006036153103523280000
42000002992100 176705643184314704200000
42000003721100 134284160134284151225000
;;;;
run;

data data201607;
  length var1 var2 $14 wholebunchofvars_201607 $24;
  format var1 var2 $char14. wholebunchofvars_201607 $char24.;
  informat var1 var2 $char14. wholebunchofvars_201607 $char24.;
  infile datalines4 dlm='7f'x missover dsd;
  input var1 : $char14. var2 : $char14. wholebunchofvars_201607 : $char24.;
datalines4;
42001223829005420012238221001913672811803603960
42001223829005420012238211001913672811803603960
42001223829005420012238291001913672811803603960
48073274008100...50000
48077274928100...150000
48120011318100...200000
48097297288100...200000
48001314298100...250000
48035295388100...575000
48089429268100...600000
48001012248100...600000
48075642348100...700000
42026909980100.000000000.25000
42001115759100.000000000145868575112500
42014081290100.00000000015346509325000
48000000048500.001241385.50000
48000000048400.001241385.150000
;;;;
run;

data working_data;
  set data201606 (in=a rename=(wholebunchofvars_201606=wholebunchofvars)) 
      data201607 (in=b rename=(wholebunchofvars_201607=wholebunchofvars));
  period=ifn(a,1,2);
  date=ifn(a,"01JUN2016"d,"01JUL2016"d);
  format date date9.;
run;

proc sort data=working_data;
  by var1 var2 period;
run;
proc transpose data=working_data out=t_work_data;
  by var1 var2;
  var wholebunchofvars;
  id period;
  idlabel period;
run;

 

Contributor
Posts: 46

Re: More efficient way of proc sql joining ON a disjunction

For some reason I'm unable to run your code. I attached the error to this message. Do you have any idea what character is causing this?


Prog.jpg
Super User
Super User
Posts: 7,413

Re: More efficient way of proc sql joining ON a disjunction

Yes, there was something odd when I copied out from the test data you gave to my environment, it shows a square between some of the data.  Must be the different system encodings.  Anyways, if you take your original test data, and copy from below the datalines to the ;;;; and replace that in the code you run, it should be fine.  

 

Could be something wierd with the website too.  This is what I see when I copy your test data to my system:

Capture.PNG

Contributor
Posts: 46

Re: More efficient way of proc sql joining ON a disjunction

No luck. The problem is when it rans the working_data data step.
Super User
Super User
Posts: 7,413

Re: More efficient way of proc sql joining ON a disjunction

[ Edited ]

Not sure what to suggest, I just copied it out of the website and ran it, worked fine.  Have you tried a new session?  Does the code look the same as on here, it just looks like some special character is messing it up.

Try copying and pasting the whole text into Notepad first.

Contributor
Posts: 46

Re: More efficient way of proc sql joining ON a disjunction

I did try a new session and also tried to copy to Notepad first, still didn't work. I too am at a loss.

Thanks for the tips.
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 16 replies
  • 508 views
  • 3 likes
  • 3 in conversation