BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
mrzlatan91
Obsidian | Level 7

Hey guys, 

Im a student from Germany (so sorry for my bad english) and Im having a Code with, containing a SQL step with Inner Join, but the program runs tooooo slowly. I guess there could be a faster version (e.g. with a data step) for the following code:

 

proc sql;

create table work.Result as select distinct a.*, b.Value from 

work.Table1 As a left join work.Table2 As B on (a.id = b.id and a.date = b.date);

quit;

 

The problem is that my program contains a macro and the code runs very often, so every second counts 🙂

 

I´d appreciate it so much if you could help me with that.

 

Thanks in advance,

Alex

1 ACCEPTED SOLUTION

Accepted Solutions
Shmuel
Garnet | Level 18

@mrzlatan91, I'm quoting few of your statemwnts:

Table 1 contains a lot of observations, table 2 has always a part of it, lets say less than 1% 
 Every Date/Value Combination of Table 2 is Unique.

Try next solution:

proc sort data=table1 out=temp1 nodupkey;
    by date id;
run;
/* if table1 is already sorted then replace above sort with:
       data temp1;
         set table1;
              by date value;
                   if last.value;  /* output distinct values */
      run;
******************************/
proc sort data=table2; by date id; run; /* verify that table is sorted */

data want;
merge table1(in=in1)
           table2(keep=value rename=(value=newvalue) in=in2);
  by date id;
       if in1;
run;

View solution in original post

15 REPLIES 15
Shmuel
Garnet | Level 18

1) The macro has no effect on run time.

 

2) If the relation between the 2 tables is many to many,

    then SQL is almost the only relevant to be used.

    Alternative way is using hash method - keeping one of the tables in memeory.

 

   if the relation is 1:N (or N:1)  it is better to sort both tables and MERGE them in a sas datastep.

 

3) Long time may be the result of big tables with a lot of variables.

    If you don't need all variables keep the relevant ones only.
    

Astounding
PROC Star

Many ways ... a hash table would be faster, a format would be faster, it's even possible that sort+merge would be faster.  Here's a key related question that influences whether these other approaches would be suitable:

 

Does Table2 contain just a single observation for each combination of ID and DATE?

mrzlatan91
Obsidian | Level 7

Wow, that sounds good. 

 

Table 1 contains a lot of observations, table 2 has always a part of it, lets say less than 1% (it depends on the iteration)

Unfortunately, I don´t know how to use the Hash Function to solve this problem.

Could anyone provide me a solution for my posted problem?

 

 

ballardw
Super User

@mrzlatan91 wrote:

Wow, that sounds good. 

 

Table 1 contains a lot of observations, table 2 has always a part of it, lets say less than 1% (it depends on the iteration)

Unfortunately, I don´t know how to use the Hash Function to solve this problem.

Could anyone provide me a solution for my posted problem?

 

 


Provide some example data in the form of data steps if you want tested code.

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.

 

I realize that small data sets won't have the same time of execution issue but code suggestions have a much better chance of actually working if provided. Any sensitive data values should be masked to protect the guilty.

mrzlatan91
Obsidian | Level 7

Sorry, I should have done this before. I´ve made a simplified version, because the size of my dataset is around 40GB.

Both Tables have the same structure, but Table 1 is much bigger than Table 2.

 

data table1;
input date $ value $ id $ fundid $;
datalines;
01OCT18 50 1234 10000
01OCT18 52 1234 10001
01OCT19 50 1234 10000
01OCT19 53 1234 10001
01OCT19 54 1234 10002
01NOV19 60 1236 10000
01NOV19 61 1236 10001
01NOV19 62 1236 10002
01OCT19 54 1234 10007
01NOV19 55 1236 10008
01NOV19 35 1236 10009
01NOV19 66 1236 10010
01DEC19 40 1236 10008
01DEC19 42 1236 10009
01DEC19 53 1236 10010
;
run;

data table2;
input date $ value $ id $ fundid $;
datalines;
01OCT18 50 1234 10000
01OCT18 52 1234 10000
01OCT19 50 1234 10000
;
run;

proc sql;
create table work.result as select distinct a.*, b.value as newValue from
work.table1 as a left join work.table2 as b on (a.id=b.id and a.date = b.date);
quit;

Kurt_Bremser
Super User

From your example data, I can see no double entries for combinations of id and date. If this is true for the whole dataset, you can use a sort + data step merge. And the distinct in the SQL would not be needed.

mrzlatan91
Obsidian | Level 7

Oh **bleep**, I think this was the reason why I´ve made the distinct clause.

 

What would be the solution if the distinct part is needed? Hash?

Kurt_Bremser
Super User

The next question would be the size of table2. If sizeof(id + date + value) * nobs can fit into memory, creating a format or loading into a hash table would make sorting unnecessary.

mrzlatan91
Obsidian | Level 7

It depends on the Iteration, because I analyze every fund in my dataset.

Some funds have 100,000 Observations, some funds only 100.

Could you provide me a Hash-Solution?

Astounding
PROC Star

Take a look at the data you posted.  Both data sets contain 2 observations for this combination:

 

ID=1234 

date=01OCT2018

 

Looking at the data, how do you know which observations to match?  How do you know which value to use from Table 2?  Do you need to match on not just ID and DATE, but also on FUNDID?

mrzlatan91
Obsidian | Level 7

Oh sorry,

this was a mistake. Every Date/Value Combination of Table 2 is Unique.

 

I think I should explain it a little bit. Table 1 contains Valuations (Value) for Stocks/Bonds (IDs), that are made by some Funds (Fundid) in a specific month (date).

Table 2 contains Valuations for Stocks of ONE specific Fund, so every Date/Value Combination is unique.

 

 

Shmuel
Garnet | Level 18

@mrzlatan91, I'm quoting few of your statemwnts:

Table 1 contains a lot of observations, table 2 has always a part of it, lets say less than 1% 
 Every Date/Value Combination of Table 2 is Unique.

Try next solution:

proc sort data=table1 out=temp1 nodupkey;
    by date id;
run;
/* if table1 is already sorted then replace above sort with:
       data temp1;
         set table1;
              by date value;
                   if last.value;  /* output distinct values */
      run;
******************************/
proc sort data=table2; by date id; run; /* verify that table is sorted */

data want;
merge table1(in=in1)
           table2(keep=value rename=(value=newvalue) in=in2);
  by date id;
       if in1;
run;
mrzlatan91
Obsidian | Level 7

Thank you very much!

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 15 replies
  • 18802 views
  • 0 likes
  • 5 in conversation