Hey guys,
Im a student from Germany (so sorry for my bad english) and Im having a Code with, containing a SQL step with Inner Join, but the program runs tooooo slowly. I guess there could be a faster version (e.g. with a data step) for the following code:
proc sql;
create table work.Result as select distinct a.*, b.Value from
work.Table1 As a left join work.Table2 As B on (a.id = b.id and a.date = b.date);
quit;
The problem is that my program contains a macro and the code runs very often, so every second counts 🙂
I´d appreciate it so much if you could help me with that.
Thanks in advance,
Alex
@mrzlatan91, I'm quoting few of your statemwnts:
Table 1 contains a lot of observations, table 2 has always a part of it, lets say less than 1%
Every Date/Value Combination of Table 2 is Unique.
Try next solution:
proc sort data=table1 out=temp1 nodupkey;
by date id;
run;
/* if table1 is already sorted then replace above sort with:
data temp1;
set table1;
by date value;
if last.value; /* output distinct values */
run;
******************************/
proc sort data=table2; by date id; run; /* verify that table is sorted */
data want;
merge table1(in=in1)
table2(keep=value rename=(value=newvalue) in=in2);
by date id;
if in1;
run;
1) The macro has no effect on run time.
2) If the relation between the 2 tables is many to many,
then SQL is almost the only relevant to be used.
Alternative way is using hash method - keeping one of the tables in memeory.
if the relation is 1:N (or N:1) it is better to sort both tables and MERGE them in a sas datastep.
3) Long time may be the result of big tables with a lot of variables.
If you don't need all variables keep the relevant ones only.
Many ways ... a hash table would be faster, a format would be faster, it's even possible that sort+merge would be faster. Here's a key related question that influences whether these other approaches would be suitable:
Does Table2 contain just a single observation for each combination of ID and DATE?
Wow, that sounds good.
Table 1 contains a lot of observations, table 2 has always a part of it, lets say less than 1% (it depends on the iteration)
Unfortunately, I don´t know how to use the Hash Function to solve this problem.
Could anyone provide me a solution for my posted problem?
@mrzlatan91 wrote:
Wow, that sounds good.
Table 1 contains a lot of observations, table 2 has always a part of it, lets say less than 1% (it depends on the iteration)
Unfortunately, I don´t know how to use the Hash Function to solve this problem.
Could anyone provide me a solution for my posted problem?
Provide some example data in the form of data steps if you want tested code.
Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.
I realize that small data sets won't have the same time of execution issue but code suggestions have a much better chance of actually working if provided. Any sensitive data values should be masked to protect the guilty.
Sorry, I should have done this before. I´ve made a simplified version, because the size of my dataset is around 40GB.
Both Tables have the same structure, but Table 1 is much bigger than Table 2.
data table1;
input date $ value $ id $ fundid $;
datalines;
01OCT18 50 1234 10000
01OCT18 52 1234 10001
01OCT19 50 1234 10000
01OCT19 53 1234 10001
01OCT19 54 1234 10002
01NOV19 60 1236 10000
01NOV19 61 1236 10001
01NOV19 62 1236 10002
01OCT19 54 1234 10007
01NOV19 55 1236 10008
01NOV19 35 1236 10009
01NOV19 66 1236 10010
01DEC19 40 1236 10008
01DEC19 42 1236 10009
01DEC19 53 1236 10010
;
run;
data table2;
input date $ value $ id $ fundid $;
datalines;
01OCT18 50 1234 10000
01OCT18 52 1234 10000
01OCT19 50 1234 10000
;
run;
proc sql;
create table work.result as select distinct a.*, b.value as newValue from
work.table1 as a left join work.table2 as b on (a.id=b.id and a.date = b.date);
quit;
From your example data, I can see no double entries for combinations of id and date. If this is true for the whole dataset, you can use a sort + data step merge. And the distinct in the SQL would not be needed.
Oh **bleep**, I think this was the reason why I´ve made the distinct clause.
What would be the solution if the distinct part is needed? Hash?
The next question would be the size of table2. If sizeof(id + date + value) * nobs can fit into memory, creating a format or loading into a hash table would make sorting unnecessary.
It depends on the Iteration, because I analyze every fund in my dataset.
Some funds have 100,000 Observations, some funds only 100.
Could you provide me a Hash-Solution?
Take a look at the data you posted. Both data sets contain 2 observations for this combination:
ID=1234
date=01OCT2018
Looking at the data, how do you know which observations to match? How do you know which value to use from Table 2? Do you need to match on not just ID and DATE, but also on FUNDID?
Oh sorry,
this was a mistake. Every Date/Value Combination of Table 2 is Unique.
I think I should explain it a little bit. Table 1 contains Valuations (Value) for Stocks/Bonds (IDs), that are made by some Funds (Fundid) in a specific month (date).
Table 2 contains Valuations for Stocks of ONE specific Fund, so every Date/Value Combination is unique.
@mrzlatan91, I'm quoting few of your statemwnts:
Table 1 contains a lot of observations, table 2 has always a part of it, lets say less than 1%
Every Date/Value Combination of Table 2 is Unique.
Try next solution:
proc sort data=table1 out=temp1 nodupkey;
by date id;
run;
/* if table1 is already sorted then replace above sort with:
data temp1;
set table1;
by date value;
if last.value; /* output distinct values */
run;
******************************/
proc sort data=table2; by date id; run; /* verify that table is sorted */
data want;
merge table1(in=in1)
table2(keep=value rename=(value=newvalue) in=in2);
by date id;
if in1;
run;
Thank you very much!
Also decide if the distinct is really needed. It necessitates a sort by all variables contained in the output.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.