Hi, I have a very simple SQL procedure (a Cartesian join) on two tables:
proc sql;
create table Final as
select a.*, b.* from Base a
join Dict b
on trim(a.Name_2) contains trim(b.Name_1)
;
quit;
It works, but very slowly: my server needs a month to process this. Do you have an idea how to speed it up?
LinusH,
In my opinion, hash tables will not work with CONTAINS. Columns Name_1 and Name_2 are character strings, and the join searches within their contents.
ChrisNZ,
I checked FIND and INDEX and they were slower. What are magic=103 and the buffersize= option? Have you got a sample query?
I think this is a text mining problem, and I thought about using SAS Sentiment Analysis. What do you think?
As currently defined, there's not much of any type of analysis going on.
You're doing a lookup, but it depends on the output you want.
ASSUMING your dictionary contains single words and your other dataset contains phrases, a custom lookup that splits the data into words rather than leaving it as phrases might be faster. You do have to be careful, though: is Apples the same as Apple, and what about Applebee's, the restaurant? It's not an exact science.
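A rough sketch of that idea, using the Base/Dict tables and Name_1/Name_2 columns from your original query (and assuming Dict really does hold single words), would be to hash the dictionary and look up each word of the phrase exactly:

data MATCHES;
  set Base;
  if _N_=1 then do;
    declare hash D(dataset:'Dict');
    D.definekey('Name_1');
    D.definedata('Name_1');
    D.definedone();
    if 0 then set Dict;                      /* make Name_1 known to the compiler */
  end;
  /* exact-match each word of the phrase against the dictionary;  */
  /* note this will NOT equate Apples with Apple                  */
  do W=1 to countw(Name_2,' ');
    if D.find(key:scan(Name_2,W,' '))=0 then do;
      output;
      leave;                                 /* drop LEAVE to keep every matching word */
    end;
  end;
  drop W;
run;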
Alright, I ran a few tests on my PC.
Firstly, the process is CPU-bound for me, so there is no point trying to optimise I/O through better use of RAM or buffers.
Secondly, the CONTAINS operator is as fast as function index(trim()), while function find() is slower. So SAS added a new function that is slower than the old one? Why?
Thirdly, I tried a few DATA step matching logics to see how they compared with the SQL join: scanning an array, scanning the table directly, and scanning a hash table. The array scan is 20% faster than SQL, the point= direct access is 20% slower (but on par with SQL when the data is loaded into memory with sasfile; a sketch of that variant is below), and the hash match is 95% faster than SQL.
We have a clear winner here.
More benchmarks and performance tips in
https://www.amazon.com/High-Performance-SAS-Coding-Christian-Graffeuille/dp/1512397490
I may reuse this example in a future edition!
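The sasfile variant I mention above is not in the log; it just wraps the point= step in a SASFILE load, roughly like this (same 200-row BRANDS table):

sasfile BRANDS load;                         /* pin the lookup table in memory */
data OUT_DS_POINT_SASFILE;
  set PHRASES;
  do K=1 to 200;
    set BRANDS point=K;                      /* direct access now served from RAM */
    if index(trim(PHRASE), trim(BRAND)) then output;
  end;
  drop K;
run;
sasfile BRANDS close;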
Log below.
1051 data PHRASES;
1052 length PHRASE $2000;
1053 PHRASE=repeat('x',1000);
1054 do I=1 to 1e5; output; end ;
1055 run;
NOTE: The data set WORK.PHRASES has 100000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.30 seconds
user cpu time 0.06 seconds
system cpu time 0.26 seconds
memory 186.01k
OS Memory 11888.00k
Timestamp 31/07/2016 05:34:07 PM
1056
1057 data BRANDS; length BRAND $40;
1058 BRAND=repeat('y',25);
1059 do J=1 to 199; output; end ;
1060 BRAND=repeat('x',25); output;
1061 run;
NOTE: The data set WORK.BRANDS has 200 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
memory 181.54k
OS Memory 11888.00k
Timestamp 31/07/2016 05:34:07 PM
1062
1063 proc sql;
1064 create table OUT_SQL_CONTAINS as
1065 select * from PHRASES, BRANDS
1066 where trim(PHRASES.PHRASE) contains trim(BRANDS.BRAND);
NOTE: The execution of this query involves performing one or more Cartesian product joins that
can not be optimized.
NOTE: Table WORK.OUT_SQL_CONTAINS created, with 100000 rows and 4 columns.
1067 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 28.02 seconds
user cpu time 27.44 seconds
system cpu time 0.48 seconds
memory 539.80k
OS Memory 11888.00k
Timestamp 31/07/2016 05:34:35 PM
1068
1069 proc sql;
1070 create table OUT_SQL_INDEX as
1071 select * from PHRASES, BRANDS
1072 where index(trim(PHRASES.PHRASE),trim(BRANDS.BRAND));
NOTE: The execution of this query involves performing one or more Cartesian product joins that
can not be optimized.
NOTE: Table WORK.OUT_SQL_INDEX created, with 100000 rows and 4 columns.
1073 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 27.99 seconds
user cpu time 27.33 seconds
system cpu time 0.53 seconds
memory 552.67k
OS Memory 11888.00k
Timestamp 31/07/2016 05:35:03 PM
1074
1075 proc sql;
1076 create table OUT_SQL_FIND1 as
1077 select * from PHRASES, BRANDS
1078 where find(trim(PHRASES.PHRASE),trim(BRANDS.BRAND));
NOTE: The execution of this query involves performing one or more Cartesian product joins that
can not be optimized.
NOTE: Table WORK.OUT_SQL_FIND1 created, with 100000 rows and 4 columns.
1079 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 37.12 seconds
user cpu time 36.59 seconds
system cpu time 0.42 seconds
memory 552.67k
OS Memory 11888.00k
Timestamp 31/07/2016 05:35:40 PM
1080
1081 proc sql;
1082 create table OUT_SQL_FIND2 as
1083 select * from PHRASES, BRANDS
1084 where find(PHRASES.PHRASE,BRANDS.BRAND,'t');
NOTE: The execution of this query involves performing one or more Cartesian product joins that
can not be optimized.
NOTE: Table WORK.OUT_SQL_FIND2 created, with 100000 rows and 4 columns.
1085 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 34.66 seconds
user cpu time 34.14 seconds
system cpu time 0.60 seconds
memory 528.54k
OS Memory 11888.00k
Timestamp 31/07/2016 05:36:15 PM
1086
1087 data OUT_DS_ARRAY;
1088 array BRANDS [200] $40 _temporary_;
1089 if _N_=1 then do K=1 to 200;
1090 set BRANDS;
1091 BRANDS[K]=BRAND;
1092 end;
1093 set PHRASES;
1094 do K=1 to 200;
1095 if index(trim(PHRASE), trim(BRANDS[K])) then do;
1096 BRAND=BRANDS[K];
1097 output;
1098 end;
1099 end;
1100 drop K;
1101 run;
NOTE: There were 200 observations read from the data set WORK.BRANDS.
NOTE: There were 100000 observations read from the data set WORK.PHRASES.
NOTE: The data set WORK.OUT_DS_ARRAY has 100000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 24.82 seconds
user cpu time 24.11 seconds
system cpu time 0.70 seconds
memory 369.45k
OS Memory 11888.00k
Timestamp 31/07/2016 05:36:40 PM
1102
1103 data OUT_DS_POINT;
1104 set PHRASES;
1105 do K=1 to 200;
1106 set BRANDS point=K;
1107 if index(trim(PHRASE), trim(BRAND)) then output;
1108 end;
1109 drop K;
1110 run;
NOTE: There were 100000 observations read from the data set WORK.PHRASES.
NOTE: The data set WORK.OUT_DS_POINT has 100000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 31.10 seconds
user cpu time 28.56 seconds
system cpu time 2.46 seconds
memory 326.67k
OS Memory 11888.00k
Timestamp 31/07/2016 05:37:11 PM
1111
1112 data OUT_DS_HASH;
1113 set PHRASES;
1114 if _N_=1 then do;
1115 dcl hash BRANDS(dataset:'BRANDS');
1116 BRANDS.definekey('BRAND');
1117 BRANDS.definedata('BRAND','J');
1118 BRANDS.definedone();
1119 declare hiter IT('BRANDS');
1120 if 0 then set BRANDS;
1121 end;
1122 RC=IT.first();
1123 do while (RC = 0);
1124 if index(trim(PHRASE), trim(BRAND)) then output;
1125 RC=IT.next();
1126 end;
1127 drop RC;
1128 run;
NOTE: There were 200 observations read from the data set WORK.BRANDS.
NOTE: There were 100000 observations read from the data set WORK.PHRASES.
NOTE: The data set WORK.OUT_DS_HASH has 100000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.98 seconds
user cpu time 0.56 seconds
system cpu time 0.42 seconds
memory 701.21k
OS Memory 11888.00k
Timestamp 31/07/2016 05:37:12 PM
Hi,
Chris, I appreciate your help. You are amazing!
I tested your query. In my test CONTAINS is still faster than INDEX and hash tables. Maybe I'm doing something wrong?
On the other hand, I had a problem: I had to divide the table into two smaller ones.
CONTAINS: 14 min
OUT_DS_HASH: 21 min + 22 min = 43 min
Log below:
499 /****************************************/
500 /* Standard Query */
501 /****************************************/
502
503 proc sql;
504 create table F_Standard as
505 select a.*, b.* from Phrases a
506 join Brands b
507 on trim(a.PHRASE) contains trim(b.BRAND)
508 ;
NOTE: The execution of this query involves performing one or more Cartesian product joins that
can not be optimized.
NOTE: Compressing data set WORK.F_STANDARD decreased size by 62.50 percent.
Compressed is 6 pages; un-compressed would require 16 pages.
NOTE: Table WORK.F_STANDARD created, with 775 rows and 3 columns.
509 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 13:58.42
cpu time 13:58.16
510
511 /****************************************/
512 /* New Query */
513 /****************************************/
514
593 data OUT_DS_HASH;
594 set PHRASES;
595 if _N_=1 then do;
596 dcl hash BRANDS(dataset:'BRANDS');
597 BRANDS.definekey('BRAND');
598 BRANDS.definedata('BRAND','J');
599 BRANDS.definedone();
600 declare hiter IT('BRANDS');
601 if 0 then set BRANDS;
602 end;
603 RC=IT.first();
604 do while (RC = 0);
605 if index(trim(PHRASE), trim(BRAND)) then output;
606 RC=IT.next();
607 end;
608 drop RC;
609 run;
ERROR: Hash object added 17301488 items when memory failure occurred.
FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.
ERROR: The SAS System stopped processing this step because of insufficient memory.
NOTE: There were 1 observations read from the data set WORK.PHRASES.
WARNING: The data set WORK.OUT_DS_HASH may be incomplete. When this step was stopped there
were 0 observations and 3 variables.
WARNING: Data set WORK.OUT_DS_HASH was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 16.72 seconds
cpu time 16.41 seconds
532
533 /****************************************/
534 /* New Query part 2 */
535 /****************************************/
536
537 data Brands_00;
538 set Brands;
539 RANDOM_INT = INT(RANUNI(0)*1000000);
540 run;
NOTE: There were 21597444 observations read from the data set WORK.BRANDS.
NOTE: The data set WORK.BRANDS_00 has 21597444 observations and 3 variables.
NOTE: Compressing data set WORK.BRANDS_00 decreased size by 31.68 percent.
Compressed is 175669 pages; un-compressed would require 257113 pages.
NOTE: DATA statement used (Total process time):
real time 16.05 seconds
cpu time 16.02 seconds
541
542 data Brands_01;
543 set Brands_00;
544 where RANDOM_INT > 500000;
545 drop RANDOM_INT;
546 run;
NOTE: There were 10800201 observations read from the data set WORK.BRANDS_00.
WHERE RANDOM_INT>500000;
NOTE: The data set WORK.BRANDS_01 has 10800201 observations and 2 variables.
NOTE: Compressing data set WORK.BRANDS_01 decreased size by 30.13 percent.
Compressed is 81138 pages; un-compressed would require 116132 pages.
NOTE: DATA statement used (Total process time):
real time 8.95 seconds
cpu time 8.79 seconds
547
548 data Brands_02;
549 set Brands_00;
550 where RANDOM_INT <= 500000;
551 drop RANDOM_INT;
552 run;
NOTE: There were 10797243 observations read from the data set WORK.BRANDS_00.
WHERE RANDOM_INT<=500000;
NOTE: The data set WORK.BRANDS_02 has 10797243 observations and 2 variables.
NOTE: Compressing data set WORK.BRANDS_02 decreased size by 30.13 percent.
Compressed is 81117 pages; un-compressed would require 116100 pages.
NOTE: DATA statement used (Total process time):
real time 8.57 seconds
cpu time 8.54 seconds
553
554 data OUT_DS_HASH_01;
555 set PHRASES;
556 if _N_=1 then do;
557 dcl hash BRANDS_01(dataset:'BRANDS_01');
558 BRANDS_01.definekey('BRAND');
559 BRANDS_01.definedata('BRAND','J');
560 BRANDS_01.definedone();
561 declare hiter IT('BRANDS_01');
562 if 0 then set BRANDS_01;
563 end;
564 RC=IT.first();
565 do while (RC = 0);
566 if index(trim(PHRASE), trim(BRAND)) then output;
567 RC=IT.next();
568 end;
569 drop RC;
570 run;
NOTE: There were 10800201 observations read from the data set WORK.BRANDS_01.
NOTE: There were 330 observations read from the data set WORK.PHRASES.
NOTE: The data set WORK.OUT_DS_HASH_01 has 399 observations and 3 variables.
NOTE: Compressing data set WORK.OUT_DS_HASH_01 decreased size by 50.00 percent.
Compressed is 4 pages; un-compressed would require 8 pages.
NOTE: DATA statement used (Total process time):
real time 21:21.56
cpu time 21:19.62
571
572 data OUT_DS_HASH_02;
573 set PHRASES;
574 if _N_=1 then do;
575 dcl hash BRANDS_02(dataset:'BRANDS_02');
576 BRANDS_02.definekey('BRAND');
577 BRANDS_02.definedata('BRAND','J');
578 BRANDS_02.definedone();
579 declare hiter IT('BRANDS_02');
580 if 0 then set BRANDS_02;
581 end;
582 RC=IT.first();
583 do while (RC = 0);
584 if index(trim(PHRASE), trim(BRAND)) then output;
585 RC=IT.next();
586 end;
587 drop RC;
588 run;
NOTE: There were 10797243 observations read from the data set WORK.BRANDS_02.
NOTE: There were 330 observations read from the data set WORK.PHRASES.
NOTE: The data set WORK.OUT_DS_HASH_02 has 376 observations and 3 variables.
NOTE: Compressing data set WORK.OUT_DS_HASH_02 decreased size by 50.00 percent.
Compressed is 4 pages; un-compressed would require 8 pages.
NOTE: DATA statement used (Total process time):
real time 22:08.78
cpu time 22:02.82
589
590 Data OUT_DS_HASH;
591 set OUT_DS_HASH_02 OUT_DS_HASH_01;
592 run;
NOTE: There were 376 observations read from the data set WORK.OUT_DS_HASH_02.
NOTE: There were 399 observations read from the data set WORK.OUT_DS_HASH_01.
NOTE: The data set WORK.OUT_DS_HASH has 775 observations and 3 variables.
NOTE: Compressing data set WORK.OUT_DS_HASH decreased size by 62.50 percent.
Compressed is 6 pages; un-compressed would require 16 pages.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
Ok, but why? Sentiment analysis looks for positive/negative connotations in text, AFAIK. Is that related to your problem?
The SAS Text Analytics tool, an add-on for SAS Enterprise Miner, may be beneficial, but I doubt it.
Can each statement match multiple company names, or do you expect a single one per observation?
You need to provide more information about your data, or a good replica with similar fake data and cardinality, if you want us to reproduce your problem. Also, please use option FULLSTIMER.
Your new numbers of observations are radically different from what you first mentioned, but I get even better results for the hash table, which is now over a thousand times faster than SQL.
36 options fullstimer;
37
38 data PHRASES;
39 length PHRASE $2000;
40 PHRASE=repeat('x',1000);
41 do I=1 to 300; output; end ;
42 run;
NOTE: The data set WORK.PHRASES has 300 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
user cpu time 0.00 seconds
system cpu time 0.01 seconds
memory 186.10k
OS Memory 6768.00k
Timestamp 1/08/2016 10:10:44 PM
43
44 data BRANDS; length BRAND $40;
45 BRAND=repeat('y',25);
46 do J=1 to 1e6; output; end ;
47 BRAND=repeat('x',25); output;
48 run;
NOTE: The data set WORK.BRANDS has 1000001 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.17 seconds
user cpu time 0.09 seconds
system cpu time 0.07 seconds
memory 181.64k
OS Memory 6768.00k
Timestamp 1/08/2016 10:10:44 PM
49
50 proc sql;
51 create table OUT_SQL_CONTAINS as
52 select * from PHRASES, BRANDS
53 where trim(PHRASES.PHRASE) contains trim(BRANDS.BRAND);
NOTE: The execution of this query involves performing one or more Cartesian product joins that
can not be optimized.
NOTE: Table WORK.OUT_SQL_CONTAINS created, with 300 rows and 4 columns.
54 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 6:49.74
user cpu time 6:48.03
system cpu time 0.21 seconds
memory 1041.61k
OS Memory 7280.00k
Timestamp 1/08/2016 10:17:34 PM
55
56 data OUT_DS_HASH;
57 set PHRASES;
58 if _N_=1 then do;
59 dcl hash BRANDS(dataset:'BRANDS');
60 BRANDS.definekey('BRAND');
61 BRANDS.definedata('BRAND','J');
62 BRANDS.definedone();
63 declare hiter IT('BRANDS');
64 if 0 then set BRANDS;
65 end;
66 RC=IT.first();
67 do while (RC = 0);
68 if index(trim(PHRASE), trim(BRAND)) then output;
69 RC=IT.next();
70 end;
71 drop RC;
72 run;
NOTE: There were 1000001 observations read from the data set WORK.BRANDS.
NOTE: There were 300 observations read from the data set WORK.PHRASES.
NOTE: The data set WORK.OUT_DS_HASH has 300 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.32 seconds
user cpu time 0.25 seconds
system cpu time 0.06 seconds
memory 741.91k
OS Memory 6768.00k
Timestamp 1/08/2016 10:17:34 PM
How SAS can iterate through a hash table 300 million times in 0.3s is beyond me. Is this right?
That's one iteration every other CPU clock cycle! This sounds implausible.
Also very odd: the data step only used 7 MB of RAM, while the hash table alone should be 40+ MB.
Is this right? What is going on?
You defined BRANDS as one million repeats of the same value plus a single occurrence of another value. Since you also defined BRAND as the hash key, you end up with only 2 items in the hash.
So you have only 600 iterations in the loop (300 phrases x 2 items), not 300 million.
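You can check this with the hash object's num_items attribute, for example (a quick test, not from the original log):

data _null_;
  if 0 then set BRANDS;                      /* define BRAND and J in the PDV */
  declare hash H(dataset:'BRANDS');
  H.definekey('BRAND');
  H.definedata('BRAND','J');
  H.definedone();
  N=H.num_items;                             /* 2: duplicate keys are ignored when loading */
  put N=;
  stop;
run;

If you want the benchmark to exercise a genuine million-item hash, the BRANDS step would need distinct values, for instance something like BRAND=cats('y',J) instead of repeat('y',25).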