08-11-2016
pietro342
Calcite | Level 5
Member since
07-30-2016
- 5 Posts
- 0 Likes Given
- 0 Solutions
- 0 Likes Received
Latest posts by pietro342
Subject | Views | Posted
Re: Big Data - Cartesian | 1513 | 08-09-2016 10:07 AM
Re: Big Data - Cartesian | 4333 | 08-01-2016 05:02 AM
Re: Big Data - Cartesian | 4369 | 07-30-2016 10:50 AM
Re: Big Data - Cartesian | 4348 | 07-30-2016 07:47 AM
Big Data - Cartesian | 4359 | 07-30-2016 05:01 AM
Activity Feed for pietro342
- Posted Re: Big Data - Cartesian on SAS Programming. 08-09-2016 10:07 AM
- Posted Re: Big Data - Cartesian on SAS Programming. 08-01-2016 05:02 AM
- Posted Re: Big Data - Cartesian on SAS Programming. 07-30-2016 10:50 AM
- Posted Re: Big Data - Cartesian on SAS Programming. 07-30-2016 07:47 AM
- Posted Big Data - Cartesian on SAS Programming. 07-30-2016 05:01 AM
08-09-2016
10:07 AM
Hi JohnHoughton, I'm sorry I did not reply to your messages. I tested your solution 15 minutes after your post and I could not believe it. It works excellently! Unfortunately, I had to go back to my dictionary; I'm just finishing it. I will test your solution today and post the log 🙂 ChrisNZ, JohnHoughton, thank you very much. You are brilliant!
08-01-2016
05:02 AM
Hi Chris, I appreciate your help. You are amazing! I tested your query. In my test, CONTAINS is still faster than INDEX with hash tables. Maybe I'm doing something wrong? On the other hand, I ran into a problem: I had to split the table into two smaller ones.

CONTAINS: 14 min
OUT_DS_HASH: 21 min + 22 min = 43 min

Log below:

499  /****************************************/
500  /* Standard Query */
501  /****************************************/
502
503  proc sql;
504     create table F_Standard as
505     select a.*, b.* from Phrases a
506     join Brands b
507     on trim(a.PHRASE) contains trim(b.BRAND)
508  ;
NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
NOTE: Compressing data set WORK.F_STANDARD decreased size by 62.50 percent. Compressed is 6 pages; un-compressed would require 16 pages.
NOTE: Table WORK.F_STANDARD created, with 775 rows and 3 columns.
509  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           13:58.42
      cpu time            13:58.16

510
511  /****************************************/
512  /* New Query */
513  /****************************************/
514
593  data OUT_DS_HASH;
594     set PHRASES;
595     if _N_=1 then do;
596        dcl hash BRANDS(dataset:'BRANDS');
597        BRANDS.definekey('BRAND');
598        BRANDS.definedata('BRAND','J');
599        BRANDS.definedone();
600        declare hiter IT('BRANDS');
601        if 0 then set BRANDS;
602     end;
603     RC=IT.first();
604     do while (RC = 0);
605        if index(trim(PHRASE), trim(BRAND)) then output;
606        RC=IT.next();
607     end;
608     drop RC;
609  run;
ERROR: Hash object added 17301488 items when memory failure occurred.
FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.
ERROR: The SAS System stopped processing this step because of insufficient memory.
NOTE: There were 1 observations read from the data set WORK.PHRASES.
WARNING: The data set WORK.OUT_DS_HASH may be incomplete. When this step was stopped there were 0 observations and 3 variables.
WARNING: Data set WORK.OUT_DS_HASH was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
      real time           16.72 seconds
      cpu time            16.41 seconds

532
533  /****************************************/
534  /* New Query part 2 */
535  /****************************************/
536
537  data Brands_00;
538     set Brands;
539     RANDOM_INT = INT(RANUNI(0)*1000000);
540  run;
NOTE: There were 21597444 observations read from the data set WORK.BRANDS.
NOTE: The data set WORK.BRANDS_00 has 21597444 observations and 3 variables.
NOTE: Compressing data set WORK.BRANDS_00 decreased size by 31.68 percent. Compressed is 175669 pages; un-compressed would require 257113 pages.
NOTE: DATA statement used (Total process time):
      real time           16.05 seconds
      cpu time            16.02 seconds

541
542  data Brands_01;
543     set Brands_00;
544     where RANDOM_INT > 500000;
545     drop RANDOM_INT;
546  run;
NOTE: There were 10800201 observations read from the data set WORK.BRANDS_00.
      WHERE RANDOM_INT>500000;
NOTE: The data set WORK.BRANDS_01 has 10800201 observations and 2 variables.
NOTE: Compressing data set WORK.BRANDS_01 decreased size by 30.13 percent. Compressed is 81138 pages; un-compressed would require 116132 pages.
NOTE: DATA statement used (Total process time):
      real time           8.95 seconds
      cpu time            8.79 seconds

547
548  data Brands_02;
549     set Brands_00;
550     where RANDOM_INT <= 500000;
551     drop RANDOM_INT;
552  run;
NOTE: There were 10797243 observations read from the data set WORK.BRANDS_00.
      WHERE RANDOM_INT<=500000;
NOTE: The data set WORK.BRANDS_02 has 10797243 observations and 2 variables.
NOTE: Compressing data set WORK.BRANDS_02 decreased size by 30.13 percent. Compressed is 81117 pages; un-compressed would require 116100 pages.
NOTE: DATA statement used (Total process time):
      real time           8.57 seconds
      cpu time            8.54 seconds

553
554  data OUT_DS_HASH_01;
555     set PHRASES;
556     if _N_=1 then do;
557        dcl hash BRANDS_01(dataset:'BRANDS_01');
558        BRANDS_01.definekey('BRAND');
559        BRANDS_01.definedata('BRAND','J');
560        BRANDS_01.definedone();
561        declare hiter IT('BRANDS_01');
562        if 0 then set BRANDS_01;
563     end;
564     RC=IT.first();
565     do while (RC = 0);
566        if index(trim(PHRASE), trim(BRAND)) then output;
567        RC=IT.next();
568     end;
569     drop RC;
570  run;
NOTE: There were 10800201 observations read from the data set WORK.BRANDS_01.
NOTE: There were 330 observations read from the data set WORK.PHRASES.
NOTE: The data set WORK.OUT_DS_HASH_01 has 399 observations and 3 variables.
NOTE: Compressing data set WORK.OUT_DS_HASH_01 decreased size by 50.00 percent. Compressed is 4 pages; un-compressed would require 8 pages.
NOTE: DATA statement used (Total process time):
      real time           21:21.56
      cpu time            21:19.62

571
572  data OUT_DS_HASH_02;
573     set PHRASES;
574     if _N_=1 then do;
575        dcl hash BRANDS_02(dataset:'BRANDS_02');
576        BRANDS_02.definekey('BRAND');
577        BRANDS_02.definedata('BRAND','J');
578        BRANDS_02.definedone();
579        declare hiter IT('BRANDS_02');
580        if 0 then set BRANDS_02;
581     end;
582     RC=IT.first();
583     do while (RC = 0);
584        if index(trim(PHRASE), trim(BRAND)) then output;
585        RC=IT.next();
586     end;
587     drop RC;
588  run;
NOTE: There were 10797243 observations read from the data set WORK.BRANDS_02.
NOTE: There were 330 observations read from the data set WORK.PHRASES.
NOTE: The data set WORK.OUT_DS_HASH_02 has 376 observations and 3 variables.
NOTE: Compressing data set WORK.OUT_DS_HASH_02 decreased size by 50.00 percent. Compressed is 4 pages; un-compressed would require 8 pages.
NOTE: DATA statement used (Total process time):
      real time           22:08.78
      cpu time            22:02.82

589
590  Data OUT_DS_HASH;
591     set OUT_DS_HASH_02 OUT_DS_HASH_01;
592  run;
NOTE: There were 376 observations read from the data set WORK.OUT_DS_HASH_02.
NOTE: There were 399 observations read from the data set WORK.OUT_DS_HASH_01.
NOTE: The data set WORK.OUT_DS_HASH has 775 observations and 3 variables.
NOTE: Compressing data set WORK.OUT_DS_HASH decreased size by 62.50 percent. Compressed is 6 pages; un-compressed would require 16 pages.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
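Since the memory failure came from loading the 21.5-million-row BRANDS table into the hash, a minimal sketch of an alternative, assuming PHRASES stays small (a few hundred rows, as in this test; OUT_DS_REV is a placeholder name): reverse the roles, keep the phrases in the hash, and stream BRANDS from disk.

/* Sketch, assumption: PHRASES is small enough to hold in memory.       */
/* Keep the phrases in the hash and read the 21.5M-row BRANDS table     */
/* sequentially, so the hash never has to hold the full dictionary.     */
data OUT_DS_REV;
   set BRANDS;                          /* stream the big table          */
   if _N_=1 then do;
      dcl hash PH(dataset:'PHRASES');
      PH.definekey('PHRASE');
      PH.definedata('PHRASE');
      PH.definedone();
      declare hiter IT('PH');
      if 0 then set PHRASES;            /* compile-time variable setup   */
   end;
   RC=IT.first();
   do while (RC = 0);                   /* test every phrase for this brand */
      if index(trim(PHRASE), trim(BRAND)) then output;
      RC=IT.next();
   end;
   drop RC;
run;

This still performs the full Cartesian comparison, but memory use is bounded by the small table, so no splitting of BRANDS would be needed.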
07-30-2016
10:50 AM
Maybe I wasn't clear enough 😞 I'm looking for company names in texts, so the dictionary entries are not single words. For example: "A global leader in the beverage industry, the <Coca-Cola company> offers hundreds of brands, including soft drinks, fruit ..."
07-30-2016
07:47 AM
LinusH, in my opinion hash tables will not work with CONTAINS. Columns name_1 and name_2 are strings, and the query searches their contents rather than matching exact keys. ChrisNZ, I checked FIND and INDEX and they were slower. What are magic=103 and option buffersize=? Have you got a sample query? I think this is a text-mining problem, and I thought about using SAS Sentiment Analysis; what do you think?
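To illustrate the point, a minimal sketch with made-up in-line data (not from this thread): an exact-key hash lookup misses a name embedded in a longer string, while INDEX in an iterator loop performs the substring test that CONTAINS performs.

/* Sketch, hypothetical data: hash lookup matches exact keys only,      */
/* so 'Coca-Cola' inside a longer string is not found; INDEX in an      */
/* iterator loop does the substring test instead.                       */
data _null_;
   length NAME_1 $20 NAME_2 $60;
   dcl hash D();
   D.definekey('NAME_1');
   D.definedone();
   declare hiter IT('D');
   NAME_1 = 'Coca-Cola';
   rc = D.add();                        /* one dictionary entry          */
   NAME_2 = 'the Coca-Cola company offers hundreds of brands';
   if D.find(key:NAME_2) ne 0 then put 'Exact-key hash lookup: no match';
   rc = IT.first();
   do while (rc = 0);
      if index(NAME_2, trim(NAME_1)) > 0 then put 'INDEX substring search: match';
      rc = IT.next();
   end;
run;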
07-30-2016
05:01 AM
Hi, I have a very simple SQL procedure (a Cartesian join) on two tables:

Base – 400K rows, 2 columns
Dict – 25 million rows, 2 columns

proc sql;
   create table Final as
   select a.*, b.*
   from Base a
   join Dict b
   on trim(a.Name_2) contains trim(b.Name_1);
quit;

It works, but very slowly: the join implies roughly 400,000 × 25,000,000 = 10^13 substring comparisons, and my server needs one month to process this. Do you have an idea how to speed it up?
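For reference, a quick diagnostic sketch (the _METHOD option is standard PROC SQL; same query as above): it prints the chosen join plan in the log and confirms whether SAS falls back to an unoptimized step-loop Cartesian join for CONTAINS.

/* _method prints the query plan, showing the Cartesian (step-loop)     */
/* join that a CONTAINS condition forces here.                          */
proc sql _method;
   create table Final as
   select a.*, b.*
   from Base a
   join Dict b
   on trim(a.Name_2) contains trim(b.Name_1);
quit;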