DATA Step, Macro, Functions and more

How to reduce system processing time, Index on Specific Variables (on huge data)

Reply
Contributor
Posts: 46

How to reduce system processing time, Index on Specific Variables (on huge data)

Dear All,

I have millions of recorde in a Dataset, need to get selected Names from Name field using contains or Like function in Proc sql;

Ex:-

Name

XXXXX ACADEMY

YYYY AGENCES YYYY

ZZZZ ZZZZZ ZZZZ

AAA COMPANY

BBBB BBBB

;

Out put should be like

Name

XXXXX ACADEMY

YYYY AGENCES YYYY

AAA COMPANY

For this Quirey am using

proc sql;

create table want as select Name

from have

where Name contains "AGENCES" or

Name contains "COMPANY" or

Name contains "ACADEMY" or

Name contains " AGENCES " or

Name contains " COMPANY " or

Name contains " ACADEMY " or


Name contains " AGENCES" or

Name contains " COMPANY" or

Name contains " ACADEMY"

;

quit;

I Have orizinal Name Field Observation Count is 1,00,00,000

                                 selected Name(Contains) count is 5,000

Instead of Conains I can use "Like" function, but it is showing in log window as Macro name is not resolved for some specific Names:

Ex: - "%X", "%SYS"

Please help me out ,

how can i create Index on Name field and reduce the system processing time.

Thanks in Advance.

Trusted Advisor
Posts: 3,212

Re: Index on Specific Variables (on huge data)

forget SQL go for NOSQL.... ; <) 

The reasons:
- Indexing will only work for fixed values not for fuzzy ones.

- You want to process all records not just a subset. The SQL (OLTP) is designed for simple subsets.

How to proceed:

- If your source is a RDBMS check whether it can be solve with some native sql dialect. Using SQL pass-through

  When not all, data must be copied to SAS.  (time/resources needed)

- Using SAS reducing processing speed is checking the system is tuned for your process (bufsize memsize and more)

- Go for sequential processing when more evaluation logic is needed (contains, like, perl expressions) it is data-cleansing.

Using sequential IO is much faster than random hitting more than 20% of the data (to evaluate) sequential will win. The datastep processes sequential, proc sql random. Having more than 1 processor you can think on splitting a big datastep accordingly and merge the results. The Same concept as hadoop gmapreduce.

---->-- ja karman --<-----
Super User
Posts: 10,028

Re: How to reduce system processing time, Index on Specific Variables (on huge data)

Make a dataset to hold these index value ,and then contains operator .

data ind;

input ind : $40.;

cards;

COMPANY

ACADEMY

AGENCES

COMPANY

;

run;

proc sql;

select * from have,ind  where ACADEMY contains strip(ind) ;

quit;

Xia Keshan

Trusted Advisor
Posts: 3,212

Re: How to reduce system processing time, Index on Specific Variables (on huge data)

Xia please also give an explanation on what will happen when running this.
Always fun to have several approaches to choose from and than evaluate them readability / generic applicable / executing performance

---->-- ja karman --<-----
Super User
Super User
Posts: 7,954

Re: How to reduce system processing time, Index on Specific Variables (on huge data)

I believe, and correct me If I am wrong, he is doing a merge based on of IND to have based on the text being found, hence there would be an additional column in the output.  The merge is only where both exist.  Nice.  I have run a couple of test though and a straight forward where clause does actually seem to run a fair bit faster than the merge approach:

data have;
  do aval=1 to 1000000;
    mytext="XXXXX ACADEMY"; output;
    mytext="YYYY AGENCES YYYY"; output;
    mytext="ZZZZ ZZZZZ ZZZZ"; output;
    mytext="AAA COMPANY"; output;
    mytext="BBBB BBBB"; output;
  end;
run;
data ind;
input ind : $40.;
cards;
COMPANY
ACADEMY
AGENCES
COMPANY
;
run;

proc sql;
create table WANT as
select mytext from have  where mytext like '%COMPANY%' or mytext like '%ACADEMY%' or mytext like '%AGENCES%' or mytext like '%COMPANY%';
quit;

proc sql;
create table WANT as
select mytext,ind from have,ind  where mytext contains strip(ind);
quit;

If you run these a few times I was seeing a difference of maybe 3 seconds real time and 1.9 user cpu time in favour of the where.

Note, to the OP - You can't use like "%" as any % within double quotes is treated as a macro variable.  For the like to work you need to use single quotes.

Super User
Posts: 5,503

Re: How to reduce system processing time, Index on Specific Variables (on huge data)

At a bare minimum, you can remove two-thirds of your conditions.  The first set (no leading or trailing blanks) will identify all the observations you are looking for.  The other two sets of CONTAINS are just not needed.

Contributor
Posts: 46

Re: How to reduce system processing time, Index on Specific Variables (on huge data)

Hi All,

Thank you very much for response,

1) I have 1,00,00,000 observations on NAME

2) I need to find out How many names having commercial names (commercial Names are not more than 5000)

Ex:- Name

XXXXX ACADEMY     

XXXXXACADEMYXXXXX    

ACADEMYXXXXXXX           ----------- > ACADEMY with no Spaces

XXXXX ACADEMY XXXXXX   ---------- >  ACADEMY with leading and trail blanks

3) we can use like function with "%" (EX:- "%ACADAMY%" or "%ACADAMY") but after running the program, error is showing in log "MAXIMUM MACRO PARAMETERS ARE USED and some MACRO names are not resolved",

4) we can use only CONTAINS function with and with out leading and trail blanks for commercial names.

5) but it is taking 2 to 3 hours for output without INDEX on Name field. (minimum 2 hours with INDEX on name)

How can I reduce the processing time.

Thanks you,

Super User
Super User
Posts: 7,954

Re: How to reduce system processing time, Index on Specific Variables (on huge data)

Check out my post, you cannot use "%" syntax.  SAS is treating the "%ACADEMY%" as a macro variable, so it is trying to find a macro variable name ACADEMY in the macro table and not finding it.  TO use a percentage sign like that, you need to put it within single quotes: '%ACADEMY%'.

However saying that, if you have 5000 commercial names then the like syntax is not going to work as there are limitations on how many you can put in.  Try KSharp's code.

Super User
Posts: 5,503

Re: How to reduce system processing time, Index on Specific Variables (on huge data)

One more thought ... you mention that you already processed the data once, so it looks like you want to reduce the processing time for future runs.  Why not add a field to the original data base, a 0/1 flag to indicate commercial?  Even if that takes a while to run, all your subsequent runs can easily pull the proper records quickly.

Trusted Advisor
Posts: 3,212

Re: How to reduce system processing time, Index on Specific Variables (on huge data)

As it analyzing all data for a lot of strings in the name variable. Use the datastep.
Maybe PRX-string matching is one option for complicated string testing.  Assuming your have 5000 strings that would indicate a commercial name.

You will probably put some people incorrect into that.

This is the description what it does: SAS(R) 9.4 Functions and CALL Routines: Reference, Second Edition

this is the prxmatch function call: SAS(R) 9.4 Functions and CALL Routines: Reference, Second Edition  there are some examples.


I you want us to try some code for you, you have to come up with some more detailed data examples and wanted results.
We are just seeing now one name field with several options for the string and one string detection (academy) upcase/lowcase?

I did a test with Rw9 example. See the log.

The patteren is just a copy of 3 possible values, the next choice is some context driven decision.  

- It could be enhanced with wildcards or others using PRX options.
- It could be enhanced by using a macro generating that string.

  (combined)  Do not know limitiations the prxmatch string. The proceeding document is more helpful as SAS doc.  

Speed processing this one 50M records is no issue.

43         data hv_commerce(keep=name)

44              hv_persons(keep=name) ;

45           set have;

46        

47           if _N_=1 then do;

48               retain PerlExpression;

49               pattern="/academy|agences|company/i";

50               PerlExpression=prxparse(pattern);

51           end;

52        

53           if prxmatch(PerlExpression, name) then output hv_commerce ;

54           else output hv_persons ;

55        

56         run;

NOTE: There were 50000000 observations read from the data set WORK.HAVE.

NOTE: The data set WORK.HV_COMMERCE has 30000000 observations and 1 variables.

NOTE: The data set WORK.HV_PERSONS has 20000000 observations and 1 variables.

NOTE: DATA statement used (Total process time):

       real time           1:23.74

       cpu time            1:19.19

---->-- ja karman --<-----
Ask a Question
Discussion stats
  • 9 replies
  • 298 views
  • 4 likes
  • 5 in conversation