BookmarkSubscribeRSS Feed
sas_lak
Quartz | Level 8

Dear All,

I have millions of recorde in a Dataset, need to get selected Names from Name field using contains or Like function in Proc sql;

Ex:-

Name

XXXXX ACADEMY

YYYY AGENCES YYYY

ZZZZ ZZZZZ ZZZZ

AAA COMPANY

BBBB BBBB

;

Out put should be like

Name

XXXXX ACADEMY

YYYY AGENCES YYYY

AAA COMPANY

For this Quirey am using

proc sql;

create table want as select Name

from have

where Name contains "AGENCES" or

Name contains "COMPANY" or

Name contains "ACADEMY" or

Name contains " AGENCES " or

Name contains " COMPANY " or

Name contains " ACADEMY " or


Name contains " AGENCES" or

Name contains " COMPANY" or

Name contains " ACADEMY"

;

quit;

I Have orizinal Name Field Observation Count is 1,00,00,000

                                 selected Name(Contains) count is 5,000

Instead of Conains I can use "Like" function, but it is showing in log window as Macro name is not resolved for some specific Names:

Ex: - "%X", "%SYS"

Please help me out ,

how can i create Index on Name field and reduce the system processing time.

Thanks in Advance.

9 REPLIES 9
jakarman
Barite | Level 11

forget SQL go for NOSQL.... ; <) 

The reasons:
- Indexing will only work for fixed values not for fuzzy ones.

- You want to process all records not just a subset. The SQL (OLTP) is designed for simple subsets.

How to proceed:

- If your source is a RDBMS check whether it can be solve with some native sql dialect. Using SQL pass-through

  When not all, data must be copied to SAS.  (time/resources needed)

- Using SAS reducing processing speed is checking the system is tuned for your process (bufsize memsize and more)

- Go for sequential processing when more evaluation logic is needed (contains, like, perl expressions) it is data-cleansing.

Using sequential IO is much faster than random hitting more than 20% of the data (to evaluate) sequential will win. The datastep processes sequential, proc sql random. Having more than 1 processor you can think on splitting a big datastep accordingly and merge the results. The Same concept as hadoop gmapreduce.

---->-- ja karman --<-----
Ksharp
Super User

Make a dataset to hold these index value ,and then contains operator .

data ind;

input ind : $40.;

cards;

COMPANY

ACADEMY

AGENCES

COMPANY

;

run;

proc sql;

select * from have,ind  where ACADEMY contains strip(ind) ;

quit;

Xia Keshan

jakarman
Barite | Level 11

Xia please also give an explanation on what will happen when running this.
Always fun to have several approaches to choose from and than evaluate them readability / generic applicable / executing performance

---->-- ja karman --<-----
RW9
Diamond | Level 26 RW9
Diamond | Level 26

I believe, and correct me If I am wrong, he is doing a merge based on of IND to have based on the text being found, hence there would be an additional column in the output.  The merge is only where both exist.  Nice.  I have run a couple of test though and a straight forward where clause does actually seem to run a fair bit faster than the merge approach:

data have;
  do aval=1 to 1000000;
    mytext="XXXXX ACADEMY"; output;
    mytext="YYYY AGENCES YYYY"; output;
    mytext="ZZZZ ZZZZZ ZZZZ"; output;
    mytext="AAA COMPANY"; output;
    mytext="BBBB BBBB"; output;
  end;
run;
data ind;
input ind : $40.;
cards;
COMPANY
ACADEMY
AGENCES
COMPANY
;
run;

proc sql;
create table WANT as
select mytext from have  where mytext like '%COMPANY%' or mytext like '%ACADEMY%' or mytext like '%AGENCES%' or mytext like '%COMPANY%';
quit;

proc sql;
create table WANT as
select mytext,ind from have,ind  where mytext contains strip(ind);
quit;

If you run these a few times I was seeing a difference of maybe 3 seconds real time and 1.9 user cpu time in favour of the where.

Note, to the OP - You can't use like "%" as any % within double quotes is treated as a macro variable.  For the like to work you need to use single quotes.

Astounding
PROC Star

At a bare minimum, you can remove two-thirds of your conditions.  The first set (no leading or trailing blanks) will identify all the observations you are looking for.  The other two sets of CONTAINS are just not needed.

sas_lak
Quartz | Level 8

Hi All,

Thank you very much for response,

1) I have 1,00,00,000 observations on NAME

2) I need to find out How many names having commercial names (commercial Names are not more than 5000)

Ex:- Name

XXXXX ACADEMY     

XXXXXACADEMYXXXXX    

ACADEMYXXXXXXX           ----------- > ACADEMY with no Spaces

XXXXX ACADEMY XXXXXX   ---------- >  ACADEMY with leading and trail blanks

3) we can use like function with "%" (EX:- "%ACADAMY%" or "%ACADAMY") but after running the program, error is showing in log "MAXIMUM MACRO PARAMETERS ARE USED and some MACRO names are not resolved",

4) we can use only CONTAINS function with and with out leading and trail blanks for commercial names.

5) but it is taking 2 to 3 hours for output without INDEX on Name field. (minimum 2 hours with INDEX on name)

How can I reduce the processing time.

Thanks you,

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Check out my post, you cannot use "%" syntax.  SAS is treating the "%ACADEMY%" as a macro variable, so it is trying to find a macro variable name ACADEMY in the macro table and not finding it.  TO use a percentage sign like that, you need to put it within single quotes: '%ACADEMY%'.

However saying that, if you have 5000 commercial names then the like syntax is not going to work as there are limitations on how many you can put in.  Try KSharp's code.

Astounding
PROC Star

One more thought ... you mention that you already processed the data once, so it looks like you want to reduce the processing time for future runs.  Why not add a field to the original data base, a 0/1 flag to indicate commercial?  Even if that takes a while to run, all your subsequent runs can easily pull the proper records quickly.

jakarman
Barite | Level 11

As it analyzing all data for a lot of strings in the name variable. Use the datastep.
Maybe PRX-string matching is one option for complicated string testing.  Assuming your have 5000 strings that would indicate a commercial name.

You will probably put some people incorrect into that.

This is the description what it does: SAS(R) 9.4 Functions and CALL Routines: Reference, Second Edition

this is the prxmatch function call: SAS(R) 9.4 Functions and CALL Routines: Reference, Second Edition  there are some examples.


I you want us to try some code for you, you have to come up with some more detailed data examples and wanted results.
We are just seeing now one name field with several options for the string and one string detection (academy) upcase/lowcase?

I did a test with Rw9 example. See the log.

The patteren is just a copy of 3 possible values, the next choice is some context driven decision.  

- It could be enhanced with wildcards or others using PRX options.
- It could be enhanced by using a macro generating that string.

  (combined)  Do not know limitiations the prxmatch string. The proceeding document is more helpful as SAS doc.  

Speed processing this one 50M records is no issue.

43         data hv_commerce(keep=name)

44              hv_persons(keep=name) ;

45           set have;

46        

47           if _N_=1 then do;

48               retain PerlExpression;

49               pattern="/academy|agences|company/i";

50               PerlExpression=prxparse(pattern);

51           end;

52        

53           if prxmatch(PerlExpression, name) then output hv_commerce ;

54           else output hv_persons ;

55        

56         run;

NOTE: There were 50000000 observations read from the data set WORK.HAVE.

NOTE: The data set WORK.HV_COMMERCE has 30000000 observations and 1 variables.

NOTE: The data set WORK.HV_PERSONS has 20000000 observations and 1 variables.

NOTE: DATA statement used (Total process time):

       real time           1:23.74

       cpu time            1:19.19

---->-- ja karman --<-----

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1076 views
  • 4 likes
  • 5 in conversation