08-05-2014 01:50 AM
I have millions of recorde in a Dataset, need to get selected Names from Name field using contains or Like function in Proc sql;
YYYY AGENCES YYYY
ZZZZ ZZZZZ ZZZZ
Out put should be like
YYYY AGENCES YYYY
For this Quirey am using
create table want as select Name
where Name contains "AGENCES" or
Name contains "COMPANY" or
Name contains "ACADEMY" or
Name contains " AGENCES " or
Name contains " COMPANY " or
Name contains " ACADEMY " or
Name contains " AGENCES" or
Name contains " COMPANY" or
Name contains " ACADEMY"
I Have orizinal Name Field Observation Count is 1,00,00,000
selected Name(Contains) count is 5,000
Instead of Conains I can use "Like" function, but it is showing in log window as Macro name is not resolved for some specific Names:
Ex: - "%X", "%SYS"
Please help me out ,
how can i create Index on Name field and reduce the system processing time.
Thanks in Advance.
08-05-2014 03:42 AM
forget SQL go for NOSQL.... ; <)
- Indexing will only work for fixed values not for fuzzy ones.
- You want to process all records not just a subset. The SQL (OLTP) is designed for simple subsets.
How to proceed:
- If your source is a RDBMS check whether it can be solve with some native sql dialect. Using SQL pass-through
When not all, data must be copied to SAS. (time/resources needed)
- Using SAS reducing processing speed is checking the system is tuned for your process (bufsize memsize and more)
- Go for sequential processing when more evaluation logic is needed (contains, like, perl expressions) it is data-cleansing.
Using sequential IO is much faster than random hitting more than 20% of the data (to evaluate) sequential will win. The datastep processes sequential, proc sql random. Having more than 1 processor you can think on splitting a big datastep accordingly and merge the results. The Same concept as hadoop gmapreduce.
08-05-2014 08:25 AM
Make a dataset to hold these index value ,and then contains operator .
input ind : $40.;
select * from have,ind where ACADEMY contains strip(ind) ;
08-05-2014 08:45 AM
Xia please also give an explanation on what will happen when running this.
Always fun to have several approaches to choose from and than evaluate them readability / generic applicable / executing performance
08-05-2014 10:22 AM
I believe, and correct me If I am wrong, he is doing a merge based on of IND to have based on the text being found, hence there would be an additional column in the output. The merge is only where both exist. Nice. I have run a couple of test though and a straight forward where clause does actually seem to run a fair bit faster than the merge approach:
do aval=1 to 1000000;
mytext="XXXXX ACADEMY"; output;
mytext="YYYY AGENCES YYYY"; output;
mytext="ZZZZ ZZZZZ ZZZZ"; output;
mytext="AAA COMPANY"; output;
mytext="BBBB BBBB"; output;
input ind : $40.;
create table WANT as
select mytext from have where mytext like '%COMPANY%' or mytext like '%ACADEMY%' or mytext like '%AGENCES%' or mytext like '%COMPANY%';
create table WANT as
select mytext,ind from have,ind where mytext contains strip(ind);
If you run these a few times I was seeing a difference of maybe 3 seconds real time and 1.9 user cpu time in favour of the where.
Note, to the OP - You can't use like "%" as any % within double quotes is treated as a macro variable. For the like to work you need to use single quotes.
08-05-2014 09:41 AM
At a bare minimum, you can remove two-thirds of your conditions. The first set (no leading or trailing blanks) will identify all the observations you are looking for. The other two sets of CONTAINS are just not needed.
08-05-2014 10:56 AM
Thank you very much for response,
1) I have 1,00,00,000 observations on NAME
2) I need to find out How many names having commercial names (commercial Names are not more than 5000)
ACADEMYXXXXXXX ----------- > ACADEMY with no Spaces
XXXXX ACADEMY XXXXXX ---------- > ACADEMY with leading and trail blanks
3) we can use like function with "%" (EX:- "%ACADAMY%" or "%ACADAMY") but after running the program, error is showing in log "MAXIMUM MACRO PARAMETERS ARE USED and some MACRO names are not resolved",
4) we can use only CONTAINS function with and with out leading and trail blanks for commercial names.
5) but it is taking 2 to 3 hours for output without INDEX on Name field. (minimum 2 hours with INDEX on name)
How can I reduce the processing time.
08-05-2014 11:09 AM
Check out my post, you cannot use "%" syntax. SAS is treating the "%ACADEMY%" as a macro variable, so it is trying to find a macro variable name ACADEMY in the macro table and not finding it. TO use a percentage sign like that, you need to put it within single quotes: '%ACADEMY%'.
However saying that, if you have 5000 commercial names then the like syntax is not going to work as there are limitations on how many you can put in. Try KSharp's code.
08-05-2014 11:20 AM
One more thought ... you mention that you already processed the data once, so it looks like you want to reduce the processing time for future runs. Why not add a field to the original data base, a 0/1 flag to indicate commercial? Even if that takes a while to run, all your subsequent runs can easily pull the proper records quickly.
08-05-2014 11:21 AM
As it analyzing all data for a lot of strings in the name variable. Use the datastep.
Maybe PRX-string matching is one option for complicated string testing. Assuming your have 5000 strings that would indicate a commercial name.
You will probably put some people incorrect into that.
This is the description what it does: SAS(R) 9.4 Functions and CALL Routines: Reference, Second Edition
this is the prxmatch function call: SAS(R) 9.4 Functions and CALL Routines: Reference, Second Edition there are some examples.
I you want us to try some code for you, you have to come up with some more detailed data examples and wanted results.
We are just seeing now one name field with several options for the string and one string detection (academy) upcase/lowcase?
I did a test with Rw9 example. See the log.
The patteren is just a copy of 3 possible values, the next choice is some context driven decision.
- It could be enhanced with wildcards or others using PRX options.
- It could be enhanced by using a macro generating that string.
(combined) Do not know limitiations the prxmatch string. The proceeding document is more helpful as SAS doc.
Speed processing this one 50M records is no issue.
43 data hv_commerce(keep=name)
44 hv_persons(keep=name) ;
45 set have;
47 if _N_=1 then do;
48 retain PerlExpression;
53 if prxmatch(PerlExpression, name) then output hv_commerce ;
54 else output hv_persons ;
NOTE: There were 50000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.HV_COMMERCE has 30000000 observations and 1 variables.
NOTE: The data set WORK.HV_PERSONS has 20000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 1:23.74
cpu time 1:19.19