DATA Step, Macro, Functions and more

Find list of words from Table

Reply
Contributor
Posts: 46

Find list of words from Table

Hi All,

In my first table i have list of millions of names

and in my secod table i have hundreds of names like below.

Need to find out from first table where ever names contains from second table with or without lead or trail blanks.

First TableSecond Table
IKON INTERNATIONALCUSTOMERS
JOHN RICE MILL ( PVT ) LTDELECTRONICS
GLOBAL EQUIPMENTSDRUGS
LIFE MEDICAREENTERPIRSES
SELF HELP GROUPINTERNATIONAL
IRFAN DAIRY( PVT ) LTD
NICCO CORPN LTD LC DEVOLVEDLTD
H P INDUSTRIESPVT
FOURTUNE GARMENTS LTDELECTRONIC
J B MARKETINGGROUP
M V AND COMPANYINDUSTRIES
ROYAL AGRO FOODS INDUSTRIES
WEBSOL ENERGY SYSTEM LIMITED
COMPETITIVE  EDGE CONSULTANTS
FOOD SHOPPERET.TR.
P P ELECTRONIC
LOVELY MEDICAL HALL
RASSON ENERGY PVT LTD


Super User
Super User
Posts: 7,942

Re: Find list of words from Table

Well, you have a few options, but millions of rows is going to have an impact.  Basically you need to merge (join hash etc.) the data somehow.  If I was doing it myself on smaller data then:

proc sql;

     create table WANT as

     select     A.*

     from        FIRST A

     left join   SECOND B

     on           A.VAR like B.VAR

     where    B.VAR is not null;

quit;

However, proc sql isn't the best on resources, so maybe a datastep (or someone can provide a hash):

data _null_;

     set second end=last;

     if _n_=1 then call execute('data want; set first; if var contains "'||strip(var)||'"');

     else do call execute(' or var contains "'||strip(var)||'"');

     if last then call execute(' then output; run;');

run;

This will generate a datastep with the list of values in an if statement.

Super User
Posts: 10,018

Re: Find list of words from Table

CODE NOT TESTED.

proc sql;

create table want as

select *

  from first as a,second as b

   where a.name contains strip(b.name);

quit;

Xia Keshan

Contributor
Posts: 46

Re: Find list of words from Table

Thank you,

Actually here my requirement is

1. Name does not contain numbers, Selected Special characters.

2. Does not contain S/O, D/O, W/O and C/O .....

3. Does not Single Token

4. At least one token having minimum 3 characters

5. Does not contain selected names (as mentioned above in Table 2)

Here I need to run the 5 points in a single step and create invalid name dataset.

Developed the code as per above requirement but i have confusion to get selected names.

In my code for selected names I wrote like this.

NAME Contains "ACADEMY" or

NAME Contains "ACCCURED" or

NAME Contains "ACCESSORIES" or

NAME Contains "ACCOUNT" or

.

.

.

NAME Contains "ZINC"

When I am running the same on millions of records it is taking huge time and systems are hanging.

Thanks in Advance….!

Super User
Posts: 3,250

Re: Find list of words from Table

How many selected names do you have? You could do want you want without table joins by either format lookup or hash lookup as long as the list of selected names isn't as huge.

Contributor
Posts: 46

Re: Find list of words from Table

In my second table (selected names) 2K + and in my first table more than 5 million records(this is depending on cycle wise)

here need to find out how many selected names are available from first table.

Super User
Posts: 3,250

Re: Find list of words from Table

OK 2K isn't so big so I would apply all of the exclusions first, in order of the most likely to occur as this will speed things up.

I suggest not writing out the exclusions to a separate file, at least not in the same run, to save time.

Then as already mentioned I would do either a format or hash look up on NAME to find occurrences of the required words.

Try fine-tuning this on a sample of your data until you are happy with the performance, then try it over the full volume.

Super User
Posts: 7,758

Re: Find list of words from Table

Well, a complex where condition executed over a large dataset will always take time and consume resources. If not, you could do it by hand using pen&paper. Just kidding.

The only thing you can optimize is to reduce the amount of work you have to do. 's solution is a nice way to go about it.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Contributor
Posts: 46

Re: Find list of words from Table

When I used

NAME Contains " ACADEMY " or

it is ignoring the Leading and trail blanks

For Ex:-

1. "CRICKETACADEMY"

2."CRICKETACADEMY UK"

3."CRICKETACADEMYUK"

4."CRICKET ACADEMYUK"

5."CRICKET ACADEMY UK"

after running the code am getting 5. "CRICKET ACADEMY UK" only

 

Insted of this

NAME Contains " ACADEMY "

we have four possible patterns like this (with lead and, or trail blanks)

NAME Contains "ACADEMY" or

NAME Contains " ACADEMY" or

NAME Contains " ACADEMY " or

NAME Contains "ACADEMY "


but it is taking number of iterations.


Thank you.


Ask a Question
Discussion stats
  • 8 replies
  • 368 views
  • 3 likes
  • 5 in conversation