BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Shayan2012
Quartz | Level 8

Hello All;

I am trying to search some names in a url. I am using the index function to find the word, for example:

if index(line1,'GEORGE') ne 0 then do;

countG = countG + 1 ;

end;

My problem is that, I want to choose the names from a list. So, I want SAS to read the first name from a local data set, put it as the argument in the index function and search for it, and then goes for the second name from the list, and so on. The list is huge so I cannot list them manually.

I am not very knowledgeable in SAS, so I would be grateful if help me how to write this part.

Thank you very much!

1 ACCEPTED SOLUTION

Accepted Solutions
Scott_Mitchell
Quartz | Level 8

Hi Shayan2012,

Welcome to SAS.  As you are new to SAS I have broken down how the code works within the comments, however if there is something you don't understand please let me know.  Loading a dataset into a hash object is limited by the amount of memory you have available.  You haven't mentioned in your post exactly how large the list is, so run the below code and see how you progress.

/*SET UP TEST DATA*/

DATA HAVE;

LENGTH LINE1 $200;

INFILE DATALINES;

INPUT LINE1 $;

DATALINES;

HHHHHHHHHHHHGEORGEGARRY

5646546465HARRYGARRYBARRY

;

RUN;

/*SET UP LIST OF VALUES WE WISH TO LOOK FOR*/

DATA LIST;

LENGTH NAMESLIST $7;

NAMESLIST = "HARRY";

OUTPUT;

NAMESLIST = "BARRY";

OUTPUT;

NAMESLIST = "GEORGE";

OUTPUT;

NAMESLIST = "GARRY";

OUTPUT;

RUN;

DATA LOOKUP;

/*LOAD LIST DATASET INTO A HASH TABLE AND DECLARE THE ITERATOR*/

IF _N_ = 1 THEN DO;

LENGTH NAMESLIST $6;

DECLARE HASH HA(DATASET:"LIST");

DECLARE HITER HI("HA");

HA.DEFINEKEY("NAMESLIST");

HA.DEFINEDATA("NAMESLIST");

HA.DEFINEDONE();

CALL MISSING(NAMESLIST);

END;

/*READ IN HAVE DATASET AS NORMAL*/

SET HAVE;

/*READ IN THE FIRST RECORD IN HASH TABLE*/

RC = HI.FIRST();

/*CREATE AN ARRAY CALLED _COUNT CONTAIN VARIABLES COUNTA TO COUNTZ*/

ARRAY _COUNT {*} COUNTA COUNTB COUNTC COUNTD COUNTE COUNTF COUNTG COUNTH COUNTI

     COUNTJ COUNTK COUNTL COUNTM COUNTN COUNTO COUNTP COUNTQ COUNTR

     COUNTS COUNTT COUNTU COUNTV COUNTW COUNTX COUNTY COUNTZ;

CALL MISSING(COUNTA,COUNTB,COUNTC,COUNTD,COUNTE,COUNTF,COUNTG,COUNTH,COUNTI,

    COUNTJ,COUNTK,COUNTL,COUNTM,COUNTN,COUNTO,COUNTP,COUNTQ,COUNTR,

    COUNTS,COUNTT,COUNTU,COUNTV,COUNTW,COUNTX,COUNTY,COUNTZ);

DO WHILE (RC = 0);

  /*IF THE NAME IN THE HASH TABLE RECORD APPEARS IN THE STRING FROM THE HAVE DATASET THEN*/

  IF INDEX(LINE1,TRIM(NAMESLIST)) ~= 0 THEN DO;

  /*SET NUMBER OF ITERATIONS REQUIRED TO MOVE THROUGH ALL VARIABLES WITHIN THE ARRAY*/

  DO I = 1 TO 26;

  /*IF THE LAST VALUE IN THE VARIABLE ARRAY APPEARS AS THE FIRST CHARACTER OF THE VALUE IN THE NAMELIST VARIABLE THEN */

  IF TRIM(SUBSTR(VNAME(_COUNT{I}),6,1)) = TRIM(SUBSTR(NAMESLIST,1,1)) THEN DO;

  /*ADD 1 TO THE RELEVANT ARRAY ELEMENT*/

  _COUNT{I} = SUM(_COUNT{I},1);

  /*ONCE THE NAME IS FOUND THEN LEAVE THE DO LOOP*/

LEAVE;

  END;

  END;

  END;

  /*READ IN THE NEXT RECORD FROM THE HASH TABLE*/

  RC = HI.NEXT();

END;

DROP I;

RUN;

View solution in original post

9 REPLIES 9
Kurt_Bremser
Super User

Try this and see if it meets some of your requirements:

data lookup; * this is your list with names;

length name $20;

input name;

datalines;

George

Harry

John

;

run;

data have; * this is the dataset having the field you want to explore;

infile cards truncover;

input line1 $100.;

datalines;

Some text that contains George

Some text that contains Harry

Some text that contains nothing

;

run;

data _null_; * build a piece of code dynamically;

set lookup end=last;

length commandstring $100;

if _N_ = 1 then call execute("data want;set have;");

commandstring = "if index(line1,'"!!strip(name)!!"') ne 0 then count_"!!strip(name)!!" = sum(count_"!!strip(name)!!",1);";

call execute(commandstring);

if last then call execute("run;");

run;

Scott_Mitchell
Quartz | Level 8

Hi Shayan2012,

Welcome to SAS.  As you are new to SAS I have broken down how the code works within the comments, however if there is something you don't understand please let me know.  Loading a dataset into a hash object is limited by the amount of memory you have available.  You haven't mentioned in your post exactly how large the list is, so run the below code and see how you progress.

/*SET UP TEST DATA*/

DATA HAVE;

LENGTH LINE1 $200;

INFILE DATALINES;

INPUT LINE1 $;

DATALINES;

HHHHHHHHHHHHGEORGEGARRY

5646546465HARRYGARRYBARRY

;

RUN;

/*SET UP LIST OF VALUES WE WISH TO LOOK FOR*/

DATA LIST;

LENGTH NAMESLIST $7;

NAMESLIST = "HARRY";

OUTPUT;

NAMESLIST = "BARRY";

OUTPUT;

NAMESLIST = "GEORGE";

OUTPUT;

NAMESLIST = "GARRY";

OUTPUT;

RUN;

DATA LOOKUP;

/*LOAD LIST DATASET INTO A HASH TABLE AND DECLARE THE ITERATOR*/

IF _N_ = 1 THEN DO;

LENGTH NAMESLIST $6;

DECLARE HASH HA(DATASET:"LIST");

DECLARE HITER HI("HA");

HA.DEFINEKEY("NAMESLIST");

HA.DEFINEDATA("NAMESLIST");

HA.DEFINEDONE();

CALL MISSING(NAMESLIST);

END;

/*READ IN HAVE DATASET AS NORMAL*/

SET HAVE;

/*READ IN THE FIRST RECORD IN HASH TABLE*/

RC = HI.FIRST();

/*CREATE AN ARRAY CALLED _COUNT CONTAIN VARIABLES COUNTA TO COUNTZ*/

ARRAY _COUNT {*} COUNTA COUNTB COUNTC COUNTD COUNTE COUNTF COUNTG COUNTH COUNTI

     COUNTJ COUNTK COUNTL COUNTM COUNTN COUNTO COUNTP COUNTQ COUNTR

     COUNTS COUNTT COUNTU COUNTV COUNTW COUNTX COUNTY COUNTZ;

CALL MISSING(COUNTA,COUNTB,COUNTC,COUNTD,COUNTE,COUNTF,COUNTG,COUNTH,COUNTI,

    COUNTJ,COUNTK,COUNTL,COUNTM,COUNTN,COUNTO,COUNTP,COUNTQ,COUNTR,

    COUNTS,COUNTT,COUNTU,COUNTV,COUNTW,COUNTX,COUNTY,COUNTZ);

DO WHILE (RC = 0);

  /*IF THE NAME IN THE HASH TABLE RECORD APPEARS IN THE STRING FROM THE HAVE DATASET THEN*/

  IF INDEX(LINE1,TRIM(NAMESLIST)) ~= 0 THEN DO;

  /*SET NUMBER OF ITERATIONS REQUIRED TO MOVE THROUGH ALL VARIABLES WITHIN THE ARRAY*/

  DO I = 1 TO 26;

  /*IF THE LAST VALUE IN THE VARIABLE ARRAY APPEARS AS THE FIRST CHARACTER OF THE VALUE IN THE NAMELIST VARIABLE THEN */

  IF TRIM(SUBSTR(VNAME(_COUNT{I}),6,1)) = TRIM(SUBSTR(NAMESLIST,1,1)) THEN DO;

  /*ADD 1 TO THE RELEVANT ARRAY ELEMENT*/

  _COUNT{I} = SUM(_COUNT{I},1);

  /*ONCE THE NAME IS FOUND THEN LEAVE THE DO LOOP*/

LEAVE;

  END;

  END;

  END;

  /*READ IN THE NEXT RECORD FROM THE HASH TABLE*/

  RC = HI.NEXT();

END;

DROP I;

RUN;

Shayan2012
Quartz | Level 8

Thanks a lot, Scott_Mitchell!

I learnt many new things from your code, but I am not sure if I fully understood how it works in some parts.

As far as I figured, you first construct a hash table of names from the list dataset, and then match these names in the line1 varibale from have dataset. However, I am not sure what is the purpose of the " DO WHILE" loop that comes afterwards, and I am not sure what is this condition testing for:

  IF TRIM(SUBSTR(VNAME(_COUNT{I}),6,1)) = TRIM(SUBSTR(NAMESLIST,1,1)) THEN DO;


And, finally, I ran the exact code, and this is the final table that I am getting:

( I have omitted the other COUNT ARRAYS which are all missing)


NAMESLISTLINE1RCcountBcountGcountH

GARRY

HHHHHHHHHHHHGEORGEGARRY160038.2.
GARRY5646546465HARRYGARRYBARRY160038111


So, can you please desribe what does this RC and these counters are telling me? and why we can not see GEORGE in the nameslist?


Thank you so so much,

shayan









Scott_Mitchell
Quartz | Level 8

Hi Shayan,

I am more than happy to explain the code.

Let's look at the DO WHILE loop first:

You are right that we are loading the dataset NAMESLIST into a hash table, the lookup key being NAMESLIST also, because that is what I named the the variable (probably a little confusing for a new user in hindsight).  We also declare a hash iterator which allows us to iterate through the values contained within the hash table.  This is why we have the do while loop and the RC record control variables.

HI.FIRST() moves the iterator to the top of the hash table, we do this for each record read in by the SET HAVE statement.  If it is successful in doing so then RC = 0 else RC not = 0.  If RC = 0 then the do loop will process the code for the first time with NAMESLIST containing the first value in the hash table throughout eventually reaching RC = HI.NEXT().  RC = HI.NEXT() tells the iterator to read in the next record.  If successful RC = 0 else RC not = 0.  If RC = 0 the next record is read in and the do loop is processed again. If RC not = 0 then the end of the hash table has been reached then exit the do loop as DO WHILE (RC = 0) is false.  The next record from HAVE is read in and we go through the above process again.

Now let's explore the IF TRIM(SUBSTR(VNAME(_COUNT{I}),6,1)) = TRIM(SUBSTR(NAMESLIST,1,1)) THEN DO syntax.


_COUNT is the name of the ARRAY created earlier in the code by the ARRAY statement.  _COUNT{I} means go to element I (which is the value from the created by the DO I = 1 TO 26 loop).  Typically we would want the value contained within _COUNT{I}, however by using the VNAME function we obtain the variable name associated with the relevant iteration of _COUNT{I}.  So if I = 1 then VNAME( _COUNT{I}) = COUNTA, if I = 2 then VNAME(_COUNT{I}) = COUNTB and so on. We then use SUBSTR to extract the last character of the variable name and TRIM the result to remove any trailing blanks.  TRIM(SUBSTR(NAMESLIST,1,1)) simply obtains the first character from the NAMESLIST variable from the current iteration through the hash table and trims the result. If the two values are equal then we add 1 to the count of that variable.


To answer you question regarding why you can't see GEORGE in NAMELIST.  Every time you iterate through the values in hash table the value in NAMESLIST is overwritten.  You can see that both GEORGE and GARRY have been located in the first OBS of the HAVE table because COUNTG = 2.  If you don't believe me that GEORGE is being evaluated simply add PUT _N_= NAMESLIST=; immediately after DO WHILE (RC = 0); and you will see what NAMESLIST contained for each record of HAVE in the log.


I hope that this helps you better understand my code.

Shayan2012
Quartz | Level 8

Thank you very much for the comprehensive description, Scott_mitchell. You're awesome!

Tom
Super User Tom
Super User

You might be able to use PROC SQL to do this.

* this is your list with names;

data lookup;

  length name $20;

  input name;

cards;

George

Harry

John

Sam

;

run;

* this is the dataset having the field you want to explore;

data have;

  input line1 $80.;

cards;

Some text that contains George

Some text that contains Harry

None of the names

Two of the names George, and JOHN.

Some text that contains nothing

;

run;

proc sql noprint ;

  create table want as

    select a.name,count(line1) as count

    from lookup a left join have b

    on index(upcase(b.line1),upcase(trim(a.name)))

    group by 1

    order by 1

  ;

run;

Shayan2012
Quartz | Level 8

Thank you very much Tom,

I think I understand your code, but one thing that I am trying to do is to extract the line that actually contains the name. So, is it a way to modify the code such that the WANT table also includes the line that the name e.g. GEORGE was find in?

Tom
Super User Tom
Super User

I thought the point was to count them?

SAS has a nice extension to normal SQL syntax in that it allows you to keep both the aggregate functions like COUNT() and the detail rows. It will happily populate all of the rows for the group with the COUNT() for that group.


proc sql noprint ;

  create table want as

    select a.name,count(b.line1) as count, b.line1

    from lookup a left join have b

    on index(upcase(b.line1),upcase(trim(a.name)))

    group by 1

    order by 1

  ;

run;


Normally with SQL queries you have to remerge the counts yourself.

stat_sas
Ammonite | Level 13

data lookup;
  length name $20;
  input name;
cards;
George
Harry
John
Sam
;
run;

data have;
input line1 $80.;
cards;
Some text that contains George
Some text that contains Harry
None of the names
Two of the names George, and JOHN.
Some text that contains nothing
;
run;

proc sql;
create table have as
select * from lookup,have
order by name;
quit;

data want(keep=name CountG);
set have;
by name;
if first.name then CountG=0;
if index(upcase(line1),upcase(strip(name)))>0 then CountG+1;
if last.name;
run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1002 views
  • 12 likes
  • 5 in conversation