Extracting Numbers Embedded in Text Using a TGIF Rule in SAS Content Cat

Hello,

    I have a question about using numbers within a Concept rules profile. I need to find out whether dates of birth are included within documents. To match only dates that are "dates of birth," I wrote TGIF rules stipulating that a number from 1 to 31 has to be within 10 words of the term "date of birth" or its equivalents (e.g. DOB). Examples are below:

Date of birth:,__TGIF:{(DIST_10,"Date of birth:","1")}:Date of birth:

Date of birth:,__TGIF:{(DIST_10,"Date of birth:","2")}:Date of birth:

Date of birth:,__TGIF:{(DIST_10,"Date of birth:","3")}:Date of birth:

Date of birth:,__TGIF:{(DIST_10,"Date of birth:","4")}:Date of birth:

Date of birth:,__TGIF:{(DIST_10,"Date of birth:","5")}:Date of birth:

Date of birth:,__TGIF:{(DIST_10,"Date of birth:","6")}:Date of birth:

Date of birth:,__TGIF:{(DIST_10,"Date of birth:","7")}:Date of birth:

Date of birth:,__TGIF:{(DIST_10,"Date of birth:","8")}:Date of birth:

Date of birth:,__TGIF:{(DIST_10,"Date of birth:","9")}:Date of birth:

     The rule set in SAS Content Cat also includes the same rules as above for the numbers "01" through "31."
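To make the intended behavior concrete, here is a rough Python sketch of the proximity check I expect these rules to perform. It is only an illustration for testing expectations outside the tool, not how SAS Content Cat evaluates DIST_10. The function name, the tokenization, and the way the 10-word window is counted are all my own assumptions.

```python
import re

def has_day_number_near_phrase(text, phrase="date of birth", max_distance=10):
    """Return True if a number from 1 to 31 occurs within max_distance
    tokens of the phrase. Hypothetical helper, not a SAS function."""
    # Assumption: tokens are runs of letters or runs of digits, lowercased.
    tokens = re.findall(r"[A-Za-z]+|\d+", text.lower())
    phrase_tokens = phrase.lower().split()
    n = len(phrase_tokens)

    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] != phrase_tokens:
            continue
        # Look max_distance tokens to either side of the matched phrase.
        start = max(0, i - max_distance)
        end = min(len(tokens), i + n + max_distance)
        for tok in tokens[start:end]:
            # "9" and "09" both count as day 9; "2014" is out of range.
            if tok.isdigit() and 1 <= int(tok) <= 31:
                return True
    return False


if __name__ == "__main__":
    print(has_day_number_near_phrase("Date of birth: February 9, 2014"))
    print(has_day_number_near_phrase("Date of birth: not provided"))
```

Under these assumptions, the first call finds the "9" next to the label and the second finds nothing, which is the behavior I am trying to reproduce with the TGIF rules.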

      From what I have read in the manual, this should match any date of birth that is labeled as such, that is, any date containing a calendar day (e.g. "February 9, 2014") within 10 words of the label. However, I cannot get the output to recognize the label and the number together. Project settings are as follows:

      Tokenize Classifier Terms is checked

      Optimize for: "Compile Speed"

      Overlapping Concept Matches: "Longest"

      Default Relevancy Cutoff: ---

      Default Classifier Matching: "Case Insensitive"

      Relevancy Type: "Frequency-Based"

        It could be that I am missing something in the TGIF rule syntax itself, as the following rules also fail to match the sample text given below them.

Rules:

Curriculum vitae1,__TGIF:{(DIST_100,"Curriculum Vitae","Education")}:Curriculum vitae

Curriculum vitae2,__TGIF:{(DIST_100,"Curriculum Vitae","Employment")}:Curriculum vitae

Curriculum vitae3,__TGIF:{(DIST_100,"Curriculum Vitae","Experience")}:Curriculum vitae

Sample Text:

1515 Stanley Drive #62
Hometown, KS 66222

perry.jameson@dbplanet.com

Curriculum Vitae

Outstanding student with experience in print and online journalism seeks a position working with a communications, public relations, or publishing firm where I can use my writing, editing, and organizational skills.

EDUCATION

The University of Missouri at Kansas City   2008-2012

B.A. Communication Studies with a concentration in Corporate Communications with a minor in Sociology
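As a sanity check on the distances, this is a small Python sketch that counts how many tokens separate two terms in a text. It is not SAS code; the helper name and the choice to measure from the first token of each term are assumptions made for illustration, and the tool may count distance differently.

```python
import re

def min_token_distance(text, term_a, term_b):
    """Return the smallest token distance between the start positions of
    term_a and term_b (case-insensitive), or None if either is absent.
    Hypothetical helper for checking expectations, not a SAS function."""
    tokens = re.findall(r"\w+", text.lower())

    def start_positions(term):
        words = term.lower().split()
        n = len(words)
        return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]

    pos_a = start_positions(term_a)
    pos_b = start_positions(term_b)
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)


if __name__ == "__main__":
    sample = "Curriculum Vitae Outstanding student EDUCATION"
    print(min_token_distance(sample, "Curriculum Vitae", "Education"))
```

By this kind of count, "Curriculum Vitae" and "EDUCATION" in the sample text above are well within 100 tokens of each other, so the distance itself does not appear to be the reason the rules fail.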

Any suggestions would be most welcome as I have tried everything that I can think of to get the concept profile to output accurately.
