Scoring Text Documents in SAS VDMML using rules created with PROC BOOLRULE


Hello,

I have been using SAS EMTM to parse and filter text documents and then build classifier models with the Text Rule Builder. Once the rules have been created, I can score new incoming text documents to get a probability and a classification for each one.

I have access to SAS Studio 4.2 for SAS Viya 3.2 and have been trying to replicate this process with PROC TEXTMINE and PROC BOOLRULE to create the Boolean rules. That part works as intended, and I am able to create CAS tables for the rules, rule terms, and candidate terms.

Where I am stuck is using PROC TMSCORE and PROC BOOLRULE to classify new text documents. I am able to score the terms of the new documents, but I am not sure how to put those results back together and classify each text document as a whole.

 

libname sk "/sasdata/";

libname mycaslib cas caslib=casuser;

caslib _all_ assign;

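/* Load the training data into CAS only if the table is not already there. */
/* The check uses the same name as the CASOUT= table (ToxTrain).           */
%if not %sysfunc(exist(mycaslib.ToxTrain)) %then %do;

  proc casutil;
    load data=sk.tox_train casout="ToxTrain";
  run;

%end;

/* Fallback: if the table still does not exist, create it with a DATA step. */
%if not %sysfunc(exist(mycaslib.ToxTrain)) %then %do;

  data mycaslib.ToxTrain;
    set sk.tox_train;
  run;

%end;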

ods noproctitle;
libname _tmpcas_ cas;

proc textmine data=MYCASLIB.TOXTRAIN;
	var comment_text;
	doc_id id;
	parse outparent=MYCASLIB.term_by_doc outterms=MYCASLIB.parsed_terms outconfig=MYCASLIB.parseconfig;
run;

proc boolrule data=MYCASLIB.term_by_doc docinfo=MYCASLIB.TOXTRAIN 
		docid=_document_ terminfo=MYCASLIB.parsed_terms termid=_termnum_;
	docinfo id=id targets=(toxic severe_toxic obscene threat insult identity_hate) 
		events=('1' '1' '1' '1' '1' '1');
	terminfo id=key label=term;
	output rules=MYCASLIB.Tox_Rules ruleterms=MYCASLIB.Tox_Rules_Term 
		candidateterms=MYCASLIB.Tox_Candidate_Term;
run;

libname _tmpcas_;

%let _path = /sasdata/psi-viya-worker-toxic/file_dance;
%let _file = /toxic_detection_sas_work;
%let _type = .json;

%macro File_Check();

filename clus "&_path.&_file.&_type";
libname jsonin JSON fileref=clus;

data tox2score;
	set jsonin.data(drop=ordinal:);
run;

%mend File_Check;

%File_Check();

libname mycaslib cas caslib=casuser;

caslib _all_ assign;

%if not %sysfunc(exist(mycaslib.Tox2Score)) %then %do;

  proc casutil;
    load data=work.tox2score casout="Tox2Score";
  run;

%end;

%if not %sysfunc(exist(mycaslib.Tox2Score)) %then %do;
  
  data mycaslib.Tox2Score;
    set work.tox2score;
  run;	 

%end;

proc tmscore
   data       = mycaslib.tox2score
   terms      = mycaslib.parsed_terms
   config     = mycaslib.parseconfig
   outparent  = mycaslib.tox_score_bow;
   doc_id tweet_to_api_request_id;
   var text;
run;

proc boolrule
   data        =   mycaslib.tox_score_bow
   docid       =   _document_
   termid      =   _termnum_;
   score
      ruleterms = mycaslib.tox_rules_term
      outmatch  = mycaslib.tox_match;
run;

 

Perhaps I am missing one final step. I have attached a screenshot of the PROC BOOLRULE result.

[Attachment: Boolrule Results.JPG]

Again, this gives me information on the terms within each document, but what I am trying to do is use it to go back and provide a probability and a classification for the entire text document.
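
To make the goal concrete, here is a minimal sketch of the kind of roll-up I have in mind. It is not a confirmed solution. It assumes that the OUTMATCH= table (mycaslib.tox_match) has one row per matched rule, with a document ID column named _document_ and a target column named _target_; those column names are assumptions and should be checked against the actual table. The output table work.tox_doc_class is a hypothetical name.

```sas
/* Sketch only: collapse rule-level matches to one row per document and target. */
/* Assumed columns in mycaslib.tox_match: _document_ and _target_.               */
proc sql;
   create table work.tox_doc_class as
   select _document_,
          _target_,
          count(*) as n_rules_matched   /* number of rules the document satisfied */
   from mycaslib.tox_match
   group by _document_, _target_
   order by _document_, _target_;
quit;
```

Under this approach, a document would be classified into a target whenever at least one rule for that target matches, and documents that match no rule would not appear in the output at all. This produces a classification only; it does not by itself give a probability.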

 

Any information or comments about this would be appreciated.

Thanks!
