11-14-2016 05:01 PM
11-15-2016 01:21 PM
Some things to try:
1. Try the Text Profile node using the job group as the target variable.
2. Try the Text Topic node. Be sure to build a good stop list to remove terms that might strongly influence a topic but are not relevant to your goal.
3. There are any number of things you can try. One straightforward approach is to create topics or clusters on one set and then score the other set to see which documents from the second set are relevant to the first and which are not. Even better, after you have investigated both sets, refine your topics and turn them into user topics (rather than the multiterm topics) so that you can really control what you're looking for. For instance, you can define your own subtopics for various aspects of financial analysis, with a weighted list of terms for each one, and score every description and training document you have against those topics.
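To make the weighted-term idea in point 3 concrete, here is a rough sketch in plain Python rather than Text Miner. The topic names, terms, weights, and the 0.1 cutoff are all made up for illustration, and the scoring formula (weighted term counts divided by document length) is only an assumption, not how the Text Topic node computes its scores.

```python
import re
from collections import Counter

# Hypothetical user topics: each maps a term to an analyst-chosen weight.
USER_TOPICS = {
    "valuation": {"valuation": 2.0, "dcf": 2.0, "cash": 1.0, "forecast": 1.0},
    "reporting": {"reporting": 2.0, "gaap": 2.0, "audit": 1.5, "statements": 1.0},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_document(text, topics=USER_TOPICS):
    """Return one score per topic: weighted term hits divided by token count."""
    counts = Counter(tokenize(text))
    n_tokens = max(sum(counts.values()), 1)  # avoid dividing by zero on empty text
    return {
        name: sum(weight * counts[term] for term, weight in terms.items()) / n_tokens
        for name, terms in topics.items()
    }


def relevant_topics(text, cutoff=0.1, topics=USER_TOPICS):
    """List the topics whose score reaches the (assumed) cutoff."""
    scores = score_document(text, topics)
    return sorted(name for name, score in scores.items() if score >= cutoff)
```

Calling `relevant_topics` on every job description and every program document gives you a comparable set of topic assignments for both sets, which is the same idea as scoring one set against topics built on the other.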
11-15-2016 02:29 PM
This is very helpful. Now at least I have an idea of where to focus my effort at figuring out what to use in Text Miner. I'm sure I'll have more questions, but this is enough to get me started for now. Thanks!
11-17-2016 11:28 PM
11-18-2016 09:36 AM
For a hierarchy, you typically build a different model for each split of your hierarchy. You need enough data available at each level as you work your way down the tree, so that may not be feasible. The Text Rule Builder node should be useful if you're building a predictive model for this kind of hierarchy.
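As a rough illustration of "a different model for each split", here is a small Python sketch, not Text Miner code. `TermCountModel` is a toy stand-in I made up for whatever classifier you would really train at each split, and the function names and the tuple-based category paths are assumptions for the example.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class TermCountModel:
    """Toy classifier: picks the label whose training terms best match the text."""

    def __init__(self):
        self.term_counts = defaultdict(Counter)

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.term_counts[label].update(tokenize(text))
        return self

    def predict(self, text):
        tokens = tokenize(text)

        def score(label):
            counts = self.term_counts[label]
            total = sum(counts.values()) or 1  # guard against empty training text
            return sum(counts[t] for t in tokens) / total

        # Sorting first makes ties resolve to the alphabetically first label.
        return max(sorted(self.term_counts), key=score)


def train_hierarchy(examples):
    """Train one model per split.

    examples: list of (text, path), where path is a tuple such as
    ("finance", "accounting"). The model stored under a path prefix
    chooses among the children of that prefix.
    """
    grouped = defaultdict(lambda: ([], []))
    for text, path in examples:
        for depth in range(len(path)):
            texts, labels = grouped[path[:depth]]
            texts.append(text)
            labels.append(path[depth])
    return {prefix: TermCountModel().fit(t, l) for prefix, (t, l) in grouped.items()}


def predict_path(models, text):
    """Walk down the tree, applying the model for each split in turn."""
    path = ()
    while path in models:
        path += (models[path].predict(text),)
    return path
```

Note how each lower-level model only sees the training documents that belong under its branch, which is why the data requirement grows quickly as the tree gets deeper.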
11-18-2016 10:48 AM
11-21-2016 03:38 PM
If you're building a predictive model, you need training data, ideally hundreds or more job descriptions. You then score your 2 new college programs with the model you built.
If you do not have training data, try building user-defined topics based on your domain knowledge and use the topic assignment as your classification. SAS also has a product called Content Categorization that is explicitly designed for this.
Sorry, no immediate papers on "hierarchical classification", but you can google that phrase to find the challenges of it and the approaches people use.