Scoring and Classifying New Data

ChrisFromMaryland · Posted 01-12-2015 01:39 PM

Is there a way to score new data against an existing cluster set and to identify new clusters that are in that new data?

Perhaps an example would be best. Let's say there is an event coming up about the environment. I have a dataset of 10,00 online news articles over the last month that talk about the environment. Some are off-topic, like computer environments or school environments, and others are on-topic, like carbon emissions or sea water rising. So far, so good.

Now the event is held, a big international event. Lots of press coverage. My task is to see how the on-topic conversations changed in volume and how long that change lasted.

So let's say that after the event is held I download a new set of online news articles, say 20,000 records this time. I want to do two things. First, I want to score the new data against the rules that were built in the pre-event processing. Think of it as an apples-to-apples analysis: using the same rules, the conversations on carbon emissions grew by X percent and lasted Y days; the conversations about sea water rising grew by A percent and lasted B days; the conversations on computer environments and school environments did not change. Second, and this to me is the tough part, I want to uncover new topics (clusters) that may arise. So let's say that after the event a discussion about solar energy emerges that was not in the discussions prior to the event. (I know this sounds weird, but it happens to be true because I've already done the pre-event analysis.) How do I identify these new clusters that did not exist in the existing clustering routine?
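To make the two-step idea concrete, here is a minimal sketch in plain Python. It is not SAS Text Miner code and does not reflect how its nodes work internally. It assumes each existing cluster can be summarized by a bag-of-words centroid, scores each new article by cosine similarity against those fixed centroids, and sets aside anything below a cutoff as a candidate for a new topic. The function names, the toy articles, and the `min_similarity` value of 0.2 are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    """Turn text into a sparse bag-of-words term-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(texts):
    """Summarize a cluster as the summed term counts of its documents."""
    total = Counter()
    for t in texts:
        total.update(vectorize(t))
    return total

def score(new_docs, centroids, min_similarity=0.2):
    """Assign each new document to its most similar existing cluster.

    Documents whose best similarity is below min_similarity (an
    arbitrary, hypothetical cutoff) are returned separately so they
    can be clustered again to look for new topics.
    """
    assigned = {name: [] for name in centroids}
    unassigned = []
    for doc in new_docs:
        vec = vectorize(doc)
        best_name, best_sim = max(
            ((name, cosine(vec, c)) for name, c in centroids.items()),
            key=lambda pair: pair[1],
        )
        if best_sim >= min_similarity:
            assigned[best_name].append(doc)
        else:
            unassigned.append(doc)
    return assigned, unassigned

# Toy pre-event clusters (hypothetical data).
pre_event = {
    "carbon emissions": ["carbon emissions rise in cities",
                         "new limits on carbon emissions"],
    "sea water rising": ["sea water rising along the coast",
                         "rising sea water threatens islands"],
}
centroids = {name: centroid(docs) for name, docs in pre_event.items()}

# Toy post-event articles (hypothetical data).
post_event = ["carbon emissions targets agreed",
              "solar energy panels get cheaper",
              "sea water rising faster"]

assigned, unassigned = score(post_event, centroids)
print(assigned)
print(unassigned)
```

Counting the articles in each entry of `assigned` by day would give the apples-to-apples volume comparison, while `unassigned` is the pool to cluster separately when looking for topics such as solar energy. In practice the cutoff would need tuning against known on-topic and off-topic articles.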

FionaMcNeill · Posted 02-09-2015 01:36 PM

Hi Chris -

This paper on custom entities may be of interest: http://www.sas.com/en_us/whitepapers/discovering-what-you-want-107347.html

It's a way to include pre-defined entities into a discovery analysis with SAS Text Miner.

You may also be interested in the Text Profile node in SAS Text Miner, which is used to associate descriptive terms with different levels of a dependent (target) variable, including time.

Hope this helps,
