ChrisFromMaryland

Is there a way to score new data against an existing cluster set and to identify new clusters that are in that new data?

Perhaps an example would be best.  Let's say there is an event coming up about the environment.  I have a dataset of 10,000 online news articles over the last month that talk about the environment.  Some are off-topic, like computer environments or school environments, and others are on-topic, like carbon emissions or sea water rising.  So far, so good.

Now the event is held, a big international event.  Lots of press coverage.  My task is to see how the on-topic conversations changed in volume and how long that change lasted.

So let's say that after the event is held I download a new set of online news articles, say 20,000 records this time.  I want to do two things.  First, I want to score the new data against the rules that were built in the pre-event processing.  Think of it as an apples-to-apples analysis: using the same rules, the conversations on carbon emissions grew by X percent and lasted Y days; the conversations about sea water rising grew by A percent and lasted B days; the conversations on computer environments and school environments did not change.  Second, and this to me is the tough part, I want to uncover new topics (clusters) that may arise.  So let's say that after the event a discussion about solar energy emerges that was not in the discussions prior to the event.  (I know this sounds weird, but it happens to be true because I've already done the pre-event analysis.)  How do I identify new clusters that did not exist in the existing clustering routine?

FionaMcNeill
SAS Employee

Hi Chris -

This paper on custom entities may be of interest:  http://www.sas.com/en_us/whitepapers/discovering-what-you-want-107347.html

It's a way to include pre-defined entities into a discovery analysis with SAS Text Miner.

You may also be interested in the text profile node in SAS Text Miner, used to associate descriptive terms with different levels of a dependent (target) variable - including time.
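
To make the general idea concrete outside of SAS Text Miner, here is a rough sketch in plain Python. It is an illustration under assumptions, not a SAS feature or the Text Miner algorithm: the bag-of-words vectors, the cosine similarity measure, the `min_sim` threshold of 0.2, and the tiny sample articles are all hypothetical choices. Each new article is scored against the pre-event clusters, and anything that is not similar enough to any of them is set aside as a candidate for a new topic.

```python
import math
from collections import Counter

def vectorize(text):
    """Turn a document into lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(docs):
    """Summed term counts of all documents in one existing cluster."""
    total = Counter()
    for doc in docs:
        total.update(vectorize(doc))
    return total

def score(new_docs, centroids, min_sim=0.2):
    """Assign each new doc to its most similar existing cluster.

    Docs whose best similarity is below min_sim (an assumed cutoff)
    are returned separately as candidates for new topics.
    """
    assigned = {name: [] for name in centroids}
    unmatched = []
    for doc in new_docs:
        vec = vectorize(doc)
        name, sim = max(
            ((n, cosine(vec, c)) for n, c in centroids.items()),
            key=lambda pair: pair[1],
        )
        if sim >= min_sim:
            assigned[name].append(doc)
        else:
            unmatched.append(doc)
    return assigned, unmatched

# Hypothetical pre-event clusters (stand-ins for the existing cluster set).
pre_event = {
    "carbon emissions": ["carbon emissions rise in cities",
                         "new limits on carbon emissions"],
    "sea water rising": ["sea water rising along the coast",
                         "rising sea water threatens islands"],
}
centroids = {name: centroid(docs) for name, docs in pre_event.items()}

# Hypothetical post-event articles.
post_event = [
    "carbon emissions targets agreed",
    "sea water rising faster",
    "solar energy panels gain subsidies",
    "solar energy investment grows",
]
assigned, unmatched = score(post_event, centroids)
```

Counting the articles in `assigned` per cluster (and per day, if you keep the dates) gives the apples-to-apples volume comparison, while the `unmatched` articles can be clustered on their own to see whether a topic such as solar energy emerges. The cutoff would need tuning on real data.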

Hope this helps,
