Text mining and content categorization

Vectors / Vector Space Model

Occasional Contributor
Posts: 5

Vectors / Vector Space Model

Hello,

 

I have an observation with two text fields that contain listings of error keys.

  

I need to convert these fields into two vectors that have a zero/one indicator for each error key. I then need to be able to perform some basic matrix algebra on the vectors and store the result (a numeric value) in a third field.

 

 

EXAMPLE

Field 1: “112 1454 122 342”

Field 2: “122 1343 32”

 

Key for Vector Element:    112  1454   122   342  1343    32
Field 1 Vector:              1     1     1     1     0     0
Field 2 Vector:              0     0     1     0     1     1

 

 

This is essentially a numerical application of the Salton, Wong, and Yang (1975) vector space model.
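For readers who want to see the target output spelled out, here is a minimal sketch of the conversion described above. It is written in Python rather than SAS/IML, and the function name `indicator_vectors` is made up for illustration. It assumes the keys are separated by whitespace and that the key order is the order of first appearance across both fields, which matches the example.

```python
def indicator_vectors(field1, field2):
    """Return (keys, vec1, vec2) for two whitespace-delimited key lists."""
    tokens1 = field1.split()
    tokens2 = field2.split()
    # Distinct keys, in order of first appearance across both fields.
    keys = list(dict.fromkeys(tokens1 + tokens2))
    set1, set2 = set(tokens1), set(tokens2)
    # 1 if the key occurs in the field as a whole token, else 0.
    vec1 = [1 if k in set1 else 0 for k in keys]
    vec2 = [1 if k in set2 else 0 for k in keys]
    return keys, vec1, vec2


keys, vec1, vec2 = indicator_vectors("112 1454 122 342", "122 1343 32")
print(keys)  # ['112', '1454', '122', '342', '1343', '32']
print(vec1)  # [1, 1, 1, 1, 0, 0]
print(vec2)  # [0, 0, 1, 0, 1, 1]
```

Keys are compared as whole strings here, so "32" and "342" are never confused with each other.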

 

Does anyone have any code handy to do this, or can anyone point me to resources where I can learn it myself? I've been struggling to find stuff.

 

Thank you all!

Valued Guide
Posts: 505

Re: Vectors / Vector Space Model

I know nothing about this topic, but a quick Google search led to a couple of R packages, and since SAS now integrates with R through IML, you can use these packages from SAS.

 

see

library(RNewsflow)

https://cran.r-project.org/web/packages/RNewsflow/vignettes/RNewsflow.html

 

https://cran.r-project.org/web/packages/jmotif/README.html

Super User
Posts: 9,775

Re: Vectors / Vector Space Model

You should post this in the IML forum, since it is about matrix operations.

data have;
Field1='112 1454 122 342';
Field2='122 1343 32';output;
run;
proc iml;
use have;
read all var {field1 field2};
close;
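/* Split each field into a vector of individual keys */
n1=countw(field1);
temp1=scan(field1,1:n1);
n2=countw(field2);
temp2=scan(field2,1:n2);

/* Sorted list of the distinct keys found in either field */
all=union(temp1,temp2);

/* 0/1 indicator for each key: 1 if the key occurs in the field */
new_field1=t(element(all,temp1));
new_field2=t(element(all,temp2));
print new_field1[r=all],new_field2[r=all];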

quit;


Occasional Contributor
Posts: 5

Re: Vectors / Vector Space Model

@Ksharp

 

Thanks for the code.

 

There's one hiccup I can't work out. 

 

If document 1 equals "2 22 4 42"

and document 2 equals "2 4"

 

The vector for document 2 shows 1 for all four codes, not just 2 and 4.
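To make the expected and the observed results concrete, here is a small Python sketch (not SAS/IML). The thread does not establish why the IML code behaves this way, so `prefix_vector` is only a hypothetical illustration of one kind of loose matching that reproduces the symptom. It is not a diagnosis. Both function names are invented for this example.

```python
def exact_vector(keys, tokens):
    """1 only when the key equals one of the tokens exactly."""
    token_set = set(tokens)
    return [1 if k in token_set else 0 for k in keys]


def prefix_vector(keys, tokens):
    """Hypothetical loose match: 1 when the key merely starts with a token."""
    return [1 if any(k.startswith(t) for t in tokens) else 0 for k in keys]


doc1 = "2 22 4 42".split()
doc2 = "2 4".split()
# Distinct keys from both documents, sorted numerically for readability.
keys = sorted(set(doc1) | set(doc2), key=int)

print(keys)                       # ['2', '4', '22', '42']
print(exact_vector(keys, doc2))   # [1, 1, 0, 0]  (what I expect)
print(prefix_vector(keys, doc2))  # [1, 1, 1, 1]  (what I am seeing)
```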

 

I've modified the code slightly. Perhaps I broke something, but I don't think so:


data have;
Field1='4 2 22 42';
Field2='4 2';output;
run;
proc iml;
use have;
read all var {field1 field2};
close;
n1=countw(field1);
temp1=scan(field1,1:n1);
n2=countw(field2);
temp2=scan(field2,1:n2);

all=union(temp1,temp2);

vector1=t(element(all,temp1));
vector2=t(element(all,temp2));
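/* Euclidean norms; for 0/1 vectors the sum of squares equals the sum */
v1 = sqrt(sum(vector1));
v2 = sqrt(sum(vector2));
/* Dot product of the two indicator vectors = number of shared keys */
dotproduct = vector1` * vector2 ;
/* Cosine similarity = dot product / product of the norms */
similarity = dotproduct / (v1*v2) ;
print vector1[r=all],vector2[r=all], v1, v2, dotproduct, similarity;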

quit;

 

Any suggestions would be appreciated! Thanks so much!
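As a cross-check on the arithmetic, here is the same cosine similarity as a short Python sketch (again not SAS/IML). It assumes the vectors are already correct 0/1 indicators over the keys 2, 4, 22, 42. Returning 0.0 for an all-zero vector is a convention chosen for this example, not something taken from the thread.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for an empty document
    return dot / (norm_u * norm_v)


vector1 = [1, 1, 1, 1]  # document 1: "2 22 4 42"
vector2 = [1, 1, 0, 0]  # document 2: "2 4"
print(round(cosine_similarity(vector1, vector2), 4))  # 0.7071
```

With correct indicators the dot product is 2 and the norms are 2 and sqrt(2), so the similarity is 1/sqrt(2). If every element of document 2's vector is 1, the similarity comes out as 1, which is one way to spot the problem.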

Super User
Posts: 9,775

Re: Vectors / Vector Space Model

What if your data look like the following? What would you do then?

 
If document 1 equals "2 22 4 42 142 "
and document 2 equals "2 4"


Super User
Posts: 9,775

Re: Vectors / Vector Space Model

OK. Assuming I understand what you mean.

 

 


data have;
Field1='2 22 4 42';
Field2='2 4';output;
run;
proc iml;
use have;
read all var {field1 field2};
close;
n1=countw(field1);
temp1=scan(field1,1:n1);
n2=countw(field2);
temp2=scan(field2,1:n2);

all=union(temp1,temp2);

newfield1=j(1,ncol(all));
newfield2=j(1,ncol(all));

 do j=1 to ncol(all);
   temp=all[j];
   t=substr(temp,1:length(temp),1); 
   newfield1[j]=all(element(t,temp1));
   newfield2[j]=all(element(t,temp2));
 end;
 
want=newfield1//newfield2;
mattrib want r={newfield1 newfield2} c=all l='';
print want;
quit;