I'm working on a tool [0] that uses LSA [1] to mine the Hacking Team emails, so that people can get more relevant results than what WikiLeaks offers. There, you have a lot of stuff to sift through without knowing how each message relates to your query beyond the fact that it mentions the search word.
Right now I have to break up the term-message matrices by person before doing the partial eigenvalue decomposition that generates inverse(sigma) * transpose(u), inverse(sigma) * transpose(v), and the lower-dimensional representation of each message. It would be cool not to have to do that, but it would take more computing power than I have available (a friend let me use his 6-core/12-thread machine, which has helped a lot while building things).
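To make the inverse(sigma) * transpose(u) step concrete, here is a minimal sketch in plain Python. It is not the code from [0]: the five toy messages, the function names (top_eigenpairs, project, cosine), and the use of power iteration with deflation as the partial eigenvalue decomposition are all assumptions for illustration.

```python
import math

# Toy, made-up tokenized "messages"; the real corpus is far larger.
messages = [
    ["exploit", "zero-day", "exploit"],
    ["zero-day", "exploit"],
    ["invoice", "payment"],
    ["invoice", "invoice", "invoice", "payment"],
    ["exploit"],
]
vocab = sorted({t for m in messages for t in m})

# Term-message count matrix X: one row per term, one column per message.
X = [[m.count(t) for m in messages] for t in vocab]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_eigenpairs(sym, k, iters=500):
    """Top-k (eigenvalue, eigenvector) pairs of a symmetric PSD matrix,
    found by power iteration with deflation (hypothetical helper)."""
    n = len(sym)
    work = [row[:] for row in sym]
    found = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(n)]  # deterministic start vector
        for _step in range(iters):
            w = [dot(row, v) for row in work]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        lam = dot(v, [dot(row, v) for row in work])  # Rayleigh quotient
        found.append((lam, v))
        # Deflate: remove the component just found before the next pass.
        for i in range(n):
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return found

# Eigenvectors of X * transpose(X) are the columns of u;
# the eigenvalues are the squared singular values (sigma squared).
gram = [[dot(r1, r2) for r2 in X] for r1 in X]
pairs = top_eigenpairs(gram, k=2)

def project(term_vector):
    """Compute inverse(sigma) * transpose(u) * term_vector."""
    return [dot(u, term_vector) / math.sqrt(lam) for lam, u in pairs]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Fold a one-word query into the 2-dimensional space and rank messages.
query = [1.0 if t == "zero-day" else 0.0 for t in vocab]
q_hat = project(query)
columns = [[row[j] for row in X] for j in range(len(messages))]
sims = [cosine(q_hat, project(col)) for col in columns]
for tokens, s in zip(messages, sims):
    print(f"{s:+.3f}  {' '.join(tokens)}")
```

In this toy setup the last message only contains "exploit", yet it scores close to the two messages that actually mention "zero-day", while the invoice/payment messages score near zero. That is the kind of relatedness a plain keyword match misses.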
Ideally it would eventually be hosted somewhere, because the project itself might be a bit complicated/tedious for most people to set up themselves. That would let people search it using these indexes and let (independent) journalists sift through everything in an arguably better way.
[0] https://github.com/cinquemb/hackedteam-email-index-mining
[1] https://en.wikipedia.org/wiki/Latent_semantic_analysis#Deriv...