Tuesday, October 25, 2011

Active Info Graph

The graph is a beautiful data structure in both computer science and maths; I have loved it since I first got some basic ideas about it in school.
One of the best ways to innovate is to forget everything you have learned in school. Or do you trap yourself in the thought that someone has already made it happen?

Here is some academic writing (maybe Google or other big players have already implemented this?):
http://en.wikipedia.org/wiki/Tf%E2%80%93idf (in Lucene: http://lucene.apache.org/java/docs/index.html)
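
To make the tf-idf idea concrete, here is a minimal sketch of one common variant: term frequency normalised by document length, multiplied by log(N / df). The function name `tf_idf` and the token-list input are assumptions for illustration only; Lucene's own scoring formula is more elaborate and is not reproduced here.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. This is a textbook variant:
    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus: three tiny "documents".
docs = [["graph", "data", "graph"], ["data", "job"], ["data", "news"]]
weights = tf_idf(docs)
# Print the best keywords of the first document, highest weight first.
print(sorted(weights[0], key=weights[0].get, reverse=True))
```

A term that appears in every document ("data" here) gets weight zero, which is exactly why tf-idf is useful for picking out the keywords that make one document different from the rest.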

What I can dream of
Small World Phenomenon (http://introcs.cs.princeton.edu/java/45graph/)
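
The small-world idea is usually measured as "degrees of separation", and a breadth-first search is the standard way to compute it. Below is a minimal sketch, assuming the graph is a plain dict mapping each vertex to a list of neighbours; the names `degrees_of_separation`, `me`, `ann` and so on are made up for the example.

```python
from collections import deque

def degrees_of_separation(graph, source):
    """Breadth-first search: map each reachable vertex to its hop count from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph.get(v, ()):
            if w not in dist:
                # First time we see w, so this is the shortest path to it.
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

# Hypothetical friendship graph (undirected, so edges are listed both ways).
friends = {
    "me": ["ann", "bob"],
    "ann": ["me", "cat"],
    "bob": ["me"],
    "cat": ["ann", "dan"],
    "dan": ["cat"],
}
print(degrees_of_separation(friends, "me"))
```

Vertices that cannot be reached from the source simply do not appear in the result.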

Case study
In the passive info world, I have to google to find useful things.
E.g.: http://www.vietnamworks.com/jobseekers/searchresults.php?search=true&industry=35&gclid=CM_Y3MSz7JICFQgaewodSE5zwg&lang=2
Captured from vietnamworks

Can this info actively come to me?

Sunday, October 16, 2011

Data Scientist - Books, Links, Papers, Tools, Projects

On the way to preparing and studying for a new job and the new trends of the post-Web 2.0 era, I still think about what I should do, study, research, blah blah... to become a Data Scientist, a truly scientific job.

The trend is that data is generated by massive numbers of users, and tons of data are everywhere: blogs, Facebook, YouTube, Twitter, ...
We have to deal with it every day. Your physical brain has to process a lot of news, information, work, ... at the same time, filtering out what is useful information, then the knowledge you should capture, and then the wisdom (http://www.systems-thinking.org/dikw/dikw.htm)
=> Stress, overload, ... or the limit of the biological brain.

So I am on the way to implementing my idea, the "My Second Brain" project: http://code.google.com/p/my-second-brain/

As the name suggests, it should help me process tons of email, blogs, RSS feeds and local news to find the keywords and the trends. That can save me the time of manually reading, classifying and tagging the key information, so I can focus all my energy on doing cool things and making decisions to improve my skills and my career.
To change the world, I should at least change my life first, and then share it with everyone.

First, how to extract the content of local news and rank the best keywords ==> http://code.google.com/p/boilerpipe/

The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.
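
To show what "removing clutter" can mean in practice, here is a crude sketch. It is not boilerpipe's API and not its actual classifier; it only assumes that very short blocks and blocks made mostly of links are boilerplate. The function name `main_content`, the `(text, linked_words)` input format and both thresholds are hypothetical choices for illustration.

```python
def main_content(blocks, min_words=10, max_link_density=0.3):
    """Keep the blocks that look like article text.

    blocks is a list of (text, linked_words) pairs, where linked_words is
    the number of words in the block that sit inside <a> tags.
    """
    kept = []
    for text, linked_words in blocks:
        n_words = len(text.split())
        if n_words == 0 or n_words < min_words:
            continue  # too short: likely a menu item, caption or footer
        if linked_words / n_words > max_link_density:
            continue  # mostly links: likely navigation or "related stories"
        kept.append(text)
    return kept

# Hypothetical blocks, as if already split out of a news page.
page = [
    ("Home News Sports Jobs", 4),
    ("The city council approved a new budget on Monday after a long "
     "debate about transport funding", 0),
    ("Read more about this and other related stories from our partners "
     "today here now", 12),
]
for block in main_content(page):
    print(block)
```

The kept text could then go through a keyword ranking step to pick out the best keywords of each article.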


Second, natural language processing: OpenNLP

OpenNLP is an organizational center for open source projects related to natural language processing. Its primary role is to encourage and facilitate the collaboration of researchers and developers on such projects.
OpenNLP also hosts a variety of java-based NLP tools which perform sentence detection, tokenization, pos-tagging, chunking and parsing, named-entity detection, and coreference using the OpenNLP Maxent machine learning package.

Third, full-text search: Apache Lucene

Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Fourth, machine learning: Apache Mahout

The Apache Mahout™ machine learning library's goal is to build scalable machine learning libraries.

Fifth, the Google Cloud & some tools
Hooking into the browsing job: Chrome extensions, http://code.google.com/chrome/extensions/overview.html. Private cloud storage, cheap and cool: Gmail, https://mail.google.com/

Sixth, Jetty, for running your personal service: http://jetty.codehaus.org/jetty/ , http://code.google.com/p/i-jetty/

Seventh, the mobile way in which information is collected and consumed: http://www.phonegap.com/about

Eighth, finally, visualizing your personal information: http://mbostock.github.com/protovis/ , http://thejit.org/ , https://github.com/mbostock/d3

The big picture in one photo