"Ideas are the fundamental way in which information is conveyed in written text. This research investigates the discovery and extraction of ideas from corpuses of scientiic literature. There are several elements to this work: (1) the functional definition of ideas; (2) the computation of novel ideas; (3) the representation of ideas; (4) the construction of a ground truth dataset; and (5) the use of citations as an idea container.
An idea is defined as a <problem, solution> pair, where the problem and the solution are each represented by a noun phrase or, more generally, a sequence of words. Idea detection therefore reduces to problem and solution extraction. This task is similar to Named Entity Recognition (NER), with problems and solutions treated as special entities. NER-style techniques worked well, although the results contained considerable noise that needs to be removed.
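The NER-style framing can be illustrated with a minimal sketch. It assumes a BIO tagging scheme with hypothetical labels PROB and SOL (the label names and the example sentence are illustrative and not taken from this work) and shows how tagged tokens could be decoded into problem and solution phrases.

```python
from typing import List, Tuple


def extract_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Decode BIO tags into (label, phrase) spans, as in NER post-processing.

    The labels "PROB" and "SOL" used below are assumptions for illustration.
    """
    spans = []
    current_label, current_words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new span begins; close any span that is still open.
            if current_label is not None:
                spans.append((current_label, " ".join(current_words)))
            current_label, current_words = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continue the open span of the same label.
            current_words.append(token)
        else:
            # "O" tag or an inconsistent "I-" tag: close the open span.
            if current_label is not None:
                spans.append((current_label, " ".join(current_words)))
            current_label, current_words = None, []
    if current_label is not None:
        spans.append((current_label, " ".join(current_words)))
    return spans


tokens = ["We", "reduce", "overfitting", "using", "dropout", "regularization"]
tags = ["O", "O", "B-PROB", "O", "B-SOL", "I-SOL"]
print(extract_spans(tokens, tags))
# [('PROB', 'overfitting'), ('SOL', 'dropout regularization')]
```

Pairing each extracted problem with the solutions found in the same abstract then yields <problem, solution> candidates; filtering the noisy ones is a separate step.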
Automatic idea generation was conducted using a dataset from the Journal of Science. Old ideas were defined as the <problem, solution> pairs that occur together in the same abstract, and new ideas were generated by predicting new links between problems and solutions that do not occur together in any single abstract. Evaluation used metrics that are widely used in information retrieval. The F1 scores (higher than 0.90) provide good evidence that the proposed method is capable of generating useful ideas.
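The distinction between old ideas and candidate new links can be sketched as follows. The input layout (a list of dictionaries with "problems" and "solutions" keys) and the toy abstracts are assumptions made for illustration. The sketch only enumerates candidate links; it does not reproduce the prediction model used in this work, which would rank or filter these candidates.

```python
from itertools import product


def split_old_and_candidate(abstracts):
    """Return (old, candidates) sets of (problem, solution) pairs.

    old:        pairs that co-occur in at least one abstract.
    candidates: pairs of known problems and solutions that never co-occur.
    """
    old = set()
    problems, solutions = set(), set()
    for abstract in abstracts:
        problems.update(abstract["problems"])
        solutions.update(abstract["solutions"])
        # Every problem/solution combination within one abstract is an old idea.
        old.update(product(abstract["problems"], abstract["solutions"]))
    # Candidate new ideas are all remaining cross-abstract combinations.
    candidates = set(product(problems, solutions)) - old
    return old, candidates


# Hypothetical toy data, not drawn from the dataset described above.
abstracts = [
    {"problems": ["overfitting"], "solutions": ["dropout"]},
    {"problems": ["slow training"], "solutions": ["batch normalization"]},
]
old, candidates = split_old_and_candidate(abstracts)
print(sorted(candidates))
# [('overfitting', 'batch normalization'), ('slow training', 'dropout')]
```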
A ground truth dataset of <problem, solution> pairs was constructed from the publications of the International Conference on Neural Information Processing Systems and the Journal of Machine Learning Research. The data were annotated by human volunteers and used to train idea detection models based on Conditional Random Fields (CRF) and Long Short-Term Memory (LSTM) networks. Model performance was evaluated by computing precision and recall.
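For reference, precision, recall, and the F1 score derived from them can be computed over sets of predicted and annotated pairs as in the sketch below. The exact-match criterion and the toy pairs are assumptions; the matching scheme used in this work (for example, partial span overlap) may differ.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 using exact matching of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical pairs for illustration only.
gold = {("overfitting", "dropout"), ("slow training", "batch normalization"),
        ("vanishing gradients", "residual connections"), ("noisy labels", "label smoothing")}
predicted = {("overfitting", "dropout"), ("slow training", "batch normalization"),
             ("overfitting", "label smoothing")}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.67 recall=0.50 f1=0.57
```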
Idea analysis was carried out by analyzing citations, which are considered to be containers for ideas. Word vectors were used to represent citations for the purpose of classifying citation sentiment, and a method was developed to measure the sequence of citation sentiment. This method for analyzing the internal citation sentiment sequence worked well (F1 measure of 0.86)."