Using linguistic information as features for text categorization
published: Nov. 26, 2007, recorded: September 2007, views: 4823
We report on some experiences using linguistic information as additional features in a classical Vector Space Model. Information extracted for each word, such as its part of speech, stem, and lexical root, has been combined in different ways to test for a possible improvement in classification performance across several algorithms, such as SVM, BBR, and PLAUM. Automatic Text Classification, also known as Automatic Text Categorization, tries to relate documents to a predefined set of classes. Extensive research has been carried out on this subject, and a wide range of techniques are applicable to this task: feature extraction, feature weighting, dimensionality reduction, machine learning algorithms, and more. The classification task can be binary (one of two possible classes is selected), multi-class (one of a set of possible classes) or multi-label (a set of classes drawn from a larger set of potential candidates). In most cases, the latter two can be reduced to binary decisions, as the algorithm used in our experiments does. To verify the contribution of the new features, we combined them for inclusion in the vector space model by preprocessing the Reuters-21578 collection, a data set well known to the research community working on text categorization problems.
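The abstract does not say exactly how the linguistic features were combined, so the following is only a minimal sketch of one plausible approach: each annotated token is turned into one or more feature strings, and the document becomes a term-frequency vector over those strings. The token annotations, the `make_features` and `to_vector` helpers, and the combination mode names are all hypothetical and chosen for illustration, not taken from the lecture.

```python
from collections import Counter

# Hypothetical pre-annotated tokens: (word, stem, part-of-speech tag).
# In practice these would come from a tagger and a stemmer; values are made up.
doc = [
    ("prices", "price", "NOUN"),
    ("rose", "rise", "VERB"),
    ("prices", "price", "NOUN"),
]

def make_features(tokens, mode):
    """Turn annotated tokens into feature strings for one combination scheme."""
    feats = []
    for word, stem, pos in tokens:
        if mode == "word":          # baseline: surface form only
            feats.append(word)
        elif mode == "stem":        # stem replaces the word
            feats.append(stem)
        elif mode == "stem+pos":    # stem and tag fused into a single feature
            feats.append(f"{stem}/{pos}")
        elif mode == "word|pos":    # word and tag kept as separate features
            feats.extend([word, f"POS={pos}"])
        else:
            raise ValueError(f"unknown mode: {mode}")
    return feats

def to_vector(features, vocabulary):
    """Map a list of feature strings to raw term counts over a fixed vocabulary."""
    counts = Counter(features)
    return [counts[term] for term in vocabulary]

feats = make_features(doc, "stem+pos")
vocab = sorted(set(feats))
vector = to_vector(feats, vocab)
```

Swapping the `mode` argument changes the feature space while leaving the vectorization and the downstream classifier untouched, which is what makes it straightforward to compare combinations on the same collection.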