published: Oct. 9, 2017, recorded: August 2017, views: 1084
Random forests are among the most successful methods used in data mining because of their accuracy and effectiveness. However, their use is largely limited to multidimensional data, because they sample features from the original data set. In this paper, we propose a method for extending random forests to work with an arbitrary set of data objects, as long as similarities can be computed between the objects. Furthermore, since computing similarities between all $O(n^2)$ pairs of objects can be expensive, our method computes only a very small fraction of these pairwise similarities to construct the forests. Our results show that the proposed similarity forest approach is extremely efficient and also very accurate on a wide variety of data sets. This paper therefore significantly extends the applicability of random forest methods to arbitrary data domains. The approach even outperforms traditional random forests on multidimensional data. In many cases, the similarity matrices learned from arbitrary applications are noisy, because it is difficult to estimate similarity values between pairs of objects. Similarity forests are very robust to such errors when used for classification. In many practical settings, the similarity values between objects are also incompletely specified, because such values are difficult to collect. In such cases, the similarity forest approach extends naturally to a partially specified similarity matrix.
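The abstract does not spell out how a tree node is split using similarities alone, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that a node picks two reference objects from different classes, scores every object by the difference of its similarities to the two references, and chooses the threshold with the lowest weighted Gini impurity. The names `best_similarity_split`, `gini`, and `jaccard`, and the toy string data, are hypothetical. The point it illustrates is that such a split needs only 2n similarity evaluations for n objects, rather than all pairwise values.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_similarity_split(objects, labels, sim, rng):
    """Split one node using only similarities to two reference objects.

    Assumption (not stated in the abstract): the references p and q are
    drawn at random from different classes, and each object x is scored
    by sim(x, q) - sim(x, p).
    Returns (impurity, threshold, p_index, q_index), or None if no split exists.
    """
    idx = list(range(len(objects)))
    p = rng.choice(idx)
    others = [i for i in idx if labels[i] != labels[p]]
    if not others:
        return None  # all objects share one class: nothing to split
    q = rng.choice(others)

    # Two similarity evaluations per object: 2n in total for this node.
    proj = [sim(x, objects[q]) - sim(x, objects[p]) for x in objects]

    order = sorted(idx, key=lambda i: proj[i])
    n = len(idx)
    best = None
    for k in range(1, n):
        lo, hi = order[k - 1], order[k]
        if proj[lo] == proj[hi]:
            continue  # no threshold can separate equal scores
        left = [labels[i] for i in order[:k]]
        right = [labels[i] for i in order[k:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[0]:
            best = (score, (proj[lo] + proj[hi]) / 2.0, p, q)
    return best


# Toy non-vector data: strings compared by Jaccard overlap of their characters.
calls = {"n": 0}


def jaccard(a, b):
    calls["n"] += 1  # count similarity evaluations
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


objects = ["aaa", "aab", "abb", "xyz", "xyy", "zzy"]
labels = [0, 0, 0, 1, 1, 1]

result = best_similarity_split(objects, labels, jaccard, random.Random(0))
print("impurity, threshold, p, q:", result)
print("similarity evaluations:", calls["n"], "for", len(objects), "objects")
```

A full forest would apply such a split recursively and average many randomized trees. When some similarity values are missing, one possible adaptation (again an assumption) is to draw reference objects only from those whose similarities to the node's objects are available.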