Came across a great tutorial on the basics of natural language processing (NLP) and classification.
The example in this tutorial uses a Python library called gensim, which (according to its website) is “the most robust, efficient and hassle-free piece of software to realize unsupervised semantic modelling from plain text.” As far as I understand it, it’s a very handy and commonly-used library for NLP.
One important tool/algorithm used for classification is the Vector Space Model. Basically, all documents and queries are represented as “vectors of identifiers” (such as an index of words), and the angle (theta) between vectors is used as a similarity measure (explained further below). Incidentally, if you ranked the similarity of each document to a query, you’d have a search engine.
A common way to represent a written document as a vector is to think about it in terms of Bag of Words (BoW) vectors. For example, if an entire document consists only of the words “dog” and “cat”, then the BoW vector for this document would be [# of ‘dog’, # of ‘cat’].
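As a quick illustration (the vocabulary and document below are made up, not from the tutorial), a BoW vector is just a list of word counts over a fixed vocabulary:

```python
# Minimal sketch of a Bag of Words vector; vocabulary and document are toy examples.
from collections import Counter

vocabulary = ["dog", "cat"]            # the "index of words"
document = "dog cat dog dog".split()   # a toy document

counts = Counter(document)
bow_vector = [counts[word] for word in vocabulary]
print(bow_vector)  # [3, 1] -> [# of 'dog', # of 'cat']
```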
A very high-level overview of the workflow for NLP using the Vector Space Model is as follows:
Preprocessing -> Create “Bag of Words” vector -> Dimensionality Reduction -> use SVM algorithm for classification
A more detailed workflow is as follows:
Remove stopwords and split on spaces -> take out rare terms using gensim -> create Bag of Words vectors -> dimensionality reduction using the LSI (Latent Semantic Indexing) model in gensim (topic vectorization) -> unit vectorization -> find cosine distance
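Here’s a rough sketch of that pipeline in gensim. The toy documents, stopword list, filtering thresholds and number of topics are placeholders I’ve made up, not the tutorial’s actual data:

```python
# Rough sketch of the workflow using gensim; data and parameters are placeholders.
from gensim import corpora, models, matutils

documents = [
    "the dog chased the other dog around the park",
    "I ate a ham sandwich and another sandwich for lunch",
    "the dog barked at the sandwich shop",
]
stopwords = {"the", "a", "an", "and", "i", "at", "for", "of"}

# Remove stopwords and split on spaces
texts = [[w for w in doc.lower().split() if w not in stopwords] for doc in documents]

# Build a dictionary and take out rare terms
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=2, no_above=1.0)  # drop words seen in < 2 docs

# Create Bag of Words vectors
bow_corpus = [dictionary.doc2bow(text) for text in texts]

# Dimensionality reduction with an LSI model (topic vectorization)
lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=2)
topic_vectors = [lsi[bow] for bow in bow_corpus]

# Unit vectorization, then cosine similarity between the first two documents
v0 = matutils.unitvec(topic_vectors[0])
v1 = matutils.unitvec(topic_vectors[1])
print(matutils.cossim(v0, v1))
```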
Several important concepts to be aware of from this tutorial:
- Vector Space Model
- The Curse of Dimensionality
- “There are all kinds of terrible things that happen as the dimensionality of your descriptor vectors rises. One obvious one is that as the dimensionality rises, both the time and space complexity of dealing with these vectors rises, often exponentially. Another issue is that as dimensionality rises, the amount of samples needed to draw useful conclusions from that data also rises steeply. Another way of phrasing that is with a fixed number of samples, the usefulness of each dimension diminishes. Finally, as the dimensionality rises, your points all tend to start becoming equidistant to each other, making it difficult to draw solid conclusions from them.” (A small numeric illustration of this effect appears after this list.)
- Similarity in Vector Space
- Euclidean distance - the straight-line distance between two vectors; it depends on vector magnitude (e.g. document length) rather than just direction, which makes it a poor similarity measure for documents.
- Cosine distance - measuring similarity based on the angle between vectors is known as cosine distance, or cosine similarity.
- Unit vectorization - modify the vectors themselves by dividing each number in each vector by that vector’s magnitude. In doing so, all our vectors have a magnitude of 1. This process is called unit vectorization because the output vectors are unit vectors. (A small numpy sketch of these ideas appears after this list.)
- Supervised Learning
- Train the algorithm on samples which have the ‘correct’ answer provided with them. The specific supervised learning problem we’re addressing here is called classification. You train an algorithm on labelled descriptor vectors, then ask it to label a previously unseen descriptor vector based on conclusions drawn from the training set.
- Support Vector Machine - SVM is a family of algorithms which define decision boundaries between classes based on labelled training data.
- “For our ‘dog’ vs. ‘sandwich’ classification problem, we provide the algorithm with some training samples. These samples are documents which have gone through our whole process (BoW vector -> topic vector -> unit vector) and carry with them either a ‘dog’ label or a ‘sandwich’ label. As you provide the SVM model with these samples, it looks at these points in space and essentially draws a line between the ‘sandwich’ documents and the ‘dog’ documents. This border between “dog”-land and “sandwich”-land is known as a decision boundary. Whichever side of the line the query point falls on determines what the algorithm labels it.”
- “All samples in both training and test sets are labeled. However, in practice, you would build the model on the labeled training set, ignore the labels on the test set, feed them into the model, have the model guess what those labels are, and finally check whether or not the algorithm guessed correctly. This process of testing out your supervised learning algorithm with a training and test set is called cross-validation.” (A rough scikit-learn sketch of this train/test workflow appears after this list.)
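To make the “points become equidistant” effect from the curse-of-dimensionality quote concrete, here’s a small numpy experiment (random data, not from the tutorial) that tracks the ratio of the nearest to the farthest pairwise distance as dimensionality grows:

```python
# Curse of dimensionality: with random points, the nearest and farthest
# neighbours become almost the same distance away as dimension grows.
import numpy as np

rng = np.random.default_rng(0)
n_points = 100
for dim in (2, 10, 100, 1000):
    points = rng.random((n_points, dim))
    # all pairwise Euclidean distances (upper triangle, excluding the diagonal)
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)[np.triu_indices(n_points, k=1)]
    print(f"dim={dim:4d}  min/max distance ratio = {dists.min() / dists.max():.3f}")
# The ratio creeps toward 1.0 as dim rises, i.e. the points look nearly equidistant.
```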
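And a minimal numpy sketch of unit vectorization and cosine similarity (the vectors are made up, standing in for two documents about the same topics but of different lengths):

```python
# Unit vectorization and cosine similarity on toy vectors.
import numpy as np

a = np.array([3.0, 1.0])    # e.g. [# of 'dog', # of 'cat'] in a short document
b = np.array([30.0, 10.0])  # a much longer document pointing the same way

# Euclidean distance is large even though the two vectors share a direction
print(np.linalg.norm(a - b))   # ~28.46

# Unit vectorization: divide each vector by its own magnitude
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# Cosine similarity is the dot product of the unit vectors (1.0 = same direction)
print(np.dot(a_unit, b_unit))  # 1.0
```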
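Finally, a hedged sketch of the supervised-learning step. The tutorial trains an SVM on labelled topic vectors; here I’m using scikit-learn’s SVC on synthetic 2-D points (not the tutorial’s code or data) just to show the train-on-labelled-samples, test-on-held-out-samples workflow:

```python
# SVM classification with a held-out test set; the data is synthetic.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Pretend 2-D topic vectors: one cluster per class
dog = rng.normal(loc=[0.9, 0.1], scale=0.05, size=(50, 2))
sandwich = rng.normal(loc=[0.1, 0.9], scale=0.05, size=(50, 2))
X = np.vstack([dog, sandwich])
y = np.array(["dog"] * 50 + ["sandwich"] * 50)

# Hold out a labelled test set, train on the rest, then check the guesses
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="linear").fit(X_train, y_train)
print(clf.score(X_test, y_test))  # fraction of test labels guessed correctly
```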