Data Science Altitude for This Article: Camp Two. Here, we tackle something a little more difficult than in past posts: the use of Probabilistic Topic Modeling for thematic assessment in literature. Got a lot of documents that you're trying to condense into manageable themes? From an academic perspective, you might be interested in the thematic constructs in one or more of Mark Twain's works. Or, from a socio-political perspective and something a little more present-day, you might want to identify recurring themes in political speeches and see how they track among candidates for a particular office.
So, all the pieces on the chessboard are in their strategic locations. We've identified a set of papers whose thematic intent we want to uncover, taking The Federalist Papers directly from the Project Gutenberg site. We've cleaned them up, removing common words and metadata, and have formatted them into a DocumentTermMatrix. We then pulled that object into a Latent Dirichlet Allocation (LDA) model as defined in the topicmodels package and took a look at some of the high-level mathematics involved and at the resulting object's composition.
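For anyone retracing those steps end to end, here is a minimal sketch of that pipeline rather than the exact code from the earlier posts. The Gutenberg ID, the heading regex used to split the corpus into the 85 individual papers, the stemming step, and the choice of k are all illustrative assumptions.

```r
# A minimal sketch of the pipeline recapped above; not the exact code from
# the earlier posts. The Gutenberg ID, the paper-splitting regex, and k
# are assumptions for illustration.
library(gutenbergr)
library(dplyr)
library(stringr)
library(tidytext)
library(topicmodels)

# Pull the raw text from Project Gutenberg (ID 1404 is assumed here to be
# the Federalist Papers collection).
fed_raw <- gutenberg_download(1404)

# Assign each line to a paper, assuming every paper opens with a heading
# line beginning with "FEDERALIST". Lines before the first heading
# (Gutenberg front matter) get paper 0 and are dropped.
fed_papers <- fed_raw %>%
  mutate(paper = cumsum(str_detect(text, "^FEDERALIST"))) %>%
  filter(paper >= 1)

# Tokenize to one word per row, drop common (stop) words, then stem.
fed_words <- fed_papers %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(word = SnowballC::wordStem(word))

# Count stems per paper and cast the counts to a DocumentTermMatrix.
fed_dtm <- fed_words %>%
  count(paper, word) %>%
  cast_dtm(document = paper, term = word, value = n)

# Fit the LDA model; k = 4 topics and the seed are illustrative choices.
fed_lda <- LDA(fed_dtm, k = 4, control = list(seed = 1234))
```

Casting a tidy word-count table into a DocumentTermMatrix via cast_dtm() is what lets the cleanup steps hand off cleanly to topicmodels::LDA(), which expects that sparse document-term format.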
We created that DocumentTermMatrix for the express purpose of fitting neatly into the LDA model formation. Here, we'll build on that discussion of LDA and dive into our findings. But first, a brief refresher on the format and content of the DT matrix: we want the word counts for the first ten stemmed words across the first eight documents, along with their aggregation over documents 9-85.
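One way to pull those views straight from the matrix is with the tm package. This is a sketch, assuming fed_dtm is the DocumentTermMatrix built in the pipeline above:

```r
library(tm)

# Word counts for the first ten stemmed terms across the first eight papers.
inspect(fed_dtm[1:8, 1:10])

# The same ten terms, aggregated over papers 9 through 85.
colSums(as.matrix(fed_dtm[9:85, 1:10]))
```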