Data Science Altitude for This Article: Base Camp. I’m going to have to pause the blog briefly. I’ve been offered a position as a Data Scientist and will need to relocate, so there will be a stretch where I’m dealing with a move and spending time getting proficient with my new employer’s tech stack.
Upon return, I’m going to arrange some thoughts on the job interview process.
Data Science Altitude for This Article: Camp Two. As we concluded our last post, we used Azure ML Studio to fit a logistic regression model to some data for a hypothetical electrical grid (11 predictor variables). We had a first look at the output, which scores whether our response variable was adequately predicted, and promised a dive into the scoring. We’ll get to that, but first I’d like to put together a second, competing model to compare and contrast with our logistic regression efforts.
Data Science Altitude for This Article: Camp Two.
Splitting the Data
Azure ML Studio gives us a handy way to split our data and provides some alternatives in doing so. I’ll mark 80% of our data for use in training the model, and 20% to use in scoring it, with an eye toward which model does better when encountering new data.
With our split of categories to predict being roughly 60/40, there’s little to gain from ensuring that the division of data into 80% train / 20% test keeps to a consistent 60/40 split along those category percentages.
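To make the point about the 60/40 mix concrete, here is a minimal sketch in plain Python rather than in Azure ML Studio itself. The data, the label names (`class_a`, `class_b`), the row count, and the helper functions are all hypothetical stand-ins, not the grid data set from the post. It shuffles the rows, cuts them 80/20 without stratifying, and reports the share of the majority class on each side.

```python
import random
from collections import Counter


def split_train_test(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into train and test portions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def class_share(rows, label):
    """Fraction of rows carrying the given label."""
    counts = Counter(row["label"] for row in rows)
    return counts[label] / len(rows)


# Hypothetical stand-in data: 10,000 rows with an exact 60/40 label mix.
rows = [
    {"id": i, "label": "class_a" if i % 5 < 3 else "class_b"}
    for i in range(10_000)
]

train, test = split_train_test(rows)

print(f"train rows: {len(train)}, share of class_a: {class_share(train, 'class_a'):.3f}")
print(f"test rows:  {len(test)}, share of class_a: {class_share(test, 'class_a'):.3f}")
```

With classes this balanced and a sample this large, a purely random split should land close to 60/40 on both sides, which is the reasoning behind skipping stratification here. A much rarer class or a much smaller data set would change that calculation.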
Data Science Altitude for This Article: Camp Two. Our prior posts set the stage for access to MS Azure’s ML Studio and got us rolling on data loading, problem definition and the initial stages of Exploratory Data Analysis (EDA). Let’s finish off the EDA phase so that in our next post, we can get to evaluating the first of two models we’ll use to forecast stability - or the lack thereof - for a hypothetical electrical grid.
Data Science Altitude for This Article: Camp Two. We left our prior post having covered what’s involved in setting up a Microsoft Account, a survey of your access options (Quick Evaluation / Most Popular / Enterprise Grade), and a brief and basic tour of the Azure ML Studio interface.
We’re going to head there shortly and get a big whiff of the drag-and-drop catnip, but first, let’s discuss the particular problem I’d like to throw at it and set the stage for the next several posts.
Data Science Altitude for This Article: Camp Two. While working my way through the code in Wei-Meng Lee’s excellent Python Machine Learning, I ran into Chapter 11, titled Using Azure Machine Learning Studio.
Azure ML Studio sports a drag-and-drop interface that lets you tackle many Data Science problems without the need to write a line of code. You’re not absolved from knowing why you’re doing what you’re doing, of course, but you can try out some problem-solving proofs-of-concept in short order.
Data Science Altitude for This Article: Camp One. If you’re a Python developer who has been thrust into an R working environment, or an R developer who would like to try out Python packages and methods in the comfort of an already-familiar RStudio IDE, then reticulate is the package for you. Using rmarkdown, you’re able to knit R and Python code blocks into a unified whole and can refer to objects across the language barrier.
Data Science Altitude for This Article: Sea Level. Just a few words here about upcoming posts - I’m doing some research on the following topics:
The use of the reticulate package in R to integrate R and Python more easily into the same markdown document. Some enhancements have been made with RStudio 1.2 that are of interest. An interactive Python session can be invoked, and there’s a standard naming convention for referencing objects in Python chunks that were created in R chunks, and vice versa.
Data Science Altitude for This Article: Camp Two. So, all the pieces on the chessboard are in their strategic locations. We’ve identified a set of papers from which we want to identify thematic intent, taking The Federalist Papers directly from the Project Gutenberg site. We’ve cleaned them up, removing common words and metadata, and have formatted them into a DocumentTermMatrix. We then pulled that object into a Latent Dirichlet Allocation (LDA) model as defined in the topicmodels package and took a look at some of the high-level mathematics involved and the resulting object’s composition.
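As a rough illustration of the cleanup-then-count step described above, here is a minimal sketch in plain Python. The posts themselves work in R with a DocumentTermMatrix and the topicmodels package, so this is not the author's code. The stop-word list, the two sample sentences, and the function names are made up for illustration, and stemming is left out entirely.

```python
from collections import Counter

# Tiny illustrative stop-word list; a real one would be far longer.
STOP_WORDS = {"the", "of", "to", "and", "a", "in", "is"}


def tokenize(text):
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def document_term_matrix(docs):
    """Return the sorted vocabulary and one row of term counts per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


# Made-up sample sentences, not text from The Federalist Papers.
docs = [
    "The people of the union.",
    "The union and the government of the people.",
]

vocab, matrix = document_term_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row is a document and each column a term, which is the same shape of input an LDA model works from: word counts per document, with word order discarded.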
Data Science Altitude for This Article: Camp Two. Previously, we created a DocumentTermMatrix expressly so that it would fit in nicely with our upcoming LDA model formation. Here, we’ll discuss LDA in a bit of detail and dive into our findings. But first, a brief refresher on the format and content of the DT matrix. The word counts for the first ten stemmed words in each of the first eight documents, and their aggregation for documents 9-85, are seen below: