We may have bitten off more than we could chew, folks.
An Amazon engineer told me that when he heard what I was trying to do with Ars headlines, the first thing he thought was that we had chosen a deceptively difficult problem. He warned that I had to be careful about setting my expectations properly. If this was a real business problem … well, the best thing he could do was propose to reformulate the problem from “good or bad headline”
That statement was the most family-friendly and concise way to frame the outcome of my four-week, part-time crash course in machine learning. At the moment, my PyTorch kernels are not so much torches as they are garbage fires. Accuracy has improved somewhat thanks to professional intervention, but I'm nowhere near deploying a workable solution. Today, while ostensibly on vacation visiting my parents for the first time in over a year, I sat on a sofa in their living room working on this project and accidentally started a model training job locally on the Dell laptop I had with me (with a 2.4 GHz Intel Core i3 7100U CPU) instead of in the SageMaker copy of the same Jupyter notebook. The Dell locked up so hard that I had to pull out the battery to restart it.
But hey, if the machine isn't necessarily learning, at least I am. We're almost at the end, but if this were a classroom assignment, my grade on the transcript would probably be "Incomplete."
The gang is trying some machine learning
To summarize: I was given the pairs of headlines used for Ars articles over the last five years, along with data on the A/B test winners and their relative clickthrough rates. Then I was asked to use Amazon Web Services' SageMaker to create a machine learning algorithm to predict the winner in future pairs of headlines. I ended up going down some ML dead ends before consulting various Amazon sources for some much-needed help.
Most of the pieces are in place to complete this project. We (more precisely, my "call a friend at AWS" lifeline) had some success with different modeling approaches, even though the accuracy (just north of 70 percent) was not as definitive as one would like. I have enough to work with to produce (with a little extra elbow grease) a deployed model and code to run predictions on pairs of headlines, if I crib their notes and use the algorithms created as a result.
But I have to be honest: my efforts to reproduce that work both on my own local server and on SageMaker have fallen flat. In the process of fumbling through the intricacies of SageMaker (including forgetting to shut down notebooks, running automated learning processes that I was later advised were for "corporate customers," and other missteps), I have burned through more AWS budget than I would be comfortable spending on an unfunded adventure. And while I understand intellectually how to deploy the models that have come out of all this futzing around, I am still debugging the actual execution of that deployment.
If nothing else, this project has become a very interesting lesson in all the ways machine learning projects (and the people behind them) can fail. And the failures this time began with the data itself, or even with the question we chose to ask of it.
I may still get a working solution out of this effort. But in the meantime, I will share the dataset I worked with on my GitHub to provide a more interactive component to this adventure. If you are able to get better results, remember to join us next week to mock me in the live wrap-up of this series. (More details on that at the end.)
After several iterations of tuning the SqueezeBert model we used in our redirected attempt to train on headlines, the resulting model consistently reached 66 percent accuracy in testing, somewhat less than the previously suggested promise of over 70 percent.
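For readers following along with the dataset, here is a minimal sketch of what "accuracy" means for this task: the share of headline pairs where the predicted winner matches the A/B test winner. The function name and the three example pairs are made up for illustration; they are not taken from the Ars data or from the code my AWS help wrote.

```python
def pair_accuracy(predicted, actual):
    """Fraction of headline pairs where the predicted winner matches the real one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one headline pair")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels: which headline ("A" or "B") won each test.
actual_winners = ["A", "B", "B"]
predicted_winners = ["A", "B", "A"]

print(f"accuracy: {pair_accuracy(predicted_winners, actual_winners):.2f}")
```

If winners were split evenly between the two headline slots (an assumption, not something I have checked in the data), guessing at random would land around 50 percent, which is the baseline to keep in mind when reading the 66 and 70 percent figures.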
This included attempts to reduce the size of the steps taken between learning cycles to adjust the model, the "learning rate" hyperparameter that is tuned to avoid overfitting or underfitting of the model. We significantly reduced the learning rate, because when you have a small amount of data (as we do here) and the learning rate is set too high, the model will basically make larger assumptions about the data set's structure and syntax. Reducing the rate forces the model to trade those leaps for small baby steps. Our initial learning rate was set at 2×10⁻⁵ (2E-5); we ratcheted it down to 1E-5.
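To make the "baby steps" idea concrete, here is a toy sketch of plain gradient descent on a one-variable function, run with the two learning rates mentioned above. It is not the SqueezeBert training loop; the function being minimized, the starting weight, and the helper name are all assumptions chosen only to show how the learning rate scales each update.

```python
def update_sizes(learning_rate, start=1.0, steps=3):
    """Run gradient descent on f(w) = w**2 and return the size of each update."""
    w = start
    sizes = []
    for _ in range(steps):
        grad = 2.0 * w                 # derivative of w**2
        delta = learning_rate * grad   # the step actually taken
        w -= delta
        sizes.append(delta)
    return sizes


initial = update_sizes(learning_rate=2e-5)
reduced = update_sizes(learning_rate=1e-5)

for big, small in zip(initial, reduced):
    print(f"2E-5 step: {big:.3e}   1E-5 step: {small:.3e}")
```

With the same gradient, halving the learning rate halves the size of each weight update, which is all the change from 2E-5 to 1E-5 does mechanically. Whether that helps on a small dataset is an empirical question, as the accuracy numbers here suggest.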
We also tried a much larger model that had been pre-trained on a huge amount of text, called DeBERTa (Decoding-enhanced BERT with Disentangled Attention). DeBERTa is a very sophisticated model: 48 Transformer layers with 1.5 billion parameters.
DeBERTa is so fancy that it has outperformed humans at natural language understanding tasks on the SuperGLUE benchmark, the first model to do so.
The resulting deployment package is also quite hefty: 2.9 gigabytes. With all that extra machine learning heft, we got back up to 72 percent accuracy. Considering that DeBERTa is supposedly better than a human at discerning meaning in text, this accuracy is, as a well-known nuclear power plant operator once said, "not great, not terrible."
Deployment death spiral
On top of that, the clock was ticking. I needed to try to get my own version up and running to test it out on real data.
An attempt at a local deployment did not go well, especially from a performance perspective. Without a good GPU available, the PyTorch jobs that ran the model ground to a halt, and the endpoint literally crashed my system.
So I went back to trying to deploy on SageMaker. I tried running the smaller SqueezeBert modeling job on SageMaker on my own, but it quickly got more complicated. Training requires PyTorch, the Python machine learning framework, and a collection of other modules. However, when I imported the various Python modules required into my SageMaker PyTorch kernel, their versions did not match up cleanly despite updates.
As a result, parts of the code that worked on my local server failed, and my efforts became mired in a morass of dependency entanglement. It turned out to be a problem with the version of the NumPy library, except that when I forced a reinstall (pip uninstall numpy, pip install numpy --no-cache-dir), the version was the same, and the error persisted. I finally got it fixed, but then I was met with another error that stopped me cold from running the training job and told me to contact customer service:
ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p3.2xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.
To complete this effort, I had to get Amazon to increase my quota, which is not something I had expected when I started plugging away. It is a simple fix, but troubleshooting the module conflicts ate up most of a day. And the clock ran out on me as I tried to side-step the problem by using the pre-built model my expert help provided and deploying it as a SageMaker endpoint.
This effort is now in extra time. This is where I would have discussed how the model performed when tested against recent headline pairs, if I had ever gotten the model to that point. If I can finally get there, I'll post the outcome in the comments and in a note on my GitHub page.
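On the module conflicts: when a reinstall appears to change nothing, one thing worth checking is which package versions the running kernel actually sees, since a notebook kernel and a terminal `pip` can point at different Python environments. The sketch below uses only the standard library (Python 3.8 or later); the helper name and the list of packages to check are my own choices for illustration, and this is not a record of how the NumPy problem was eventually fixed.

```python
import sys
from importlib import metadata


def installed_versions(packages):
    """Map each package name to the version this interpreter sees, or None."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return found


# Run this inside the notebook kernel, not in a separate terminal.
report = installed_versions(["numpy", "torch"])
print("interpreter:", sys.executable)
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

If the interpreter path printed here differs from the one the terminal `pip` belongs to, running the install as `python -m pip install ...` with that same interpreter is one way to make sure both refer to the same environment.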