Throughout the last five days I have been sprinting forward trying to really understand how to use the GPT-2 model. It took me a lot longer to dig into this one than my normal mean time to solve. Documentation on these things is challenging because of two distinct factors that increase complexity. First, the instructions are typically not purely step by step for the reader. You have to have some understanding of what you are doing to be able to work the instructions to conclusion. Second, the instructions were written at a specific point in time, and the dependency changes, version bumps, and deprecations that have happened since are daunting to overcome. At the heart of the joy that Jupyter Notebooks create is the ability to do something rapidly and share it. Environmental dependencies change over time, and once-working notebooks slowly drift from being useful to being a time capsule of now perpetually failing code. That in some ways is the ephemeral nature of the open source coding world that is currently expanding. Things work in the moment, but you have to ruthlessly maintain and upgrade to stay current on the open source wave of change.
My argument above is not an indictment of open source or of code that depends on specific versions and libraries. Things just get real over time as a code base that was once bulletproof proves to be dependent on something that was deprecated. Keep in mind that my journey to use the GPT-2 model included working with a repository that was published on GitHub just 15 months ago with very limited documentation. The file with developer instructions did not include a comprehensive environment list. I was told that this is why people build Docker containers, which can be a snapshot in time deployed again and again to essentially freeze time. That is not how I work in real time when I’m actively developing. My general use case is to sit down and work with the latest version of everything. That might not be a good idea, as code is generally not assumed to be future proof. An environmental dependency file would be a signpost that lets future developers know exactly where things stood when this code base was shared to GitHub.
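As a minimal sketch of what such a dependency signpost could look like, the snippet below records every installed package and its version using only the Python standard library (Python 3.8 or newer). The function names and the default filename are hypothetical choices for illustration, not anything taken from the repository discussed here. Running "pip freeze > requirements.txt" from the command line gets to roughly the same place.

```python
from importlib import metadata


def snapshot_environment():
    """Return sorted 'name==version' lines for every installed distribution."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.version
        if name and version:  # skip broken installs with missing metadata
            lines.add(f"{name}=={version}")
    return sorted(lines, key=str.lower)


def write_snapshot(path="environment-snapshot.txt"):
    """Write the snapshot to a text file and return how many packages it lists.

    The default filename is a hypothetical choice for this sketch.
    """
    lines = snapshot_environment()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return len(lines)


# Example use inside a notebook cell:
#   write_snapshot("environment-snapshot.txt")
```

Committing a file like that next to a notebook would give a later reader a fighting chance at rebuilding the environment the code last worked in.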
Digging into the code base for the last five days has been fun and full of learning. Digging into something for me involves opening the code up in Microsoft Visual Studio Code and trying to understand each block that was shared. The way I learned to tinker with and edit Python code was one programming debugging session at a time. I’ll admit that learning was a lot easier in a Jupyter Notebook environment. That allows you to pretty much run each section one after another and see any errors that are spit out, so you can work to debug the code to get to that perfect possible future of working code. Oh that resplendent moment of working code where you move on to the next problem. It is a wonderful feeling of accomplishment to see code work. It is a supremely frustrating feeling to watch errors flood the screen or, even worse, to get nothing in return beyond obvious failure. Troubleshooting general failure is a lot harder than working to resolve a specific error. Right now between the two sessions of my Google Chrome browser I have maybe 70 tabs open. On reboot it is so bad that I end up having to go to browser settings, history, and recently closed to bulk reopen this massive string of browser tabs that at one point were holding my attention.
One of the best features I learned about in GitHub was the ability to search for recently updated repositories. To accomplish that I searched for what I was looking for and then sorted the results by last update. Based on the problems described above, that type of searching was highly useful for learning the right environmental setup necessary to do the other things I wanted in a Google Colab notebook. On a side note, when somebody publishes to GitHub using a notebook from Google Colab, enough bread crumbs exist to find interesting use cases by searching for “colab” plus whatever you are looking for from the main page of GitHub. Out of pure frustration with learning how to set up the environment, I used searches filtered to most recently updated for “colab machine learning” and “colab gpt” to get going. Out of that frustration I learned something useful about just looking around to see what people are actively working on and sharing on GitHub. My searching involved looking at a lot of code repositories that did not have any stars, reviews, or interactions. As my GPT skills improve I’ll make suggestions for some of those repositories on how to get their code bases working again, now that a lot of them are throwing massive numbers of errors that essentially conclude in “ModuleNotFoundError: No module named 'tensorflow.contrib'.” That error is truly deflating when it appears. Given how important that module is to a lot of models and code, I probably would have built some handling for it into base TensorFlow, given that it was intentionally deprecated.
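One way to see that error coming, sketched here under the assumption that the failing notebooks were written against the TensorFlow 1.x line (tensorflow.contrib was removed in TensorFlow 2.0), is to check the installed version before running anything else. The helper names below are hypothetical and only illustrate the idea.

```python
def contrib_expected(version: str) -> bool:
    """Return True when a TensorFlow version string is older than 2.0.

    tensorflow.contrib only shipped with releases before 2.0, so any
    code importing it needs one of those older versions.
    """
    major = int(version.split(".")[0])
    return major < 2


def describe_installed_tensorflow() -> str:
    """Report whether the installed TensorFlow can still provide contrib."""
    try:
        import tensorflow as tf  # imported lazily; may not be installed
    except ImportError:
        return "TensorFlow is not installed"
    if contrib_expected(tf.__version__):
        return f"TensorFlow {tf.__version__}: tensorflow.contrib should import"
    return f"TensorFlow {tf.__version__}: tensorflow.contrib has been removed"


# Example use at the top of a notebook:
#   print(describe_installed_tensorflow())
```

A check like this does not fix an old repository, but it turns a wall of traceback into a one-line explanation of which TensorFlow version the code is actually looking for.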
My next big adventure will be to take the environmental setup necessary to get the GPT-2 model working and work out the best method to ingest my corpus of 20 years’ worth of my writing and see what it spits out as the next post. That has been my main focus in learning how to use this model and potentially even learning how to use the GPT-3 model that was released earlier this week by OpenAI. Part of the fun of doing this is not messing with it locally on my computer and creating a research project that cannot be reproduced. Within what I’m trying to do, the fun will be releasing the Jupyter notebook and the corpus file so that other researchers can build more complex models based on my large writing database, or verify the results by reproducing the steps taken in the notebook. That is the really key part of the whole thing. Giving somebody the tools to freely reproduce the research on Google Colab without any real limitations is a positive step forward in research quality. Observing a phenomenon and being able to describe it is great. Being able to reproduce the phenomenon being described is how the scientific method can be applied to the effort.
Getting back into the groove of writing and working on things really just took a real and fun challenge to kickstart. Having a set of real work to complete always makes things a little bit easier and clearer. Instead of thinking about the possible you end up thinking about the pathing to get things done. Being focused on inflight work has been a nice change of direction. Maybe I underestimated how much a good challenge would improve my quarantine experience. Things have been a little weird since March, when the quarantine came into being, and it is about to be June on Monday. That is something to consider in a moment of reflection.
I have been actively working in the Google Colab environment and on my Windows 10 Corsair Cube to really understand the GPT-2 model. My interest in that has been pretty high the last couple of days. I started out working locally in Windows, and after that became frustrating I switched over to using GCP hardware via the Google Colab environment. One of the benefits of switching over is that instead of trying to share a series of commands and some notes on what happened, I can work out of a series of Jupyter notebooks. They are easy to share, download, and most importantly to create from scratch. The other major benefit of working in the Google Colab environment is that I can dump everything and reset the environment. Being able to share the notebook with other people is important. That allows me to actively look at and understand other methods being used.
One of the things that happened after working in Google Colab for a while was that the inactivity timeouts made me sad. I’m not the fastest Python coder in the world. I frequently end up trying things and moving along very quickly for short bursts that are followed by longer periods of inactivity while I research an error, think about what to do next, or wonder what went wrong. Alternatively, I might be happy that something went right, and that might create enough of a window that a timeout occurs. At that point, the Colab environment’s connection to the underlying hardware in the cloud drops off and things have to be restarted from the beginning. That is not a big deal unless you are in the middle of training something and did not have proper checkpoints saved off to preserve your efforts. I ended up subscribing to Google’s Colab Pro, which apparently has faster GPUs, longer runtimes (fewer idle timeouts), and more memory. At the moment, the subscription costs $9.99 a month, and that seems reasonable to me based on my experiences so far this week.
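The checkpoint idea can be sketched in a few lines of standard-library Python. This is only an illustration of the pattern under simple assumptions: the function names are hypothetical, the saved state is just a step counter, and the training loop is a stand-in for real work. In Colab the checkpoint path would also need to live somewhere that survives a reset, such as a mounted Google Drive folder, because the local disk of the runtime goes away with it.

```python
import json
import os


def save_checkpoint(state, path):
    """Write to a temp file first, then swap it in, so a dropped connection
    in the middle of a write does not leave a half-written checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def train(total_steps, path, save_every=100):
    """Stand-in loop that resumes from the last checkpoint it can find."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        # ... one real training step would run here ...
        state["step"] = step + 1
        if state["step"] % save_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)  # always save the final state
    return state
```

With something like this in place, a timeout costs at most the work done since the last save instead of the whole session.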
Anyway, I was actively digging into the GPT-2 model and making good progress in Google Colab, and then on May 28 the OpenAI team dropped another model called GPT-3 with a corresponding paper, “Language Models are Few-Shot Learners.” That one is different and has proven a little harder to work with at the moment. I’m slowly working on a Jupyter notebook version.
Throughout the last few days I have been devoting all my spare time to learning about and working with the GPT-2 model from OpenAI. They published a paper about the model and it makes for an interesting read. The more interesting part of the equation is actually working with the model and trying to understand how it was constructed and working with all the moving parts. My first efforts were to install it locally on my Windows 10 box. Every time I do that I always think it would have been easier to manage in Ubuntu, but that would be less of a challenge. I figured giving Windows 10 a chance would be a fun part of the adventure. Giving up on Windows has been getting easier and easier. I actually ran Ubuntu Studio as my main operating system for a while with no real problems.
My training data set for my big GPT-2 adventure is everything published on my weblog. That includes about 20 years of content. The local copy of the original Microsoft Word document with all the formatting was 217,918 kilobytes, whereas the text document version dropped all the way down to 3,958 kilobytes. I did go and manually open the text document version to make sure it was still readable and structured content.
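That manual check can be backed up with a quick sanity pass in Python. The sketch below is only an illustration: the function name is hypothetical, and it assumes the text export is UTF-8, which may not hold for a file saved out of Word on Windows. It reports the size in kilobytes, the line and word counts, and how many characters failed to decode.

```python
import os


def corpus_stats(path):
    """Summarize a plain-text corpus file without changing it."""
    size_kb = os.path.getsize(path) / 1024
    # errors="replace" swaps undecodable bytes for U+FFFD so they can be counted
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        text = handle.read()
    line_count = text.count("\n")
    if text and not text.endswith("\n"):
        line_count += 1  # count a final line that has no trailing newline
    return {
        "size_kb": round(size_kb, 1),
        "lines": line_count,
        "words": len(text.split()),
        "undecodable_chars": text.count("\ufffd"),
    }


# Example use with the corpus file name from the commands in this post:
#   print(corpus_stats("nlindahl.txt"))
```

A large count of undecodable characters would be a hint that the export used a different encoding and should be converted before training on it.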
The first problem is probably easily solved, and it is related to a missing module named “numpy”:
PS F:\GPT-2\gpt-2-finetuning> python encode.py nlindahl.txt nlindahl.npz
Traceback (most recent call last):
  File "encode.py", line 7, in <module>
    import numpy as np
ModuleNotFoundError: No module named 'numpy'
Resolving that required a simple “pip install numpy” in PowerShell. That got me all the way to line 10 in the encode.py file, where this new error occurred:
PS F:\GPT-2\gpt-2-finetuning> python encode.py nlindahl.txt nlindahl.npz
Traceback (most recent call last):
  File "encode.py", line 10, in <module>
    from load_dataset import load_dataset
  File "F:\GPT-2\gpt-2-finetuning\load_dataset.py", line 4, in <module>
    import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'
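Solving this one required a similar method in PowerShell, “pip install --upgrade pip install https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.8.0-py3-none-any.whl”, that also included a specific path to tell pip where to get TensorFlow. Looking back, that URL points at the macOS CPU build of TensorFlow 1.8.0, which is not the right wheel for a Windows machine.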
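Hitting those two errors one at a time suggests a small preflight check that lists every missing module up front. The sketch below is one way to do that with the standard library. The function name is hypothetical, and the module list in the usage comment simply mirrors the two imports that failed above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    missing = []
    for name in names:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


# Example use before running encode.py:
#   print(missing_modules(["numpy", "tensorflow"]))
```

It only confirms that a module can be found, not that the installed version is the one the code expects, so it complements a pinned environment list and does not replace it.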
I gave up on that path and went a different route…
Getting the GPT-2 model set up on this Windows 10 machine was not as straightforward as I had hoped it would be yesterday. Python got upgraded, CUDA got upgraded, cuDNN got installed, and some flavor of the C++ build tools got installed on this machine. Normally when I elect to work with TensorFlow I boot into an Ubuntu instance instead of trying to work with Windows. That is where I am more proficient at managing and working with installations and things. I’m also a lot more willing to destroy my Ubuntu installation and spin up another one to start whatever installation steps I was working on again from the start in a clean environment. My Windows installation here has all sorts of things installed on it, and some of them were in conflict with my efforts to get GPT-2 running. In fairness to my efforts yesterday, I only had a very limited amount of time after work to figure it all out. Time ran out with the installation completed via the steps on GitHub, but no magic was happening. That was a truly disappointing scenario to have happened.