Introduction
In this article, I'll show you how to easily build a custom sentiment analysis model from scratch, turn it into a cool-looking GitHub project, and publish it as a PyPI package.
I assume that you have basic knowledge of sentiment analysis and NLP. Let's dive into it.
Select your dataset
What is the first thing you need to build any data science project? Data. So, select your desired dataset. Don't worry if you haven't thought about it yet; you are free to pick any of the sentiment analysis datasets available on Kaggle. For my project, I selected the popular Twitter sentiment analysis dataset, which contains 1.4 million labeled tweets. Now that you have the data, let's see what to do with it next.
Build your corpus
In this section, I will do some preprocessing on the data and build a text corpus out of it. For this, you need to split each input sentence into words and clean the data a little. For the sake of simplicity, I will only remove punctuation from the tweets and then build a corpus out of the dataset.
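Here is a minimal sketch of this step; the file name tweets.csv and the text column are placeholders for whichever dataset you chose:

```python
import string

import pandas as pd

def clean_text(text: str) -> str:
    # Strip punctuation and lowercase the tweet
    return text.translate(str.maketrans("", "", string.punctuation)).lower()

df = pd.read_csv("tweets.csv")  # placeholder path to your dataset
df["text"] = df["text"].astype(str).apply(clean_text)

# The corpus is the list of cleaned, tokenized tweets
corpus = [tweet.split() for tweet in df["text"]]
```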
Now you have a text corpus, which is a list of words drawn from the input samples. The distinct words in your corpus constitute the input vocabulary. Next, you can represent each distinct word in your corpus with a unique integer index. For example, if the word great is indexed as word number 100, every occurrence of that word will be substituted by the number 100.
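As a toy illustration of this indexing (not the exact code from my notebook), you could build the mapping by hand:

```python
# Assign each distinct word in the corpus a unique integer; 0 is reserved for padding
vocab = {word: i + 1 for i, word in enumerate(sorted({w for tweet in corpus for w in tweet}))}

# Encode each tweet as a sequence of integer indices
encoded = [[vocab[word] for word in tweet] for tweet in corpus]
```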
So you have represented all the words in the corpus using numbers; what next? The number of tokens in a tweet can be anywhere between one and a hundred, so the input sentences have variable lengths. Next, you have to extend each tokenized sentence to a predefined length by filling 0s at its left or right end. This process is called padding, and you can implement it easily using the Keras tokenizer, as shown below.
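A sketch using the Keras tokenizer, assuming a maximum length of 100 tokens (pick whatever suits your data):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

MAX_LEN = 100  # assumed maximum tweet length

tokenizer = Tokenizer()
tokenizer.fit_on_texts(df["text"])                    # build the word-to-index vocabulary
sequences = tokenizer.texts_to_sequences(df["text"])  # replace each word with its index
padded = pad_sequences(sequences, maxlen=MAX_LEN)     # pad with 0s to a fixed length
```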
Embedding
Text embeddings are a form of word representation in NLP in which semantically similar words are represented by similar vectors, so that they lie close to each other in an n-dimensional space.
Embedding-based Python packages use this form of text representation to predict sentiment, which typically leads to better model performance.
There is a wide range of pre-trained embeddings to choose from; Word2Vec, GloVe, and fastText are among the most popular.
For my project, I decided to use GloVe Twitter (200 dimensions), which is pre-trained on a Twitter corpus; each word is represented by a 200-dimensional vector. I used the torchtext module to load and process the text embeddings into an embedding matrix.
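Here is a sketch of how this can be done with torchtext, building the embedding matrix against the tokenizer vocabulary from the previous step:

```python
import numpy as np
from torchtext.vocab import GloVe

EMBED_DIM = 200

# Downloads the 200-dimensional GloVe vectors pre-trained on Twitter on first use
glove = GloVe(name="twitter.27B", dim=EMBED_DIM)

vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding
embedding_matrix = np.zeros((vocab_size, EMBED_DIM))

for word, idx in tokenizer.word_index.items():
    if word in glove.stoi:
        # Copy the pre-trained vector; words missing from GloVe stay as zero vectors
        embedding_matrix[idx] = glove[word].numpy()
```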
Model Architecture
Now that you have the text data transformed into the required format, you need a model. In this section, I will build a model using gated recurrent units (GRUs). GRUs have proven to work well on sequence problems. If you're not very familiar with them, refer to this nice article, which demonstrates how they work.
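Here is a minimal PyTorch sketch of such a model, initialized with the embedding matrix built above; the layer sizes are illustrative, not the exact ones from my notebook:

```python
import torch
import torch.nn as nn

class SentimentGRU(nn.Module):
    def __init__(self, embedding_matrix, hidden_dim=128):
        super().__init__()
        vocab_size, embed_dim = embedding_matrix.shape
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Start from the pre-trained GloVe vectors built earlier
        self.embedding.weight.data.copy_(torch.tensor(embedding_matrix, dtype=torch.float))
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        embedded = self.embedding(x)    # (batch, seq_len, embed_dim)
        _, hidden = self.gru(embedded)  # hidden: (1, batch, hidden_dim)
        return torch.sigmoid(self.fc(hidden.squeeze(0)))  # probability of positive sentiment
```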
Training and evaluation
Now you have everything you need, so let's split the data into train/validation/test sets and train the model. After training, the best model weights are saved for inference.
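A sketch of the split and checkpointing, continuing from the earlier snippets and assuming labels holds the 0/1 sentiment targets:

```python
import torch
from sklearn.model_selection import train_test_split

# 80/10/10 split into train, validation, and test sets
X_train, X_rest, y_train, y_rest = train_test_split(padded, labels, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

model = SentimentGRU(embedding_matrix)

# ... training loop goes here, tracking validation accuracy ...

# Save the checkpoint that scored best on the validation set
torch.save(model.state_dict(), "best_model.pt")
```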
The complete training notebook is available here.
Now that I have successfully trained a model, what next?
You cannot present a notebook as a project to someone and expect them to be impressed by it. In the upcoming sections of this article, I will show you how to convert this notebook into a super cool-looking Python package in less than 10 hours.
Create a project repository
First, head over to GitHub and create a repository with your desired project name. For mine, I used this keyword research tool to rephrase my project name for SEO.
Now that you have a project repository, let's add a good-looking README with a logo for your project. The README should be precise and neatly formatted. There are many free logo design tools available online; you can use any of them to get the job done. Remember not to spend a lot of time on this! You can also add badges and GIFs if you want to make it a little more attractive.
Setup pre-commit hooks
When you create an open-source project, clean code and documentation are as important as anything else, because you need people to easily understand what you have done. For this, I usually use black and flake8 for code formatting. Even with these packages installed, you might forget to run them before each commit; this is where pre-commit hooks come in. Once installed and initialized, they make sure code formatting is checked before each commit. So, let's see how to set it up:
- Install pre-commit: $ pip install pre-commit
- Define .pre-commit-config.yaml like the example after this list.
- Set up the config for black.
- Set up the config for flake8.
- Run $ pre-commit install to install the hooks into your .git/ directory.
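A minimal .pre-commit-config.yaml along these lines works; pin whichever revisions you prefer:

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 22.3.0
    hooks:
      - id: black
  - repo: https://github.com/pycqa/flake8
    rev: 4.0.1
    hooks:
      - id: flake8
```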
GitHub Actions
GitHub Actions makes it easy to automate workflows, build, test, and deploy code on GitHub. For this project, let's use it to run some checks automatically before anything is merged into the master branch. I added checks to verify that the code is formatted with black and flake8; you can add more tests like this. Refer here to learn more about GitHub Actions. Now, let's add our tests.
Add a YAML file like the one below to configure your GitHub Actions workflow. For my project, I made a check that verifies incoming PRs to the master branch are formatted with black and flake8. You can add more actions as you like.
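Here is a sketch of such a workflow file; the file name and Python version are illustrative:

```yaml
# .github/workflows/lint.yml
name: lint

on:
  pull_request:
    branches: [master]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"
      - run: pip install black flake8
      - run: black --check .   # fail if black would reformat anything
      - run: flake8 .          # fail on lint errors
```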
Refactor your code
In this section, I will show you how to refactor your code using object-oriented programming concepts (I assume you have basic knowledge of OOP). You should have a parent class that you use to initialize your Python package, with all of the package's functionality defined there. I structured my project like the sketch below.
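For illustration, the parent class could look something like this; the class and method names here are hypothetical, not the exact ones from my repository:

```python
# twittersentiment/core.py -- an illustrative skeleton
class TwitterSentiment:
    """Parent class exposing the package's public functionality."""

    def __init__(self):
        self.model = None
        self.tokenizer = None

    def load(self):
        """Download the model weights and tokenizer on first use."""
        ...

    def predict(self, text):
        """Return the predicted sentiment for a single tweet."""
        ...
```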
Once you're done with this, start adding comments to your code to make it more readable. Below each function, document its purpose, arguments, and expected output. This drastically improves the quality of your project.
Upload model weights
You have the model weights on your local computer; to make them public, you can upload them as a release file in your repository. When initialized, the model will fetch the weights from there and use them for making predictions. For this, you can use model_zoo in PyTorch or equivalent methods in TensorFlow.
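A sketch of fetching the released weights with PyTorch's model_zoo; the release URL is a placeholder for your own repository:

```python
import torch.utils.model_zoo as model_zoo

# Placeholder URL; point this at your repository's release asset
WEIGHTS_URL = "https://github.com/<user>/<repo>/releases/download/v0.0.1/model.pt"

state_dict = model_zoo.load_url(WEIGHTS_URL, map_location="cpu")  # cached after first download
model = SentimentGRU(embedding_matrix)
model.load_state_dict(state_dict)
model.eval()  # switch to inference mode
```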
After uploading the model weights, test your project with them and make sure that everything works as intended.
Wrap it up
Now, let's wrap this up and build a PyPI package for our project. Trust me, it's no big deal! Just follow these steps:
- setup.py: First, you need to configure the setup file. setup.py is the build script for setuptools; it tells setuptools about your package (such as the name and version) and which code files to include. You can take my setup file as an example (see the sketch after this list); for further reference, see the docs.
- MANIFEST.in: To include extra files with your source code, such as the pickled tokenizer file, add them to this file as I have done here. For example, to include the README.md file, add:
include README.md
For more information, refer to the docs.
- requirements.txt: Add all the required dependencies of your project to this file; you can generate it easily using
$ pip freeze > requirements.txt
- __init__.py: Choose a release version number and add it here, e.g. __version__ = "0.0.1".
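Putting these together, a minimal setup.py for this project could look like the following sketch; the metadata values are illustrative:

```python
from setuptools import find_packages, setup

setup(
    name="twittersentiment",
    version="0.0.1",
    description="Sentiment analysis for tweets",  # illustrative metadata
    long_description=open("README.md").read(),
    long_description_content_type="text/markdown",
    packages=find_packages(),
    include_package_data=True,  # pick up the extra files listed in MANIFEST.in
    install_requires=open("requirements.txt").read().splitlines(),
    python_requires=">=3.6",
)
```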
Publishing to PyPI
If you don't have a PyPI account and a TestPyPI account, it's time to create both.
- First, install the twine package.
$ pip install twine
- Build your package,
$ python setup.py sdist bdist_wheel
- Check that all the required files are present:
$ tar tzf twittersentiment-0.0.1.tar.gz
- Check that the files render properly:
$ twine check dist/*
The checks should pass if everything is fine.
- Upload to TestPyPI:
$ twine upload --repository-url https://test.pypi.org/legacy/ dist/*
Now view the project on TestPyPI and make sure that everything is fine.
- Publish your package
$ twine upload dist/*
Congrats! You have successfully published your open-source Python package to PyPI.
Final thoughts
In this article, I showed you how to turn simple notebook code into a super cool open-source project. You can check out my project here. Hopefully, this will be useful for your own projects.