This is a deep-learning tool that predicts the location of a Twitter user based solely on the text content of their tweets, without any other form of metadata.
The Twitter Geolocation Predictor is a recurrent neural network classifier. Every training sample is a collection of tweets labeled with a location (e.g. country, state, or city). The model tokenizes all tweets into a sequence of words and feeds them into an embedding layer, which learns a vector representation for each word. The embeddings serve as input to two stacked Long Short-Term Memory (LSTM) layers, and a fully-connected softmax layer at the end yields the classification result.
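For illustration, the architecture corresponds roughly to the Keras sketch below. This is a simplified stand-in, not the tool's actual implementation; the sizes mirror the defaults used by the sample training script further down.

```python
# A rough sketch of the architecture described above, in Keras.
# Layer sizes are illustrative, not the tool's exact configuration.
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size = 20000    # number of distinct words the tokenizer keeps
max_words = 100       # words considered per user (the timestep window)
num_locations = 53    # number of location classes

model = Sequential()
# Map each word index to a dense vector that is learned during training.
model.add(Embedding(input_dim=vocab_size, output_dim=100, input_length=max_words))
# Two stacked LSTM layers; the first returns the full sequence for the second.
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
# Fully-connected softmax layer produces a probability for each location.
model.add(Dense(num_locations, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```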
The tool requires:

- Python 3.5
- tensorflow
- keras
- nltk
- pandas
- numpy
- sqlalchemy
- sklearn
- psycopg2
Clone the repository and install all the dependencies using pip.
$ git clone git@github.com:jmatias/uiuc-twitter-geolocation.git
$ cd uiuc-twitter-geolocation
$ sudo pip3 install -r requirements.txt
This will install the latest CPU version of TensorFlow. If you would like to run on a GPU, follow the TensorFlow GPU installation instructions instead.
The tool comes with a built-in dataset of ~430K users located in the U.S. (~410K for training, ~10K for development and ~10K for testing). To train a model using this dataset, run the train.py sample script.
Note: The dataset has a size of approximately 2.5GB.
$ python3 train.py --epochs 5 --batch_size 32 --vocab_size 20000 --hidden_size 100 --max_words 100 --classifier state
Using TensorFlow backend.
Downloading data from https://dl.dropbox.com/s/ze4ov5j30u9rf5m/twus_test.pickle
55181312/55180071 [==============================] - 11s 0us/step
Downloading data from https://dl.dropbox.com/s/kg09i1z32n12o98/twus_dev.pickle
57229312/57227360 [==============================] - 12s 0us/step
Downloading data from https://dl.dropbox.com/s/0d4l6jmgguzonou/twus_train.pickle
2427592704/2427591168 [==============================] - 486s 0us/step
Building model...
Hidden layer size: 100
Analyzing up to 100 words for each sample.
Building tweet Tokenizer using a 20,000 word vocabulary. This may take a while...
Tokenizing tweets from 59,546 users. This may take a while...
Training model...
Train on 50000 samples, validate on 9546 samples
Epoch 1/1
1664/50000 [..............................] - ETA: 3:59 - loss: 3.8578 - acc: 0.0950 - top_5_acc: 0.2536
You can also load this dataset from your own source code:
In [1]: from twgeo.data import twus_dataset
Using TensorFlow backend.
In [2]: x_train, y_train, x_dev, y_dev, x_test, y_test = twus_dataset.load_state_data()
In [3]: x_train.shape
Out[3]: (410336,)
In [4]: y_train.shape
Out[4]: (410336,)
In [5]: x_train, y_train, x_dev, y_dev, x_test, y_test = twus_dataset.load_state_data(size='small')
In [6]: x_train.shape
Out[6]: (50000,)
In [7]: y_train.shape
Out[7]: (50000,)
To train on your own data, provide each user's tweets as a single text string, with individual tweets separated by an `<eot>` marker, alongside a location label:

Tweet Text | Location |
---|---|
Hello world! This is a tweet. `<eot>` This is another tweet. `<eot>` | Florida |
Going to see Star Wars tonite! | Puerto Rico |
Pizza was delicious! `<eot>` I'm another tweeeeeet `<eot>` | California |
Given a raw dataset stored in a CSV file like the one shown above, you can preprocess the data using `twgeo.data.input.read_csv_data()`. This function will:
- Tokenize the tweet text.
- Limit repeated characters to a maximum of 2. For example: 'Greeeeeetings' becomes 'Greetings'.
- Perform Porter stemming on each token.
- Convert each token to lower case.
The location data may be any string or integer value. For example:

import twgeo.data.input as input

# Read tweets and location labels from the given zero-based column indices.
tweets, locations = input.read_csv_data('mydata.csv', tweet_txt_column_idx=0, location_column_idx=1)
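For reference, the preprocessing steps above behave roughly like the following NLTK sketch (an illustration of the behavior, not the library's actual implementation):

```python
import re

from nltk.stem.porter import PorterStemmer
from nltk.tokenize import TweetTokenizer

_tokenizer = TweetTokenizer()
_stemmer = PorterStemmer()

def preprocess(text):
    # Limit characters repeated more than twice ('Greeeeeetings' -> 'Greetings').
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)
    # Tokenize, Porter-stem, and lower-case each token.
    return [_stemmer.stem(token).lower() for token in _tokenizer.tokenize(text)]

preprocess("Pizza was delicious! I'm another tweeeeeet")
```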
You can then build and train a model:

from twgeo.models.geomodel import Model
from twgeo.data import twus_dataset as twus

# x_train is an array of text. Each element contains all the tweets for a given user.
# y_train is an array of integer values corresponding to the locations we want to train against.
x_train, y_train, x_dev, y_dev, x_test, y_test = twus.load_state_data(size='small')

# num_outputs is the total number of possible classes (locations): here, 50 US states plus 3 territories.
# time_steps is the total number of individual words to consider for each user.
# Some users have more tweets than others; in this example, we cap it at 500 words per user.
geoModel = Model(batch_size=64)
geoModel.build_model(num_outputs=53, time_steps=500, vocab_size=20000)

geoModel.train(x_train, y_train, x_dev, y_dev, epochs=5)
geoModel.save_model('mymodel')
A saved model can be reloaded later to make predictions:

In [1]: from twgeo.models.geomodel import Model
Using TensorFlow backend.
In [2]: from twgeo.data import twus_dataset as twus
In [3]: x_train, y_train, x_dev, y_dev, x_test, y_test = twus.load_state_data(size='small')
In [4]: geoModel = Model()
In [5]: geoModel.load_saved_model('mymodel')
Loading saved model...
In [6]: geoModel.predict(x_test)
Out[6]: array(['CA', 'FL', 'NY', ..., 'TX', 'MA', 'KY'], dtype=object)
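The returned predictions can be scored against ground-truth labels with scikit-learn. The sketch below assumes y_test holds labels in the same format as the values returned by predict(); if the loader yields integer class indices instead, map them to state codes first:

```python
from sklearn.metrics import accuracy_score

# Fraction of users whose predicted location matches the true label.
# Assumes y_test and geoModel.predict(x_test) use the same label format.
predictions = geoModel.predict(x_test)
print("Test set accuracy: {:.2%}".format(accuracy_score(y_test, predictions)))
```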
The built-in TWUS dataset was used to train US State and US Census Region classifiers. Using a hidden layer size of 300 neurons, a timestep window of 500 words, and a vocabulary size of 50,000 words, the model achieves the following results.
Classification Task | Test Set Accuracy | Test Set Accuracy @ 5 |
---|---|---|
US Census Region | 73.95% | N/A |
US State | 51.44% | 75.39% |
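Here "Accuracy @ 5" is top-5 accuracy, the top_5_acc metric shown in the training output above: a sample counts as correct if the true state is among the model's five most probable classes. A minimal sketch of that metric, assuming access to a per-class probability matrix (this helper is illustrative, not part of the twgeo API):

```python
import numpy as np

def top_5_accuracy(class_probs, y_true):
    # class_probs: (n_samples, n_classes) predicted probabilities.
    # y_true: (n_samples,) integer class labels.
    top5 = np.argsort(class_probs, axis=1)[:, -5:]  # 5 most probable classes per sample
    hits = [label in row for label, row in zip(y_true, top5)]
    return float(np.mean(hits))
```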