Keeps a copy of a user's tweets in a MySQL database.
twitstash requires a PHP 5.3 installation with the following extensions installed:
curl
- Required by thetwitteroauth
library for making HTTP requestsbcmath
- Deals with Twitter's 64-bit IDs represented as stringspdo
- Provides a database interfacepdo-mysql
- PDO driver for MySQL
Additionally, a working MySQL installation will be needed. Any recent version
will work. (I personally had issues inserting emoji characters on MySQL 5.1.51,
but I didn't pursue it too far. If you encounter a similar problem, an easy hack
is to set the text
column to VARBINARY(140)
instead of VARCHAR(140)
.)
-
git clone --recursive https://github.com/smitelli/twitstash.git && cd twitstash
-
Ensure you have an empty database created for the stored tweets, and that there is a user account that can update this database.
-
mysql --user=YOUR_USER --password YOUR_DATABASE < twitstash.sql
-
cp config.ini-sample config.ini
-
Edit
config.ini
to suit your fancy. You'll have to put your own Twitter and MySQL authentication in there. -
chmod a+x twitstash.sh
-
./twitstash.sh
On the initial run, twitstash will incrementally download older and older tweets until Twitter stops providing them. At the time of this writing, the most recent 3,200 tweets can be retrieved. After the initial run, twitstash will stop making API requests once it encounters a tweet that it has previously stored.
The shell script is designed to never output anything, so you can add it in a
cron job without worrying about spamming root's inbox with junk. A file called
debug.log
will be created (and appended) by the shell script. There's
generally nothing useful in that file.
Data collected by twitstash is kept in three tables. The following is a list of columns and what types of data you can expect in each:
id
- Twitter's ID number for this tweet.created_at
- The date and time this tweet was created. This time is stored relative to thetimezone
setting in config.ini.text
- The raw text of this tweet. There may be t.co URLs in here -- the expanded URL is stored in theurls.expanded_url
column.source
- The URL and name of the client that created this tweet, typically expressed as an HTML anchor tag.reply_id
- The ID number of the tweet that this tweet is replying to. 0 if this tweet is not a reply.rt_id
- The ID number of the tweet that was retweeted. 0 if this tweet is not a retweet.place_id
- A 16-digit hex string identifying the "place" associated with this tweet. JOINing this column onplaces.id
will return a lot more info. Also see https://dev.twitter.com/docs/api/1.1/get/geo/id/%3Aplace_idlatitude
- The exact point latitude reported by the client. North is positive, south is negative. 0 if this tweet had no geo data.longitude
- The exact point longitude reported by the client. East is positive, west is negative. 0 if this tweet had no geo data.touched
- Used internally to determine if any tweets have "disappeared" and should be marked as deleted. Always 0, unless the script died mid-run.deleted
- Estimated date/time a tweet was deleted. The more frequently twitstash is run, the more accurate this column becomes. NULL if this tweet was never deleted (or, at least, it was not deleted when we last checked).
id
- A 16-digit hex string identifying this place.place_type
- One of 'city', 'neighborhood', or 'admin'. There may be more.full_name
- The full text name of this place.country
- The full text name of the country that contains this place.centroid_lat
- Latitude, averaging all points of the bounding box.centroid_lon
- Longitude, averaging all points of the bounding box.
See tweets.latitude
and tweets.longitude
for info on the exact format of the
latitude/longitude pairs.
url
- The shortened t.co link present in a tweet's text.expanded_url
- The URL that the shortened link points to.
This package includes Abraham Williams' twitteroauth
library.
https://github.com/abraham/twitteroauth