Data analysis libraries for Python keep getting better and better. pandas
is now on 0.15, scikit-learn
continues to pick up converts, and a number of
visualization libraries have started to emerge: for example, our own baby, ggplot
,
and others like seaborn.
But there's still one subject that's practically synonymous with data analysis
where there aren't any new, killer libraries: databases.
There are some great Python database libraries (SQLAlchemy instantly comes to mind), but none are focused on using Python as a means of exploring and analyzing data.
Introducing db.py
So today we're introducing db.py, a library
built for doing exactly that! db.py
does handy stuff like:
- Manages/remembers your db connection strings
- Lets you easily search through your schema
- Allows for tab-completion for table and column names
- Integrates with IPython,
pandas
, and is built for interactive programming - Works with most relational databases
To get started, just run:
$ pip install db.py
Connecting to your Database
First things first: you need to connect to your database. This can be much
tricker than it should be. Dealing with connection strings is never fun--how
many times have you forgotten your password (or even your username) for a
database. db.py
makes it as straight forward as possible and will always
revert to "reasonable defaults". For example, the default MySQL port is 3306. If
you forget to specify a port, db.py
will use 3306 as a default.
>>> from db import DB
>>> db = DB(username="kermit", password="rainbowconnection", hostname="db.themuppets.com",
dbname="muppetdb", dbtype="mysql")
Refreshing schema. Please wait...done!
>>> db
DB[mysql][db.themuppets.com]:3306 > kermit@muppetdb
db.py
will also save your credentials so you don't have to paste them into
every script. Now you'll never have to remember any of that connection stuff
again.
>>> db.save_credentials(profile="muppets")
Open a new sesion, load your profile, and voilà!
>>> from db import DB
>>> db = DB(profile="muppets")
Refreshing schema. Please wait...done!
>>> db
DB[mysql][db.themuppets.com]:3306 > kermit@muppetdb
Exploring Tables and Columns
Ever forget the name of a column? Or even forget the names of what table you're
trying to query? Me too. db.py
comes with a set of functions for navigating
your database's schema.
NOTE: For this post we're going to use the Chinook Database.
It ships with db.py
and is commonly used as an example database.
>>> from db import DemoDB # this is the Chinook DB and is used in the db.py docs/examples
>>> db = DemoDB()
Indexing schema. This will take a second...finished!
>>> db.find_table("A*")
+--------+--------------------------+
| Table | Columns |
+--------+--------------------------+
| Album | AlbumId, Title, ArtistId |
| Artist | ArtistId, Name |
+--------+--------------------------+
Ever forget how to join your company's prod_users
, prod_trans_monthly
,
and prod_accounts
tables together? (a better question might be: have you
ever remembered?)
No worries, db.py
has got you covered!
>>> db.tables.Artist
+----------------------------------------------------------+
| Artist |
+----------+---------------+--------------+----------------+
| Column | Type | Foreign Keys | Reference Keys |
+----------+---------------+--------------+----------------+
| ArtistId | INTEGER | | Album.ArtistId |
| Name | NVARCHAR(120) | | |
+----------+---------------+--------------+----------------+
>>> db.find_column("*Id")
+---------------+---------------+---------+
| Table | Column Name | Type |
+---------------+---------------+---------+
| Album | AlbumId | INTEGER |
| Album | ArtistId | INTEGER |
| Artist | ArtistId | INTEGER |
| Customer | SupportRepId | INTEGER |
| Customer | CustomerId | INTEGER |
| Employee | EmployeeId | INTEGER |
| Genre | GenreId | INTEGER |
| Invoice | InvoiceId | INTEGER |
| Invoice | CustomerId | INTEGER |
| InvoiceLine | InvoiceId | INTEGER |
| InvoiceLine | TrackId | INTEGER |
| InvoiceLine | InvoiceLineId | INTEGER |
| MediaType | MediaTypeId | INTEGER |
| Playlist | PlaylistId | INTEGER |
| PlaylistTrack | TrackId | INTEGER |
| PlaylistTrack | PlaylistId | INTEGER |
| Track | MediaTypeId | INTEGER |
| Track | TrackId | INTEGER |
| Track | AlbumId | INTEGER |
| Track | GenreId | INTEGER |
+---------------+---------------+---------+
Tab Completion
db.py
also identifies and loads schema info about all of the tables in your
database and makes them attributes of the DB()
. It works with IPython's
type-ahead functionality, so when you're mucking around in IPython Notebook
or in Terminal, you'll be able to see all of the tables and columns in your
database.
You can also take a look at all of your tables by accessing the tables
variable.
>>> db.tables
+---------------+----------------------------------------------------------------------------------+
| Table | Columns |
+---------------+----------------------------------------------------------------------------------+
| Album | AlbumId, Title, ArtistId |
| Artist | ArtistId, Name |
| Customer | CustomerId, FirstName, LastName, Company, Address, City, State, Country, PostalC |
| | ode, Phone, Fax, Email, SupportRepId |
| Employee | EmployeeId, LastName, FirstName, Title, ReportsTo, BirthDate, HireDate, Address, |
| | City, State, Country, PostalCode, Phone, Fax, Email |
| Genre | GenreId, Name |
| Invoice | InvoiceId, CustomerId, InvoiceDate, BillingAddress, BillingCity, BillingState, B |
| | illingCountry, BillingPostalCode, Total |
| InvoiceLine | InvoiceLineId, InvoiceId, TrackId, UnitPrice, Quantity |
| MediaType | MediaTypeId, Name |
| Playlist | PlaylistId, Name |
| PlaylistTrack | PlaylistId, TrackId |
| Track | TrackId, Name, AlbumId, MediaTypeId, GenreId, Composer, Milliseconds, Bytes, Uni |
| | tPrice |
+---------------+----------------------------------------------------------------------------------+
IPython and pandas
db.py
is built with popular Scientific Python tools in mind--namely
pandas
and IPython. We've tried to make it as friendly as possible for those
of you doing interactive computing: this means friendly repr's!
>>> db.tables.Artist
Column | Type | Foreign Keys | Reference Keys |
---|---|---|---|
ArtistId | INTEGER | Album.ArtistId | |
Name | NVARCHAR(120) |
>>> db.tables.Track
Column | Type | Foreign Keys | Reference Keys |
---|---|---|---|
TrackId | INTEGER | InvoiceLine.TrackId, PlaylistTrack.TrackId | |
Name | NVARCHAR(200) | ||
AlbumId | INTEGER | Album.AlbumId | |
MediaTypeId | INTEGER | MediaType.MediaTypeId | |
GenreId | INTEGER | Genre.GenreId | |
Composer | NVARCHAR(220) | ||
Milliseconds | INTEGER | ||
Bytes | INTEGER | ||
UnitPrice | NUMERIC(10,2) |
>>> df = db.tables.Artist.all()
>>> df.head()
ArtistId | Name | |
---|---|---|
0 | 1 | AC/DC |
1 | 2 | Accept |
2 | 3 | Aerosmith |
3 | 4 | Alanis Morissette |
4 | 5 | Alice In Chains |
Querying data
Executing queries in libraries like psycopg2
can be frustrating. Managing
connections, cursors, and rollbacks isn't exactly intuitive. db.py
operates on
a much higher level, and makes executing queries simple.
>>> db.query("select * from Artist limit 10;")
ArtistId Name
0 1 AC/DC
1 2 Accept
2 3 Aerosmith
3 4 Alanis Morissette
4 5 Alice In Chains
5 6 Antônio Carlos Jobim
6 7 Apocalyptica
7 8 Audioslave
8 9 BackBeat
9 10 Billy Cobham
>>> with open("myscript.sql", "w") as f:
... f.write("select * from Artist;")
>>> df = db.query_from_file("myscript.sql")
Managing Credentials
You know those really obscure database hostnames and passwords that your sys
admin sets up? Well instead of copying and pasting them into all of your
scripts, db.py
lets you save credentials so you can use them later.
>>> db = DB(username="kermit", password="rainbowconnection", hostname="dw.themuppets.com", dbtype="mysql")
>>> db.save_credentials(profile="muppetdw")
# in a new session
>>> db = DB(profile="muppetdw") # voila!
Hope you like it
We're still working on support for MS SQL. If you have ideas for features or are
interested in getting the latest updates, check out the db.py
github page.
Special thanks to Eli Bressert from StichFix
for giving us the idea for db.py
!