
Importing data from a MySQL database into a Pandas data frame including column names

python
pandas
sqlalchemy
data-import
by Nikita Barsukov · Nov 23, 2024
TLDR

Want to import data from a MySQL database into a Pandas DataFrame, including the column names? Just use pandas.read_sql_query():

```python
import pandas as pd
import sqlalchemy

# Create SQLAlchemy engine -- could also be a lawnmower engine, but not recommended.
engine = sqlalchemy.create_engine('mysql+pymysql://username:password@host/db_name')

# SQL query statement
sql = "SELECT * FROM table_name;"  # Let's grab everything

# Import data directly into a DataFrame
df = pd.read_sql_query(sql, engine)
```

Just replace username, password, host, db_name, and table_name with your credentials and the table you wish to query. The result? df is decked out with all your database's table columns. Cool, huh?

Connection string

We're using SQLAlchemy to create a connection engine. The engine URL incorporates both authentication and database location:

'mysql+pymysql://username:password@host/db_name'

The format is dialect+driver://username:password@host/db_name, where you swap in your own details. SQLite users, don't feel left out: replace mysql+pymysql with sqlite:/// (no host or credentials needed) and you're good to go.
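As a quick illustration, here's the same read_sql_query pattern against an in-memory SQLite engine, so you can sanity-check it without a MySQL server (the table and column names here are made up for the demo):

```python
import pandas as pd
import sqlalchemy

# In-memory SQLite engine -- same API as MySQL, zero setup
engine = sqlalchemy.create_engine('sqlite:///:memory:')

# Seed a throwaway table so there's something to read back
pd.DataFrame({'id': [1, 2], 'name': ['a', 'b']}).to_sql('table_name', engine, index=False)

df = pd.read_sql_query('SELECT * FROM table_name;', engine)
print(df.columns.tolist())  # column names come along for free
```

Swap the URL back to mysql+pymysql://... and the rest of the code stays identical.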

Efficient strategies for big data retrieval

Facing a gigantic dataset? Use a WHERE clause (and an explicit column list instead of SELECT *) to retrieve only the rows and columns you need. Also, consider batching the import with the chunksize parameter of read_sql_query:

```python
# Retrieving data in chunks
for chunk in pd.read_sql_query(sql, engine, chunksize=1000):  # Chunks, not Chunky Pandas
    process(chunk)  # Replace this with your processing code. Magic not included.
```

Error handling and datatype assurance

Oh, and be prepared for the unpredictable. Handle exceptions from database connections and queries. And help Pandas correctly infer datatypes by passing a dtype mapping:

df = pd.read_sql_query(sql, engine, dtype={'column_name': 'datatype'})

Replace 'column_name' with your column name and 'datatype' with a Pandas-compatible type such as np.float64 or str (the dtype argument of read_sql_query requires pandas 2.0 or newer). Don't let Pandas guess; it's not a soothsayer.
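For the error-handling side, here's a minimal sketch: SQLAlchemy raises subclasses of SQLAlchemyError for connection failures, bad credentials, and broken queries alike, so one except clause covers them. (An in-memory SQLite engine stands in for your MySQL one; no_such_table is deliberately missing.)

```python
import pandas as pd
import sqlalchemy
from sqlalchemy.exc import SQLAlchemyError

engine = sqlalchemy.create_engine('sqlite:///:memory:')  # stand-in for your MySQL engine

try:
    df = pd.read_sql_query('SELECT * FROM no_such_table;', engine)
except SQLAlchemyError as e:
    # Connection refused, bad credentials, missing table -- they all land here
    print(f'Query failed: {e}')
    df = pd.DataFrame()  # fall back to an empty frame instead of crashing
```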

Advanced stuff

Got the basics? Let's move to more advanced techniques, turn that code up to 11.

SQLAlchemy ORM for complex queries

For complex queries and abstractions, fancy working with Python classes as database models? SQLAlchemy's ORM has you covered.

```python
import pandas as pd
import sqlalchemy
from sqlalchemy import Column, Integer
from sqlalchemy.orm import declarative_base, sessionmaker

engine = sqlalchemy.create_engine('mysql+pymysql://username:password@host/db_name')

Base = declarative_base()

# Define your model - Robot not included
class MyTable(Base):
    __tablename__ = 'my_table'
    id = Column(Integer, primary_key=True)
    # define other columns here

Session = sessionmaker(bind=engine)
session = Session()

# Querying feels like ORM-going on a date
query = session.query(MyTable)
df = pd.read_sql(query.statement, engine)
```

Parameterized queries

Injecting variables into your SQL statements? Use parameterized queries and keep it all safe:

```python
import sqlalchemy

params = {"limit_rows": 10}
# Wrap in sqlalchemy.text() so the :name placeholders are bound properly
sql = sqlalchemy.text("SELECT * FROM table_name LIMIT :limit_rows;")  # Limit? Trying to diet?
df = pd.read_sql_query(sql, engine, params=params)
```

Here, :limit_rows is a named parameter to be substituted with a value from params. Keeps it safe from SQL injection -- needles are scary.

Time series data handling

For time series data, Pandas is your BFF. Work your magic after import:

```python
# Presuming 'my_date_column' is your datetime column -- or your time machine.
df['my_date_column'] = pd.to_datetime(df['my_date_column'])
df.set_index('my_date_column', inplace=True)  # Time travellers, set your coordinates!
```
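Once the datetime index is set, resampling and rolling windows open up. A small sketch with a toy frame (my_date_column and value are made-up names standing in for your imported data):

```python
import pandas as pd

# Toy frame standing in for the imported data
df = pd.DataFrame({
    'my_date_column': ['2024-01-01', '2024-01-01', '2024-01-02'],
    'value': [10, 20, 30],
})

df['my_date_column'] = pd.to_datetime(df['my_date_column'])
df.set_index('my_date_column', inplace=True)

daily = df.resample('D')['value'].sum()  # one row per calendar day
print(daily)
```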