
Importing data from a MySQL database into a Pandas data frame including column names

python
pandas
sqlalchemy
data-import
by Nikita Barsukov · Nov 23, 2024
TLDR

Want to import data from a MySQL database into a Pandas DataFrame, including the column names? Just use pandas.read_sql_query():

```python
import pandas as pd
import sqlalchemy

# Create SQLAlchemy engine -- could also be a lawnmower engine, but not recommended.
engine = sqlalchemy.create_engine('mysql+pymysql://username:password@host/db_name')

# SQL query statement
sql = "SELECT * FROM table_name;"  # Let's grab everything

# Import data directly into a DataFrame
df = pd.read_sql_query(sql, engine)
```

Just replace username, password, host, db_name, and table_name with your credentials and the table you wish to query. The result? df is decked out with all your database's table columns. Cool, huh?

Connection string

We're using SQLAlchemy to create a connection engine. The engine URL incorporates both authentication and database location:

'mysql+pymysql://username:password@host/db_name'

The format is dialect+driver://username:password@host/db_name, where you swap in your own details. SQLite users, don't feel left out: replace mysql+pymysql with sqlite:/// (no host or credentials needed) and you're good to go.
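As a quick illustration, here's the same read_sql_query pattern against an in-memory SQLite engine, so you can sanity-check it without a MySQL server (the table and column names here are made up for the demo):

```python
import pandas as pd
import sqlalchemy

# In-memory SQLite engine -- same API as MySQL, zero setup
engine = sqlalchemy.create_engine('sqlite:///:memory:')

# Seed a throwaway table so there's something to read back
pd.DataFrame({'id': [1, 2], 'name': ['a', 'b']}).to_sql('table_name', engine, index=False)

df = pd.read_sql_query('SELECT * FROM table_name;', engine)
print(df.columns.tolist())  # column names come along for free
```

Swap the URL back to mysql+pymysql://... and the rest of the code stays identical.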

Efficient strategies for big data retrieval

Facing a gigantic dataset? Use a WHERE clause (and an explicit column list instead of SELECT *) to retrieve only the rows and columns you need. Also, consider batching the import with the chunksize parameter of read_sql_query:

```python
# Retrieving data in chunks
for chunk in pd.read_sql_query(sql, engine, chunksize=1000):  # Chunks, not Chunky Pandas
    process(chunk)  # Replace this with your processing code. Magic not included.
```

Error handling and datatype assurance

Oh, and be prepared for the unpredictable. Handle exceptions from database connections and queries. And help Pandas correctly infer datatypes by passing a dtype mapping:

df = pd.read_sql_query(sql, engine, dtype={'column_name': 'datatype'})

Replace 'column_name' with your column name and 'datatype' with a Pandas-compatible type such as np.float64 or str (the dtype argument of read_sql_query requires pandas 2.0 or newer). Don't let Pandas guess; it's not a soothsayer.
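For the error-handling side, here's a minimal sketch: SQLAlchemy raises subclasses of SQLAlchemyError for connection failures, bad credentials, and broken queries alike, so one except clause covers them. (An in-memory SQLite engine stands in for your MySQL one; no_such_table is deliberately missing.)

```python
import pandas as pd
import sqlalchemy
from sqlalchemy.exc import SQLAlchemyError

engine = sqlalchemy.create_engine('sqlite:///:memory:')  # stand-in for your MySQL engine

try:
    df = pd.read_sql_query('SELECT * FROM no_such_table;', engine)
except SQLAlchemyError as e:
    # Connection refused, bad credentials, missing table -- they all land here
    print(f'Query failed: {e}')
    df = pd.DataFrame()  # fall back to an empty frame instead of crashing
```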

Advanced stuff

Got the basics? Let's move to more advanced techniques, turn that code up to 11.

SQLAlchemy ORM for complex queries

For complex queries and abstractions, fancy working with Python classes as database models? SQLAlchemy's ORM has you covered.

```python
import pandas as pd
import sqlalchemy
from sqlalchemy import Column, Integer
from sqlalchemy.orm import declarative_base, sessionmaker

engine = sqlalchemy.create_engine('mysql+pymysql://username:password@host/db_name')

Base = declarative_base()

# Define your model - Robot not included
class MyTable(Base):
    __tablename__ = 'my_table'
    id = Column(Integer, primary_key=True)
    # define other columns here

Session = sessionmaker(bind=engine)
session = Session()

# Querying feels like ORM-going on a date
query = session.query(MyTable)
df = pd.read_sql(query.statement, engine)
```

Parameterized queries

Injecting variables into your SQL statements? Use parameterized queries and keep it all safe:

```python
import sqlalchemy

params = {"limit_rows": 10}
# Wrap in sqlalchemy.text() so the :name placeholders are bound properly
sql = sqlalchemy.text("SELECT * FROM table_name LIMIT :limit_rows;")  # Limit? Trying to diet?
df = pd.read_sql_query(sql, engine, params=params)
```

Here, :limit_rows is a named parameter to be substituted with a value from params. Keeps it safe from SQL injection -- needles are scary.

Time series data handling

For time series data, Pandas is your BFF. Work your magic after import:

```python
# Presuming 'my_date_column' is your datetime column -- or your time machine.
df['my_date_column'] = pd.to_datetime(df['my_date_column'])
df.set_index('my_date_column', inplace=True)  # Time travellers, set your coordinates!
```
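Once the datetime index is set, resampling and rolling windows open up. A small sketch with a toy frame (my_date_column and value are made-up names standing in for your imported data):

```python
import pandas as pd

# Toy frame standing in for the imported data
df = pd.DataFrame({
    'my_date_column': ['2024-01-01', '2024-01-01', '2024-01-02'],
    'value': [10, 20, 30],
})

df['my_date_column'] = pd.to_datetime(df['my_date_column'])
df.set_index('my_date_column', inplace=True)

daily = df.resample('D')['value'].sum()  # one row per calendar day
print(daily)
```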