
Python pandas to_sql with sqlalchemy: how to speed up exporting to MS SQL?

python
to_sql
sqlalchemy
performance
by Nikita Barsukov · Dec 15, 2024
TLDR

Rocket your pandas to MS SQL export speed using SQLAlchemy's fast_executemany, redefining the meaning of bulk inserts and minimizing database round-trips.

# The engine is now a rocket, ready for efficient bulk inserts
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://user:pass@server/db?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,
)

# Using our rocket engine to send the DataFrame to MS SQL
df.to_sql('your_table', engine, if_exists='append', index=False)

Note: fast_executemany needs a reasonably recent pyodbc (4.0.19+) and a Microsoft ODBC Driver for SQL Server, so check compatibility first. Remember, adjusting the connection string to match your credentials is not rocket science.

Pump up the speed with 'multi' method and chunk size

Use method='multi' in to_sql and shoot multiple rows per INSERT statement.

Fewer statements, less waiting

df.to_sql('your_table', engine, if_exists='append', method='multi', index=False)

pandas now packages rows in fewer INSERT statements, effectively speeding up the export.

Beating large datasets with chunksize

Define chunksize to stay under the database's parameter limit (SQL Server allows roughly 2,100 parameters per statement, so chunk size × number of columns must fit below that) and to avoid choking the transaction log with a beast of a dataset.

# The chunky solution for efficient export
df.to_sql('your_table', engine, if_exists='append', method='multi', chunksize=1000, index=False)

What's your perfect chunk size? Test, observe, repeat!
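As a starting point, you can derive a safe upper bound from the parameter limit. A minimal sketch, assuming df and engine from above:

# SQL Server caps a statement at ~2,100 parameters, and method='multi'
# spends one parameter per cell (rows x columns), so size chunks accordingly.
max_params = 2100
safe_chunksize = max(1, (max_params // len(df.columns)) - 1)

df.to_sql('your_table', engine, if_exists='append',
          method='multi', chunksize=safe_chunksize, index=False)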

Crank up performance: data types, CSVs, and dark arts

Lighter Data types: The need for speed

Scrutinize your dataframe's data types. Opt for less resource-intensive types like INT over BIGINT, or FLOAT over DECIMAL. Less is more!
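For instance, pandas can downcast oversized numeric columns before export. A sketch with hypothetical column names:

import pandas as pd

# Downcast 64-bit defaults to leaner types where the values allow it
df['user_id'] = pd.to_numeric(df['user_id'], downcast='integer')  # e.g. int64 -> int32
df['price'] = pd.to_numeric(df['price'], downcast='float')        # e.g. float64 -> float32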

CSV and BULK INSERT: The endgame

Large data sets weighing you down? Try exporting the data as CSV first and leaving the heavy lifting to SQL server's BULK INSERT functionality.
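A hedged sketch of that hand-off, assuming the CSV path is readable by the SQL Server instance itself (a requirement of BULK INSERT) and reusing the engine from above:

from sqlalchemy import text

# Dump the DataFrame to a flat file first
df.to_csv('C:\\data\\your_table.csv', index=False, header=False)

# Then let SQL Server ingest it in one shot
with engine.begin() as conn:
    conn.execute(text("""
        BULK INSERT your_table
        FROM 'C:\\data\\your_table.csv'
        WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', FIRSTROW = 1)
    """))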

When in doubt, d6tstack

Explore d6tstack, the shortcut for pandas to SQL operations, featuring a superfast pd_to_mssql function.

Pre-insertion data processing: Stay ahead of the curve

Employ apply_after_read from d6tstack to preprocess data on the way in, improving both data compatibility and insertion speed.

Go full throttle with SQLAlchemy and pandas

SQLAlchemy: Stay Informed, Stay Fast

Refer to the SQLAlchemy documentation to stay in tune with fast_executemany and other performance-enhancing updates.

Pandas Update: Snooze and you lose

Always sport the latest pandas version for performance upgrades and new features like method='multi'.

Fine-tune For Optimized Performance

Measure it to improve it

Use Python's time.time() or the timeit module to measure export timings. Embrace the nerdy quantitative feedback.
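A minimal timing sketch around the export call:

import time

start = time.time()
df.to_sql('your_table', engine, if_exists='append', index=False)
print(f"Export took {time.time() - start:.2f} s")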

Connection Tuning: Dial in the perfect connection

Try different parameters in the create_engine connection string to adjust workload, pooling options, or timeout settings.
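A sketch of the kinds of knobs available; the values here are placeholders to tune against your own workload:

from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://user:pass@server/db?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,
    pool_size=10,                   # connections kept open in the pool
    max_overflow=20,                # extra connections allowed under load
    pool_recycle=3600,              # recycle connections after an hour
    connect_args={"timeout": 30},   # login timeout passed through to pyodbc
)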

Don't create an escort for your data

Set index=False in to_sql so pandas doesn't write the DataFrame index as an extra column in the SQL table. No one likes unnecessary luggage!