Tips and Tricks for Analyzing Large Databases

Recently I have been working with a MySQL database that contains tables with millions of rows. VLDB (very large databases) is a research area in its own. However, I am not dealing with VLDB, which generally entails billions of rows per table with millions of columns. I am working with that middle ground which is neither very big nor very small.

A side note: If you wish to work with VLDB, consider MapReduce. MapReduce cannot solve EVERY problem, but it is an interesting framework nonetheless.

Data Replication: When dealing with a large MySQL database, it is important that we do not query/analyze a live database i.e., database in which live inserts/updates are in process. MySQL locks the entire table when doing an insert/update or select. Thus, querying a large live database will affect the query performance as well as the insert performance. In this scenario, it will be better to replicate the database as follows:

Note: paths are specific to Ubuntu 8.10 and MySQL 5.0.67

mysqldump /path/to/database > <databasename>.out

datadir variable in /etc/mysql/my.cnf contains the value of path to database. The default path is /var/lib/mysql.

Under mysql, create a new database. Then do source <databasename>.out. This will load all the tables and data in the new database.

Indexing: Indexing is important to retrieve rows quickly. When a particular column is indexed, queries can retrieve rows directly, instead of scanning each row and comparing the value. However, figuring out proper indexes is a tricky business. To aide developers, MySQL provides an explain statement. When we give any SELECT statement to explain, we get query optimizer’s best guess about the number of rows that need to be scanned and the indexes that will be used. Explain statement is described in more detail here.

Proper Database Engine: MySQL provides a multitude of storage engines that are appropriate for different scenarions. The default MyISAM engine is appropriate for large number of SELECTs and small number of INSERTs/UPDATEs. This is the most appropriate model for web usage, which is the main reason behind MySQL’s popularity. However, note that in our replicated database, there are absolutely no INSERTs. In addition, MyISAM uses table level locking to preserve data integrity i.e., for each insert/update and select, MyISAM locks the entire table. Table level locks can seriously affect the performance of parallel queries. In our case, we can work without any locks at all, because we are only reading data, which cannot affect the integrity of data. The best we can do with MySQL is to use InnoDB engine instead of MyISAM. InnoDB uses row-level locking instead of table-level locking, which can boost the performance of parallel queries. Few performance numbers for MyISAM and InnDB are provided here.

A MyISAM table can be converted into an InnDB table as follows:

ALTER TABLE <table name> ENGINE=InnoDB;

Note: Engine conversion takes long time. It took almost 1 whole day to convert a 33 million row MyISAM table to InnoDB table.

Python and MySQL – A match made in heaven: Python’s MySQL API makes it really easy to query the database. However, when issuing multiple independent queries it is importat to use a new a connection and a new cursor. Also, make sure to close the cursor once the query is completed.

Limits are bad bad bad: The Limit clause in a select statement allows us to retrieve a certain part of result set. For example, limit 1000 10 will return 10 rows starting from the 1000th row. In my opinion, limit clause is only useful for visual isnpection of results. However, if we issue limit queries during analysis, then it only slows down the process. For example, if the final results contain 33 million rows and we use a limit clause for 1000 rows, then instead of issuing 1 qeury, we will issue 33,000 queries. Very very bad!!!

If you know that the query will return large number of results which would overwhelm your local memory, then use SSCursor instead of default cursor. SSCursor stores the query results on the server, instead of the client. When fetchone or fetchmany calls are issued, the local client fetches the rows from the server instead of the local memory.

Design guidelines are not bible: Database normalization always leads to beautiful database design. However, it also leads to multiple joins. If you are joining multiple tables, which contain millions of rows, for majority of queries, then it might be helpful to combine those tables into one. While it is a bad design choice, it will seriously improve query performance.

If you have other suggestions for improving query performance over a large database, please post them in comments section.

http://www.mysqlperformanceblog.com/2006/05/29/join-performance-of-myisam-and-innodb/
Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.