A few tips on working with a huge database


#1

Hi all, it’s the first time I am working with a huge database, with over 10 million records.
I am already reading up on this through several sources.
I am using Lucee and MySQL. I’m not asking for a complete solution, but some tips would be great.

  • How to enable users to search all items without overloading the db?
  • How to enable navigation with pagination and limit the results?
  • In this case, should I be thinking of using extensions like Ehcache, Lucene, etc.?

Thanks


#2

How many results do you need to show on screen?

A couple of quick things:

  • SQL indexes
  • Limit rows on every query (SELECT TOP, LIMIT, etc.)
  • Only select the fields you need
  • Cache common queries.
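
To make the list above concrete, here is a minimal sketch using Python’s built-in sqlite3 as a stand-in for MySQL (the `items` table and its columns are invented for illustration, not from the thread):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table standing in for the real 10-million-row one.
cur.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, price REAL, description TEXT)"
)
cur.executemany(
    "INSERT INTO items (name, price, description) VALUES (?, ?, ?)",
    [(f"item{i}", i * 1.5, "long text " * 20) for i in range(500)],
)

# Index the column you search on, so lookups need not scan the whole table.
cur.execute("CREATE INDEX idx_items_name ON items (name)")

# Select only the fields you need (not the wide description column),
# and cap the number of rows returned.
rows = cur.execute(
    "SELECT id, name FROM items WHERE name LIKE ? ORDER BY id LIMIT 40",
    ("item1%",),
).fetchall()
```

The same pattern carries over to MySQL: an index on the searched column, an explicit column list, and a LIMIT on every user-facing query.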

#3

Showing 40 or 80 items per page would do. Normally I just create navigation through an entire record set of hundreds or thousands, but that won’t work right now. Thanks for your tips, I’m gonna check them out.


#4

This is a good article to check out for optimized pagination:
https://www.xarg.org/2011/10/optimized-pagination-using-mysql/

I think caching makes sense when your queries take too long or you have a load problem.
Otherwise I would not start caching.


#5

Other things…

  • You can use TOP, LIMIT, etc. and pop up a message if the recordcount equals the limit, telling the user to refine the search.
  • Decide for each screen how much info is too much. One screen could allow a thousand rows where 200 might be too much on another.
  • Make sure the search boxes don’t allow fewer than 3 characters or so, if needed.
  • Make sure you test with a slow database server during development. I once had a customer complain that a query took all resources for 30 seconds on their box; on our development DB server it took 5 minutes. Fixing one index and some query housekeeping took that down to 30 seconds on our box, which I think was under 5 seconds on the customer’s DB server.
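
The first and third bullets above can be sketched together. This is a hypothetical illustration in Python/sqlite3 (the table, the 200-row cap, and the message wording are all made up, not from the thread):

```python
import sqlite3

LIMIT = 200  # hypothetical per-screen cap; decide this per screen as suggested above

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items (name) VALUES (?)",
                 [(f"item{i}",) for i in range(1000)])

def search(term):
    # Reject very short terms so a 1- or 2-character search can't sweep the table.
    if len(term.strip()) < 3:
        return [], "Please enter at least 3 characters."
    rows = conn.execute(
        "SELECT id, name FROM items WHERE name LIKE ? LIMIT ?",
        (term + "%", LIMIT),
    ).fetchall()
    # If the recordcount equals the limit, the result is probably truncated:
    # tell the user to revise the search.
    if len(rows) == LIMIT:
        return rows, f"Showing the first {LIMIT} matches; please refine your search."
    return rows, ""

rows, message = search("item")  # matches all 1000 rows, so it gets capped
```

In CFML the same check is the `if recordcount eq limit` test the poster describes; the shape of the logic is identical.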

#6

A few other suggestions:

  • Use views whenever possible
  • Use Query of Query
  • Database & application tuning will help with response time & hardware requirements
  • Rewrite slow queries
  • Instead of “SELECT * FROM XXX”, use “SELECT column1, column2, column3, … FROM XXX”


#7

Use Query of Query



#8

Wow, everyone, thanks so much for all the useful answers; this will do more than get me started. I’m gonna check all this info out, thanks.


#9

I do not think QoQ “sucks” in all instances.

We run applications with 100+ GB databases on public front-end sites, and QoQ works just fine under both ACF and Lucee, running both MySQL & MSSQL.

I am sure there are databases where QoQ would not work well, such as high-memory clustered modern databases, some NoSQL databases, and other modern databases that have addressed many of the legacy problems of traditional databases.


#10

QoQ is handled by Lucee with the help of HSQLDB, regardless of the database you use.


#11

Beware of using OFFSET for pagination: it does not scale. The database engine practically has to gather all of the results and then toss most of them away.

QoQ can be OK in edge cases, but when your goal is performance it should be avoided. Use the right tool for the right job. QoQ is not the right tool.

Create the indexes that you need for faster search, but do not add unnecessary indexes as they would hurt DML queries.

Understand the actual needs of your system and optimize it accordingly. For example, does your database have many more READ operations than WRITE? If so, you can use replication and add more databases in “read-only” mode, then load-balance your system by directing SELECTs into the read-only databases and DML operations to the master database.
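
The OFFSET point above can be sketched with Python’s sqlite3 standing in for MySQL (table and names invented for illustration). Keyset (“seek”) pagination remembers the last id shown instead of counting past skipped rows with OFFSET:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)",
                 [(i, f"item{i}") for i in range(1, 1001)])

PAGE_SIZE = 40

# OFFSET pagination: the engine must still walk past every skipped row,
# so deep pages get slower and slower.
page_offset = conn.execute(
    "SELECT id, name FROM items ORDER BY id LIMIT ? OFFSET ?",
    (PAGE_SIZE, 400),
).fetchall()

# Keyset pagination: remember the last id shown and seek past it.
# With an index on id, the engine jumps straight to the starting point.
last_seen_id = 400  # taken from the last row of the previous page
page_keyset = conn.execute(
    "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
    (last_seen_id, PAGE_SIZE),
).fetchall()
```

Both queries return the same page; only the keyset version stays fast on deep pages, which is the technique the article linked earlier in the thread describes.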


#12

Thanks for that addition, will check it out :+1:


#13

Use stored procedures (CFSTOREDPROC) or CFQUERY with EXECUTE, rather than a standard ColdFusion query.

Index your database tables:

https://www.bennadel.com/blog/3467-the-not-so-dark-art-of-designing-database-indexes-reflections-from-an-average-software-engineer.htm

And never loop over your queries when writing to the database. Use a WHERE … IN () clause with multiple IDs, or a subquery.
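
A sketch of that single-statement approach, using Python’s sqlite3 in place of MySQL (the `items` table and the batch of ids are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, archived INTEGER DEFAULT 0)")
conn.executemany("INSERT INTO items (id) VALUES (?)", [(i,) for i in range(1, 101)])

ids_to_archive = [3, 7, 42, 99]  # hypothetical batch of ids collected earlier

# One parameterized statement instead of one UPDATE per id in a loop.
# Only the "?" placeholders are interpolated into the SQL string;
# the id values themselves stay safely bound as parameters.
placeholders = ", ".join("?" for _ in ids_to_archive)
conn.execute(
    f"UPDATE items SET archived = 1 WHERE id IN ({placeholders})",
    ids_to_archive,
)

archived = conn.execute(
    "SELECT COUNT(*) FROM items WHERE archived = 1"
).fetchone()[0]
```

One round trip and one statement, rather than a loop of per-id writes; in CFML the equivalent is a single cfquery with a list-valued cfqueryparam.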


#14

Thanks Sir, will check it out :+1:


#15

I would add to the limit suggestion and advise you to really look at your listing page(s). Rare is the case that a user will need or want to page through thousands, let alone millions, of records. So what we’ve done is put a default limit of 1,000 rows on all our search queries, and we only remove that limit when the user expressly needs more records. In our system, all of our listing pages have a parenthetical that discloses this 1,000-row limit (which is what they page through on the client). If the user needs more, they have two choices. One is to download a CSV, in which case we remove the limit and send them the complete data set (returned in real time, or via a scheduled download if the result set is enormous); the other is to simply change their filter criteria.

The other thing I would suggest, from a lot of painful experience, is to avoid running queries automatically on such pages. In other words, design your search listing pages so that users have to press a search button after selecting their filters, rather than having the page kick off a default search with little or no filtering. It sounds basic, but that alone will prevent a lot of server load in the form of one default search followed by one tailored to their needs.

Now reports are another matter. For reports where you might be returning a ton of data, it might be best to send the entire result set and forgo paging altogether. And when the result set is simply too long (and you need to think about what that means), you can still handle it by informing the user that the report has been scheduled and will be emailed to them when it is done being generated. I mention this because there is a definite trade-off in performance when trying to support online reporting where the data is very large. In many cases you might be better served sending your users a CSV they can load into Excel and letting them do their thing. Hardcore analysts might even thank you for it.


#16

Awesome additional info, thanks :+1:


#17

Hi,

I got Lucee working with an Elasticsearch “database”. I let MySQL and Lucee rebuild the Elasticsearch index every day (via a cron job), and the visitors of the website query Elasticsearch directly. Lightning fast!


#18

Interesting indeed, will check it out :+1:


#19

@MvdO FWIW, 10 million records in a database is hardly “huge”.

There are most likely many things that you can do with your database to improve its performance, before adding other components that will add new layers of complexity to your system.


#20

Thanks, to me it feels huge :grin: I am sure there are much bigger DBs, though. I have already received many tips on improving performance, so that will cover what I can do with the db itself before adding other components.