by Caleb Jaffa

Importing Data to MySQL with Ruby

When I first started coding with Ruby regularly my code was not always The Ruby Way. One task I’ve come back to time and time again is importing data from a text file into a database. Though I can’t recall the specific code used, I believe my first attempt loaded the whole file into memory before running the INSERTstatements. This was needlessly expensive on the machine doing the processing and slow. I had started the process before leaving work and on my commute home puzzled out the better way to do it. The new version that read a line in, inserted the data in the database and then went on to the next line forgetting about the previous took much less time to execute and worked wonderfully.

Later I took over primary development of a large set of Ruby scripts that more or less managed importing and exporting data from MySQL using flat files. This system communicated with corporate headquarters and individual stores about everything, customers, orders, products, special pricing agreements, etc. Processing time varied on file size, but most files took a couple of minutes to process even on new hardware and the largest set of files, a potentially daily input of data, could take up to an hour to get through. Now there was some processing with this data to populate some fields and sometimes joins were necessary to set things up right.

However the client was looking to be able to take this process from a handful of sites to potentially hundreds. The system’s architecture was changed so we could throw more hardware to scale. I also looked for optimizations that could be made. It made sense to do things they way they were, Ruby read in a line of data, performed any calculations or string concatenations and then inserted the data in the database. There was something faster though. MySQL’s LOAD DATA INFILE was brought in to do all the heavy lifting of getting data from the flat files into the database. Then aSQL query could be run to calculate any fields that needed to be calculated. This cut processing time overall for the most common file from minutes to seconds. It was a good reminder of how usually if your database server can do it, it’ll probably do it faster than whatever language you’re working with.

Comments are closed.