Apr 03 2014

A few days ago I wanted to upgrade from PostgreSQL 9.1 to PostgreSQL 9.3, an action that should, in principle, be easy (famous last words ;)
As I also had PostGIS 2.0 installed, of course this ended in problems and errors. Luckily, I found the solution after a lot of searching around the net and trial and error, so let's see if this helps anybody.

 

First of all, if after updating PostgreSQL you try to recompile and reinstall PostGIS 2.0, you will probably get an error similar to this one in the "make install" step:

ERROR:  could not load library [...] undefined symbol: GETSTRUCT

What this error is really trying to say is "your PostGIS version is outdated for this PostgreSQL version, please update".

Basically, PostgreSQL 9.3 requires PostGIS 2.1 (I installed 2.1.1).

 

And because history tends to repeat itself, there is a very handy compatibility chart between both pieces of software at the PostGIS wiki.

Mar 29 2014

I've always despised Windows' scripting capabilities for being really crappy: either the .BAT files from the old MS-DOS days, which were OK in the early 90s but became obsolete quite fast, or the dreaded .VBS, which took the worst of both worlds (Visual Basic and OS scripting) and gave birth to ugly, awful scripts.

I've always admired how Linux has really powerful scripting capabilities, and questioned why Microsoft didn't copy them. Probably because of their "eat your own dog food" policy, even if that food is VB.

Now that I work with Linux many hours a day, I'm starting to understand why people sometimes build scripts instead of "real tools" with compiled binaries: you shouldn't need a compiler and all the extra build steps for simple tasks (or to change them in the future).

 

PowerShell has been around for quite a few years, but apart from reading about its capabilities I never had any urge to use it. While it really takes a bit to "boot up" (more than a normal command prompt, that is), and the errors are sometimes too verbose, I've started going through some quick tutorials and specific how-tos in order to stop building C# binaries for trivial tasks.

Last year I built a small image thumbnail generator that grabbed all the JPGs in a folder and stored 150x150 thumbnails in a subfolder, creating it if it didn't exist, and saving a "log" file.

This week, I used that code as the base for an image watermarking script: put a "watermark.png" image in the folder with the JPGs you want to watermark, and it will stamp it onto the lower-right corner of each image, saving the results in a subfolder.

 

You can check out both scripts in my small PowerShell GitHub repo.

 

Another thing that sucks on Windows and that PS fixes is mass file renaming. The following one-liner replaces whitespace with underscores:

Dir | Rename-Item -NewName { $_.name -replace " ","_" }

 

While any Linux user might joke about how fat and slow it is compared with .sh scripts, the truth is that having access to the full .NET Framework allows you to do really impressive things. For example, on Linux I would have had to install ImageMagick to do any image manipulation ;)

 

I will probably migrate more of my C# snippets to PowerShell the next time I use them. That way I won't need Visual Studio installed for quick changes; I do have it in a VM, but I'm getting too lazy to boot a virtual OS just for that.

Mar 20 2014

Some database systems, like MySQL, allow using LIMIT in UPDATE queries, which is quite handy (LIMIT also supports offsets, so things like pagination are dead simple), but not all major DB engines allow it.

There aren't that many uses for this, but one is performing batched updates for resource-intensive or long-running queries (to avoid exhausting resources or hitting the dreaded statement timeout error).

 

One RDBMS that doesn't allow UPDATE with LIMIT is PostgreSQL, but in those cases there's also a solution: doing an UPDATE with a SELECT subquery:

UPDATE mytable SET field=TRUE
  WHERE id IN ( SELECT id FROM mytable LIMIT 500 )

 

Of course, that subquery can be as simple or as complex as you wish: you can ORDER BY, filter with WHERE field=FALSE, or apply more complex rules. It's up to you how you handle the logic, but you can replicate the behaviour.

 

At CartoDB we're starting to face huge data imports that spawn single tables growing to more than 5GB before they even finish being transformed and adapted to the system, so limiting the number of rows you georeference in a single batch, or avoiding running a GIS geometry transformation function over millions of rows at once (which can easily take quite a few minutes), is becoming a high priority.

 

Applying this idea of batching queries, I've come up with a pretty simple but working solution to batch any import-related query: just feed it some parameters and the query (optionally specifying where two placeholders go), and let it do the magic.

What I do to have fine control over which rows have already been updated is add a temporary column to the table and drop it after the last batch finishes. It gets an index (thanks for the idea, @luisico) because it will be accessed a lot:

ALTER TABLE table_name ADD column_name BOOLEAN DEFAULT FALSE;

CREATE INDEX idx_column_name ON table_name (column_name);

 

But while doing some tests and checking the running processes, I saw that Postgres was autovacuuming constantly during the import, which has a big impact on performance, so I made another optimization: disabling autovacuum on the table until all import tasks are done and the table is ready:

-- Before starting the import tasks
ALTER TABLE table_name SET (autovacuum_enabled=FALSE, toast.autovacuum_enabled=FALSE);

-- Once the table is ready
ALTER TABLE table_name SET (autovacuum_enabled=TRUE, toast.autovacuum_enabled=TRUE);
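
To give an idea of how everything fits together, here is a minimal sketch in Ruby (using the pg gem) of the kind of batching driver I mean; the table, column and connection details are just placeholders, not the actual CartoDB code:

require 'pg'

BATCH_SIZE = 500

conn = PG.connect(dbname: 'imports')  # hypothetical connection

# Temporary flag column + index to track which rows are already processed
conn.exec('ALTER TABLE table_name ADD column_name BOOLEAN DEFAULT FALSE')
conn.exec('CREATE INDEX idx_column_name ON table_name (column_name)')

# Disable autovacuum while the batched updates run
conn.exec('ALTER TABLE table_name SET (autovacuum_enabled=FALSE, toast.autovacuum_enabled=FALSE)')

loop do
  result = conn.exec(<<-SQL)
    UPDATE table_name SET field=TRUE, column_name=TRUE
    WHERE id IN ( SELECT id FROM table_name WHERE column_name=FALSE LIMIT #{BATCH_SIZE} )
  SQL
  break if result.cmd_tuples.zero?  # no unprocessed rows left
end

# Re-enable autovacuum and drop the temporary column once the table is ready
conn.exec('ALTER TABLE table_name SET (autovacuum_enabled=TRUE, toast.autovacuum_enabled=TRUE)')
conn.exec('ALTER TABLE table_name DROP COLUMN column_name')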

 

After these changes, I was able to process batches of 50,000 rows with normalizations and transformations in ~1 minute, where before the same dataset was timing out with statement_timeout set to 5 minutes.

 

Without batching, but still applying the index and disabling autovacuum, we can squeeze out more "rows processed per second", but we risk (and believe me, it happens) ending up with a dataset so big that it fails anyway, so I think it's better to go a little slower and make sure there's some hard limit on resource usage.

 

The task is far from over, because when you remove one bottleneck another usually appears, so further steps of the import process now need to be improved to handle multi-GB datasets and/or several million rows. But where's the fun if there's no challenge?

 

Bonus: other resources I'm researching to optimize importing data into Postgres are PostgreSQL server tune-up, Background Writer checkpoint configuration and Write-Ahead Log synchronization settings.

Feb 28 2014

When you build software, you often think about edge cases, limits and race conditions, but mostly at the lower bounds: division by zero, no input, wrong input that yields no valid data...

But there are upper bounds too, and they usually create nasty scenarios: out-of-memory errors, no disk space available, the operating system killing a process because it thinks it hung while eating all the CPU, buffer overflows...

You should set a hard limit for most things, even if it's an insanely high one, so you're covered in those situations. But a hard limit is not the same as a reasonable limit.

 

For example, at CartoDB we face imports of CSV and Excel files that sometimes weigh hundreds of megabytes. You always have quota limits, so we won't process a file bigger than your available disk quota, but have you ever stopped to think what 500MB actually represents?

I have a real-world example of an apparently harmless 7MB XLSX file that contains ~200K rows, each with 4 columns. I've also seen CSVs that store 2-3 million rows in less than 500MB.

Any of these cases requires a lot of processing just to open the file. Even native applications like OpenOffice struggle for minutes at 100% CPU. Now imagine a system that opens the file, reads the whole contents, normalizes and sanitizes them, inserts the data into a DB, and then performs georeferencing and other calculations.

The sad reality is that, in general, most huge imports will fail because they hit our hard limits, but usually only after wasting minutes not just of processing but also of user time spent waiting for the import to finish. And while we can always explain to a user who complains that they should split those huge files into smaller ones, import them individually and then merge the data into a single table (once inside PostgreSQL, there's no problem having millions of rows in the same place)... it would be better to detect unreasonable scenarios up front, don't you think?

 

Also, it's human nature to compare, so after a colleague told me to check how Google Drive handled huge Excel conversions, I got another "signal" that this was the way to go. This is what happens when you try to upload huge XLS/XLSX files to GDrive and wait 12 to 15 seconds:

[Image: Google Drive's XLSX error]

It seems Google estimates the size and, if it's over the limit, shows an error. Activating the new (beta) Google Sheets didn't help; the same file kept failing.

What's funnier is that Google really does guess instead of checking the actual contents, because I opened the file, left only ~398K cells (less than 200K rows with just two columns), and it kept complaining about the 400K-cell limit.

 

So, the mighty Google fails to import those files, and it fails quite fast: 12-15 seconds.

 

Well, I'm curious, so I decided to disassemble one XLSX, read about its format and see if I could do a fast scan to estimate the size myself, like Google apparently does (either that, or they use powerful machines but kill the conversion if it takes more than X seconds).

I was lucky, because inside each sheet's XML file in the worksheets folder there is a handy tag, within roughly the first 512 bytes, that says something like this:

<dimension ref="A1:D202464"/>

Which roughly means that this worksheet contains 4 columns and 202464 rows, a total of 809856 cells.

I just needed some Ruby code to:

  • Open the XLSX (which is a ZIP file)
  • Extract the first worksheet to a temporary folder
  • Read the first 512 bytes
  • Extract that dimension
  • Calculate the total cells of the sheet

This was by far the simplest part, so although the code is still missing error handling, documentation and tests to make sure it always works as intended, you can check out my xlsx-inspector class on my GitHub.
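
To give an idea of how that inspection works, here's a minimal sketch (not the actual xlsx-inspector code) that assumes the rubyzip gem and that the first worksheet lives at xl/worksheets/sheet1.xml:

require 'zip'  # rubyzip gem

# Convert an Excel column label ("A", "D", "AA"...) into a 1-based index
def column_index(label)
  label.chars.reduce(0) { |acc, c| acc * 26 + (c.ord - 'A'.ord + 1) }
end

# Estimate the total cells of an XLSX by reading only the <dimension> tag
# located near the start of the first worksheet's XML
def estimate_cells(xlsx_path)
  Zip::File.open(xlsx_path) do |zip|
    sheet = zip.glob('xl/worksheets/sheet1.xml').first
    return nil unless sheet

    header = sheet.get_input_stream.read(512)
    return nil unless header =~ /<dimension ref="([A-Z]+)(\d+):([A-Z]+)(\d+)"/

    cols = column_index($3) - column_index($1) + 1
    rows = $4.to_i - $2.to_i + 1
    cols * rows
  end
end

puts estimate_cells('bigfile.xlsx')  # => 809856 for ref="A1:D202464"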

 

Best of all is that it takes just 2 seconds on my laptop to inspect the same file that takes Google 12 seconds minimum, and it returns a more accurate estimation :D

I plan to build a gem and publish it so it's easier to include at CartoDB while I keep it updated, but meanwhile you can just download the code. It also only needs a single gem (rubyzip).

 

Doing this "detection" logic in pure C/C++ would probably take a fraction of a second, but as our main stack is either Ruby or Javascript, I'll leave it as it is.

 

So, if you are not going to be able to process something, if you can't win, fail fast and save resources.

 

Bonus track:
To do a dirty but fast CSV estimation, use the Linux command "wc -l yourfile.csv" to get the number of rows. If you want to multiply that by the number of columns, you first have to detect the field delimiter and then count the columns, as in the sketch below.
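
Something along these lines does the trick; the delimiter detection here is just a naive guess based on the first line:

# Dirty but fast CSV cell estimation: line count via `wc -l` plus a naive
# delimiter guess from the header line. Purely illustrative.
def estimate_csv_cells(path)
  rows = `wc -l "#{path}"`.split.first.to_i
  header = File.open(path, &:readline)
  delimiter = [',', ';', "\t", '|'].max_by { |d| header.count(d) }
  cols = header.split(delimiter).size
  rows * cols
end

puts estimate_csv_cells('yourfile.csv')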

Feb 14 2014

I've loved LEGO Mindstorms since I learned of their existence in the early 2000s. I own all three generations: the RCX with a custom firmware to code it in C, the NXT with a micro-JVM, and now the new EV3, with a Micro-SD card pending installation of a custom ARM Linux with Mono (for C#).

As I'm trying to push harder for Javascript in general as a cross-platform language, I decided instead to go for something different and harder: coding the robot from Javascript. I came across Andrew Nesbitt's EV3 Node library and decided to give it a try.

I had to switch the Bluetooth module, as the original one didn't compile under Windows*, and most of the code looks directly reverse-engineered or captured from Bluetooth communication, so I started digging into the EV3 operating system source code and decided to build my library almost from scratch to learn how messages are sent to and received by the brick.

After putting the idea on hold with my recent job switch, I rushed these past days to be able to show something at the Freakend event last weekend, and now I have quite a small codebase for the library, but at least it is organized and most of the "obscure" code is commented. The library only allows connecting, disconnecting, playing a tone and moving the motors, but it is a good starting point.

I plan to at least implement support for the basic sensors that come with the EV3 and for reading the state of the motors. I will not implement the whole OS library, as I don't have that much spare time, but I do want to code my brick only in Javascript, so eventually all the basic functionality will be present.

The most interesting stuff in the official sources can be found in the c_com library; at least, that's what I'm using as a guide for learning and building this lib.

 

You can find the library, along with a sample program that lets you move all four motors and play random beep tones with the keyboard, on my GitHub.

 

* That's why I had my fight with Windows installing Node.js.
