Feb 28 2014
When you build software, you often think about edge cases, limits and race conditions, but mostly at the lower bounds: division by zero, no input, malformed input that yields no valid data...
But there are upper bounds too, and they usually create nasty scenarios: out-of-memory errors, no hard disk space available, the operating system killing a process because it thinks it hung while eating all the CPU, buffer overflows...
You should set a hard limit for most things, even an insane one, so you are covered in those situations. But a hard limit is not the same as a reasonable limit.
For example, at CartoDB we face data imports of CSV and Excel files that sometimes weigh hundreds of megabytes. You always have quota limits, so we won't process a file bigger than your available disk quota, but have you ever stopped to think what 500MB actually represents?
I have a real-world scenario: an apparently harmless 7MB XLSX file that holds ~200K rows, each with 4 columns. I have also seen CSVs storing 2-3 million rows in less than 500MB.
Any of these cases requires a lot of processing just to open the file. Even native applications like OpenOffice struggle for minutes at 100% CPU. Now imagine a system that opens the file, reads the whole contents, normalizes and sanitizes them, inserts the data into a DB, and then performs georeferencing and other calculations.
The sad reality is that, in general, most huge imports will fail because they hit our hard limits, but usually only after wasting minutes, not only of processing but also of user time spent waiting for the import to finish. And while we can always explain to a user who complains that they should split those huge files into smaller ones, import them individually and then merge the data into one table (once inside PostgreSQL, there's no problem having millions of rows in the same place)... it would be better to detect unreasonable scenarios up front, don't you think?
Also, it is human nature to compare, so after a colleague told me to check how Google Drive handles huge Excel conversions, I got another "signal" that this was the way to go. This is what happens when you try to upload huge XLS/XLSX files to GDrive and wait between 12 and 15 seconds:
It seems Google guesses the size and, if it is off-limits, gives an error. Activating the new beta Google Sheets didn't help; the same file kept failing.
But what is funnier is that Google really guesses instead of checking the actual contents, because I opened the file, left only ~398K cells (fewer than 200K rows with just two columns) and it kept complaining about the 400K-cell limit.
So, the mighty Google fails to import those files, and it fails quite fast: 12-15 seconds.
Well, I'm curious, so I decided to disassemble one XLSX, read about its format and see if I could do a fast scan to estimate the size myself, like Google apparently does (either that, or they use powerful machines and kill the conversion if it takes more than X seconds).
I was lucky, because inside each sheet's XML file in the worksheets folder there is a handy tag within roughly the first 512 bytes that says something like this:
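```xml
<dimension ref="A1:D202464"/>
```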
Which means that, roughly, this worksheet will contain 4 columns and 202464 rows, a total of 809856 cells.
I just needed some Ruby code to:
- Open the XLSX (which is a ZIP file)
- Extract the first worksheet to a temporary folder
- Read the first 512 bytes
- Extract that dimension
- Calculate the total cells of the sheet
This was by far the simplest part, so although the code is missing error handling, documentation and tests to make sure it always works as intended, you can check my xlsx-inspector class on my GitHub.
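Roughly, a stripped-down sketch of those steps looks like this (it reads the worksheet straight from the ZIP instead of extracting it to a temporary folder, and assumes the standard xl/worksheets/sheet1.xml path and a range starting at A1):

```ruby
require 'zip' # rubyzip

# Rough sketch: estimate the total cells of an XLSX by reading only the
# <dimension> tag of its first worksheet, without parsing the whole file.
def estimate_cells(xlsx_path)
  header = nil
  Zip::File.open(xlsx_path) do |zip|
    entry = zip.find_entry('xl/worksheets/sheet1.xml')
    header = entry.get_input_stream.read(512) if entry
  end
  return nil unless header

  # e.g. <dimension ref="A1:D202464"/>
  match = header.match(/<dimension ref="A1:([A-Z]+)(\d+)"/)
  return nil unless match

  # Convert the column letters to a number (A=1 ... Z=26, AA=27 ...)
  cols = match[1].each_char.reduce(0) { |total, c| total * 26 + (c.ord - 'A'.ord + 1) }
  rows = match[2].to_i
  cols * rows
end

puts estimate_cells('huge_file.xlsx') # => 809856 for the file mentioned above
```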
Best of all, it takes just 2 seconds on my laptop to inspect the same file that Google needs at least 12 seconds for, and it returns a more accurate estimation :D
I plan to build a gem and publish it so it is easier for us to include it at CartoDB while I keep it updated, but meanwhile you can just download the code. It only depends on a single gem (rubyzip).
So, if you are not going to be able to process something, if you can't win, fail fast and save resources.
To do a dirty but fast CSV estimation, use the Linux command "wc -l yourfile.csv" to get the number of rows. If you want to multiply that by the number of columns, you have to detect the field delimiter first and then count the columns.
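Something like this quick Ruby sketch would do (with a very naive delimiter detection that just picks the most frequent of the usual suspects in the first line):

```ruby
# Dirty but fast CSV cell estimation: count lines with wc -l and guess
# the delimiter from the first line to get the number of columns.
def estimate_csv_cells(csv_path)
  rows = `wc -l #{csv_path}`.split.first.to_i
  first_line = File.open(csv_path, &:readline)
  delimiter = [',', ';', "\t"].max_by { |candidate| first_line.count(candidate) }
  cols = first_line.split(delimiter).size
  rows * cols
end

puts estimate_csv_cells('yourfile.csv')
```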
Feb 14 2014
I have loved LEGO Mindstorms since I learned of their existence in the early 2000s. I own all three of them: the RCX with custom firmware to code it in C, the NXT with a micro-JVM, and now the new EV3, with a Micro-SD card waiting to be loaded with a custom ARM Linux with Mono (for C#).
I had to switch the Bluetooth module because the original one didn't compile under Windows *, and most existing code looks like it was directly reverse-engineered or captured from Bluetooth communication, so I got into reading the EV3 operating system source code and decided to build my library almost from scratch, to learn how the messages are sent to and received by the brick.
After putting the idea on hold with my recent job switch, I rushed these past days to be able to show something at the Freakend event last weekend, and now I have quite a small codebase for the library, but at least it is organized and most of the "obscure" code is commented. The library only allows connecting, disconnecting, playing a tone and moving the motors, but it is a good starting point.
The most interesting parts of the official sources can be found in the c_com library; at least that's what I'm using as a guide for learning and building this lib.
You can find the library, along with a sample program that lets you move all four motors and play random beep tones with the keyboard, on my GitHub.
* That's why I had my fight with Windows installing Node.js.
Dec 22 2013
I'm not going to write what I want for the next year. For once, I'll let the future unveil itself and just keep walking and enjoying the present.
What I'll do is a tiny retrospective of this year that is coming to an end.
I left Tuenti after almost 4 years, eager for different challenges. Time has proven that I made the right decision regarding what I was looking for, but also that Tuenti employees are great and I still sometimes miss "our old times". Thankfully, beers now and then fill that gap.
I got into yet another huge redesign at Minijuegos that took months of crunch time, but also involved research and experiments that I enjoyed a lot. I knew what I was getting into, and I'm quite proud of what we built with such a small team, but it was still exhausting.
I've given 4 new talks this year, ranging from sneak previews of the Minijuegos redesign to how a game developer can use the APIs available on the website, or how the Avatars System was built.
I'm going at a slow but steady pace with my university studies. Being a personal goal and not an imposition makes me approach it differently: I set my own goals and try not to get stressed.
I'm going to step a bit out of my comfort zone by switching jobs again. On the 27th of December I'll take my vacation and finish my contract with Minijuegos, and in mid-January I'll start a new adventure, still in Madrid but with a different toolset than these last years' PHP.
It feels really strange, switching from one job to another twice in a year... But anyway, I'm quite happy about the new opportunity, so, as I said at the start of this post, let's focus on the "here and now".
Nov 25 2013
Lately I've started to learn a bit about how Node.js works, so after some fights I've finally been able to get a decent development setup without having to rely on a remote *NIX machine running node.
I'm assuming that you already have Visual Studio 2013 installed. I have the Premium Edition; I don't know if everything will work with the free Express Edition (especially the Node.js Tools for VS).
These are the steps to set up the most basic stuff needed:
1) Install node.js
I used the x64 MSI installer without any issue.
2) Install Python 2.7.x
It seems that 3.x is not fully supported, so I installed 2.7.6 (64-bit).
Remember to add its folder (C:\Python27 by default) to the PATH environment variable; the installer didn't do it on my machine and node-gyp wasn't working because of that.
3) Install node-gyp
Required for modules that need to be built from their sources. It uses Python and, under Windows, the Visual C++ build tools to produce the .node files.
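A global install from a Node command prompt should be enough: "npm install -g node-gyp".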
4) Install Node.js Tools for Visual Studio
Really useful: it not only provides project templates for node apps and websites, but also allows cool things like local or remote debugging and IntelliSense for all the code.
These VS tools are in alpha, but the main problem you will encounter is that npm will fail a lot, as many modules have only been tested under Mac/Linux and use .sh scripts that, of course, won't work under Windows.
5) Install nodemon
Avoids having to manually restart the node server after every change; I'm lazy, so for me it's a must-have.
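As with node-gyp, a global "npm install -g nodemon" should do the trick, and then you launch your app with "nodemon app.js" (or whatever your entry script is called) instead of "node app.js".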
Installing modules is the only painful part, but going to a Node command prompt, manually "converting" the .sh commands to Windows ones and running "node-gyp configure build install" usually either works or surfaces the required dependencies that you have to fetch manually first. It's painful, but blame the module developers: node and npm now seem to support Windows properly, but only some up-to-date modules take this OS into account.
Oh, and if after building node-gyp fails because it can't find the .node "binary", check the \build subfolder: VC++ probably created a \build\release subfolder and put it there.