Tuesday, January 17, 2017

Bug Tale

Yesterday, towards the end of the day, one of my staff came to me with a bug in cSuneido. Actually it was the second time he'd brought it to me. The first time I sent him away with a request for more details. This time he brought me a printed screen shot of the error. Perhaps that seemed more tangible.

Sometimes I get the impression that my staff gets a certain amount of secret enjoyment out of pointing out my bugs. Some of that's probably just my imagination and frustration. But even if it's true, it's understandable. They have to put up with me critiquing their code all the time.

I wasn't happy about it. It had been a busy day with the usual parade of interruptions. I didn't feel like I'd accomplished anything, and now this steaming turd had been deposited on my desk. I started to work on the bug, although I was pretty sure there wouldn't be enough time left in the day to get anywhere on it. The only advantage of starting late was that the interruptions had died out now that most people had left for the day.

On the positive side, the bug was repeatable. On the negative side it wasn't simple to reproduce - you had to run client-server, open our application, and then run a bunch of queries for over a minute. After thinking I'd fixed it several times I realized that it also only happened the first time you did this after starting the client. Subsequent runs worked fine.

The error itself was an access violation. The more I work in "safe" languages like Java or JavaScript or Go, the more I hate unsafe languages like C++. An access violation could be anything - a dangling pointer, a garbage collection issue, an uninitialized variable, an invalidated reference ...

On top of this, the error occurred inside a background fiber. Even better, it didn't occur when I ran the debug version of cSuneido.

As I expected, I didn't get anywhere before I headed home for the day, pissed off.

After supper I debated whether to work on it some more. If I could make a little progress, even just get a clue or narrow it down, then I'd end the day feeling a lot better. On the other hand, if I just ended up banging my head on the wall, I'd probably feel even worse.

I took the gamble, of course. Programmers are nothing if not eternal optimists. They have to be. But I hedged my bet by not "getting serious" and firing up the big iMac. I just sat in the living room with my laptop. That way if I failed I could tell myself it was because I'd just been poking around.

I didn't find the bug, but I did narrow it down enough that I felt I was hot on the trail. I could recreate the problem with a much simpler test case, and I'd found that I could recreate it in the debugger as long as I used the release build. It's harder to debug in the optimized version but being able to use the debugger at all was a big help.

It turned out the only significance of the queries running for over a minute was that during that minute several timer tasks got queued and ran concurrently when the queries ended. I could get the same result by just starting two fibers "at the same time".

Thankfully, the next day I was working at home and could focus on the problem. It was quite puzzling at first. I could see (in the debugger) exactly where the problem was, but the code looked correct, and almost all the time it worked correctly. I even resorted to looking at the assembly language and register contents, something I haven't done for a long time.

Stepping through the code I found there was a fiber context switch in the middle of the problem lines. And from looking at the assembler it was pretty obvious the compiler was caching the results of some common subexpressions, which it probably wasn't doing in the debug version. But I couldn't see how that caused a problem.

With fibers being cooperative and not pre-emptive, you don't really need to worry about concurrency issues. But this turned out to be a concurrency issue after all.

The problem lines were:

tls().thedbms = dbms_remote_async(server_ip);
tls().thedbms->auth(tok); 


tls() was getting evaluated and cached. But if dbms_remote_async "blocked" waiting to connect, then another fiber would run, and if that other fiber created a new fiber, and this caused the fibers vector to grow (reallocate), then the cached value of tls() would be invalid, causing the access violation.

Sure enough, if I called reserve on the vector to pre-grow it, then the problem went away.

It only happened the first time because after that the vector wouldn't need to grow and the tls() reference would stay valid.

I was grateful that the problem was so repeatable. If this had been real threads it would have been much more erratic. One of the advantages of fibers is that they are deterministic.

One local fix was to rewrite it as:

auto dbms = dbms_remote_async(server_ip);
tls().thedbms = dbms;
dbms->auth(tok); 

But where else in the code did I have this potential problem? And what would stop me from reintroducing the problem in future code.

My next thought was that I needed to tell the compiler that tls() was "volatile", i.e. it could change concurrently. But that wasn't really the problem. Even in single threaded code inserting into a vector invalidates any references, that's part of its contract.

One option was to use Windows fiber local storage. This didn't exist back when I rolled my own.

Another option was to dynamically allocate the tls structure so it didn't reside inside the vector.

However, there could potentially be other reference into the vector. I'd had problems with this in the past and "solved" them by using indexes into the vector rather than pointers. But it was still something I had to continually be on the lookout for.

Instead, I decided to switch from a vector to a simple fixed size static array. cSuneido isn't designed for huge numbers of fibers anyway. References into a fixed size static array were safe, nothing can invalidate them.

Problem solved (fingers crossed) and my mood has distinctly improved :-)

If you're interested in the nitty gritty, the change is on Github

Tuesday, January 10, 2017

Windows Overlapped Socket IO with Completion Routines

Up till now, cSuneido has used WSAAsyncSelect to do non-blocking socket IO. But WSAAsyncSelect is deprecated and it's not the nicest approach anyway. cSuneido needs non-blocking socket IO for background fibers, the main fiber uses synchronous blocking IO. (Although that means the main fiber will block background fibers.) Note: Windows fibers are single threaded, cooperative multi-tasking, coroutines. The advantage of fibers is that because they are single threaded and you control the context switches, you don't have the concurrency issues you would with "real" preemptive threads.

I thought that the WSAAsyncSelect code was the source of some failures we were seeing so I decided to rewrite it. My first rewrite used a polling approach. I know that's not scalable, but cSuneido doesn't do a lot of background processing so I figured it would be ok. Basically, I put the sockets in non-blocking mode, and whenever an operation returned WSAWOULDBLOCK the fiber would give up the rest of its time slice (e.g. 50ms) This was quite simple to implement and seemed to work fine.

However, I soon found it was too slow for more than a few requests. For example, one background task was doing roughly 400 requests. 400 * 50 ms is 20 seconds - ouch!

Back to the drawing board. One option was to use WSAEventSelect, but it looked complicated and I wasn't quite sure how to fit it in with the GUI event message loop.

Then I saw that WSARecv and WSASend allowed completion routines, a bit like XMLHttpRequest or Node's non-blocking IO. This seemed like a simpler approach. The fiber could block (yielding to other fibers) and the completion routine could unblock it.

At first I thought I had to use WSASocket and specify overlapped, but it turned out that the regular socket function sets overlapped mode by default. That's ok because it has no effect unless you use WSARecv or WSASend in overlapped mode.

Sending was the easy part since there was no need to block the sending fiber. It could just "fire and forget". One question was whether it would always do the full transfer or whether it might just do a partial transfer and require calling WSASend again (from the completion routine) to finish the transfer. I couldn't find a definitive answer for this. I found several people saying that in practice, unless there is a major issue (like running out of memory), it will always do the full transfer. Currently I just have an assert to confirm this.

Receiving is trickier. You may need to block until the data is available. And the completion routine may get called for partial data in which case you need to call WSARecv again for the remainder. (Although this complicates the code, it's actually a good thing since it allows you to request larger amounts of data and receive it as it arrives.)

WSASend and WSARecv can succeed immediately. However, the completion routine will still be called later. And for WSARecv at least, "success" may only be a partial transfer, in which case you still need to block waiting for the rest.

One complication to this style of overlapped IO is that completion routines are only called when you're in an "alertable" state. There are only a handful of functions that are alertable. I used MsgWaitForMultipleObjectsEx in the message loop, and SleepEx with a zero delay in a few other places. Note: although the MSDN documentation is unclear, you must specify MWMO_AWAITABLE for MsgWaitForMultipleObjectsEx to be alertable. (And it has to be the Ex version.)

Each overlapped WSASend or WSARecv is given an WSAOVERLAPPED structure and this structure must stay valid until the completion routine is called. I ran into problems because in some cases the completion routine wasn't getting called until after the socket had been closed, at which point I'd free'd the WSAOVERLAPPED structure. I got around this be calling SleepEx with a zero delay so the completion routines would run.

When I looked at some debugging tracing I noticed that it seldom blocked for very long. So I added a 1ms SleepEx before blocking to see if the data would arrive, in which case it wouldn't need to block and incur a context switch. This eliminated some blocking, but sometimes it didn't seem to work. I realized it was probably because the sleep was getting ended by an unrelated completion routine (e.g. from the preceding write). So I added a loop to ensure it was waiting at least a millisecond and that fixed it. Of course, I'm testing with the client and server on the same machine so the latency is very low. Across the network it will be slower and will still need to block sometimes.

Although the code wasn't that complicated, it took me a while to get it working properly (i.e. fast). As always, the devil is in the details. But the end result looks good. Background socket IO now runs about 30% faster than the old WSAAsyncSelect version, and almost as fast as the foreground synchronous blocking IO.

Sunday, January 01, 2017

Java ByteBuffer Gotcha

I had a weird bug in some Java code in jSuneido that took me a while to figure out. I briefly thought I'd found a bug in the ByteBuffer implementation, although I realized that was a low probability. (Especially given the confusing nature of ByteBuffer.) In the end it makes sense, although it perhaps could be documented better.

Here's the scenario - you slice or duplicate a ByteBuffer. This makes a new ByteBuffer instance that shares its data with the original. Then you compact the original buffer. Note - this does not modify the data that you are interested in. However, it does move it.

ByteBuffer buf = ByteBuffer.allocate(8192);
for (int i = 0; i <= 300; ++i)
    buf.put((byte) i);
buf.flip();
for (int i = 0; i < 12; ++i)
    buf.get();
ByteBuffer slice = buf.duplicate();
System.out.println(slice.get(0));
buf.compact();
System.out.println(slice.get(0));

This will print 0 and then 12, i.e. the slice has changed, even though you didn't explicitly modify the buffer.

In retrospect it's obvious - compact does alter the buffer, which is shared with the slice. So the contents of the slice "changes".

I'd be tempted to add a warning to the documentation for compact that it will "invalidate" any existing slices or duplicates, the same way that C++ documentation points out which operations invalidate iterators. But maybe it should be obvious.

I'm mostly posting this because my searches for the problem didn't find anything. Perhaps this will help someone in the future find the problem quicker.

Thursday, December 01, 2016

Mobile Git

I've used the Textastic source code editor iOS app on my phone and tablet for quite a while. I don't do a lot of heavy editing with it, but it's handy when I'm thinking about code and want to look at how something is implemented. (Textastic is also available on the Mac, but I've never tried that version. I have plenty of editors there, and I prefer ones that are also available on Windows.)

One of the limitations was that it wouldn't pull directly from Git. It does pull from Dropbox though, so I ended up keeping a copy of my source code there. But that had a few drawbacks. First, I had to remember to update the Dropbox copy of the code. Second to update the Textastic version I had to download the entire source, which could be slow depending on the connection.

Recently I discovered the Working Copy Git iOS app which lets you clone Git repos to your phone or tablet. You can even commit changes, although I generally wouldn't want to do that when I couldn't test the changes. It would be ok for documentation. It has its own editor, but even better it uses new iOS features to let you access the files from other editors, like Textastic.

Now, not only do I have easily updated offline access to my source code, I have offline access to my Git history.

One minor disappointment was that neither of these apps has syntax highlighting for TypeScript. I was a bit surprised because Typescript seems as popular as some of the other languages they include. Textastic does support Textmate bundles so in theory you could add it that way, but it didn't sound easy so I haven't bothered yet. It would be easier (for me!) if they just included it out of the box.

Unfortunately for Android users, both of these apps appear to be iOS only. If you know of good Android alternatives, let us know in the comments.

Monday, September 19, 2016

Progress on suneido.js

A project like suneido.js goes through distinct stages.

The first stage is just thinking and researching the platform. (In this case JavaScript, ES6/2015, TypeScript, node.js etc.) I've been doing that in the background for quite a while.

But eventually you have to start writing some actual code, and that's what I've been doing lately. This can be a frustrating stage because it's three steps forward, two steps back (and that’s on a good day). You pick an area to implement, decide on an approach, and write some code. That often goes smoothly. But then you start on the next area and you realize that the approach you picked for the first area won't work for the next. So you come up with an approach that works for both areas and go back and rewrite the existing code. Rinse and repeat.

Progress seems painfully slow during this stage. Important advances are being made, it’s just that it’s in terms of design decisions, not lines of code. (On the positive side, the Suneido language is, in many ways, quite similar to JavaScript, so the impedance mismatch is much lower than with C++ or Java. Meaning I can map many features quite directly from one to the other.)

There's also learning involved since I'm on a new platform, with a new language and tools. I went through the same thing with jSuneido since I'd never worked in Java before I started that project. It's one thing to read books about languages and platforms. It's a whole 'nother story using them to implement a large project. (I have to say, ES6/2015 features have been a big plus on this project, as has TypeScript.)

Eventually the approach firms up and all of a sudden you can make rapid progress in the code. It almost becomes mechanical, simply translating from jSuneido or cSuneido. Of course, issues still arise as you encounter different corners and edge cases. But mostly they are easily addressed.

It reminds me a bit of search strategies like simulated annealing where you start by making large changes all over, and as you get closer to a solution, you “cool down” and the changes are smaller and smaller as you approach a solution. Of course, it doesn’t mean you’ve found the “best” solution. But hopefully the initial larger explorative jumps covered enough of the search space that you’ve found something reasonably close to optimal.

I’m always surprised by just how many dead ends there are. Of course, when you’re programming the possibilities are infinite, so a lot of them have to be wrong. On the other hand, there is supposed to be intelligent design at work here, which you'd think would avoid so many dead ends. But many things are hard to evaluate before you actually try them. (Which brings to mind Wolfram’s idea of computational irreducibility.)

The reward for this plodding is that seemingly all of a sudden, I have a system that can actually handle real problems, not just limited toy examples. There is still lots to do, but suneido.js can now successfully run some of Suneido's standard library tests. It takes a surprising amount of infrastructure to reach this stage. Even “simple” tests tend to use a lot of language features.This seems like a big leap forward, but I know from implementing jSuneido that it’s really just another small step in a long process.

It's a bit hard to follow the project I'm afraid. The TypeScript code is on GitHub, but the actual transpiler is written in Suneido code and lives on our private version control system. And running the system requires jSuneido (which is used to parse Suneido code to an AST) and cSuneido (for the IDE), plus node.js and Typescript and other bits and pieces. I really should try to simplify or at least package it so it's easier for other people to access.

I'm glad to have reached this (positive) point, since I’m heading off travelling again, and probably won’t get too much done on this for a while.

PS. Hearing about some paragliding deaths recently I thought (not for the first time), "what if I get killed paragliding?". And my first thought was, "but who would finish suneido.js?". Ah, the mind of a true geek :-)

Friday, September 02, 2016

Old School

As I progress on suneido.js I'm accumulating a number of miscellaneous commands I need to run regularly. So far I've avoided any kind of build tool. I started writing shell scripts (OS X) and batch files (Windows) which worked but is ugly. And led to git problems with executable permissions.

So I started looking at build tools. The big one these days seems to be Gulp but so far I don't need its fancy streaming and plugins. So I searched for something simpler, and came across a comment that said something to the effect "if you're from those days, you could use make". They likely meant it in a derogatory way, but it made sense to me. I already have make on Windows and OS X and know how to use it. And it's dead simple to use it to simply run a few commands.

So now I can run "make bundle" or "make test". It's perfect for what I need. The hip kids can shake their heads all they want.

Thursday, August 25, 2016

Modules Again

Things had been running smoothly in suneido.js with the UMD approach. Until today when I went to run my code in the browser (I hadn’t done this for a little while because I’d been working on transpiling) and I got errors saying one of my modules couldn’t be found. I realized that I had started using a Node.js module in my code, which of course couldn’t be found in the browser. Crap!

I searched the web and most recommendations were to use browserify. It sounded straightforward, and it made sense to combine all my JavaScript into one file. So I installed browserify and switched my tsconfig.json back to commonjs.

But it wouldn’t work. It kept telling me it couldn’t find one of my modules. But as far as I could tell there was nothing different about that particular module or where it was imported.

I finally remembered that the flashy dependency graph feature of Alm had shown I had a circular dependency. Was that a problem with browserify? Searching the web didn’t find anything definitive, but there were some references to it being problematic.

It wasn’t hard to fix the dependency (and I’d been meaning to do it eventually anyway).

But it still didn’t work! Then I realized that the code shown in the Chrome developer tools debugger wasn’t my fixed version. Caching? A little more digging and I found you can right click on the refresh button and pick “Hard Refresh and Reset Caches”. Finally it worked! (Although I wondered if any of my other approaches would have worked if I’d done this???)

It seems like a reasonable solution, other than taking a dependency on yet another tool and adding another build step. (I do have to admit that npm makes it easy to install these tools.)