31 May 2010

XML Parsing and Indexing

One of the idea behind GFE ([dje-fe], see previous post) is to be the less intrusive and the most reactive application. A front-end shouldnt need long processes stopping the user about to play, nor any private files written on disk (especially if you want it running from a DVD).
But GFE needs to display informations about currently selected game (from a collection of 10000 items).
Such data exists and is usually available through (big) xml files... They're generated by Mame.exe, or available from the Net (google 'dat file emulator').
Hence the idea to parse like 40 Megs of xml data on the fly (at each run), and index/map the document to avoid keeping the whole file in memory.
After two weeks studying the topic, it's time for conclusions :
  1. Microsoft Xml Pull Parser is fast. Written in .Net, but *correctly* written, it's generally a good tool for your xml needs. It's a pull system, easier to manipulate than SAX, and probably faster (look at XmlTextReader for reference). It takes 750 ms to walk 40 megs of data (8800 records) on my computer.
  2. Unfortunately you can't bookmark the interesting parts of your document with it. We could rely on the line/character pair this parser is returning but then another text parser should extract text blocks from line/characters pairs. Maybe tricky to implement and we'll lose some horsepower in the process.
  3. Note that parsing xml files is easy as long as we don't want document validation or xpath support: state machine only has to handle 10-12 token types, c# is silently handling various character format.
So what ? Why not writing a dedicated indexing-xml-pull-parser ?
Some results after a dozen hour of coding, a custom pull parsing implementation:
  1. It's now as fast as MS XmlTextReader:  first version wasn't :) too much function calls is hitting very hard c# performance (8 times slower than the original !), solution is to pack the whole state machine into a single function: now it's easily indexing  50 Megs of xml per second.
  2. It's able to give file stream position information about each element start and end: during some initialization pass, an indexer stores each record key and location values in a dictionnary, for further fast access.
  3. When GFE needs a particular record info, the dictionnary is requested for the location of the fragment which is read from huge file, and loaded  into memory to be thoroughly parsed. Apparently it's fast enough for 100 random requests per second. It's 100 times enough :)
  4. there is no 4.
This 'works' and fits a particular situation for a particular embedded application: no memory stress, no alien file creation. Currently it's only limited by a minimal support of character types (it's only handling ANSI+unicode).
But definitely it looks like an achievable way to go if you feel somewhat embarassed with existing implementation.
Below is the code source of the main state machine function: contact me for more implementation details.
public TokenType Next(TokenType notifs)
{
_insideElement:
    if (_currentType < TokenType.END_ELEMENT)
    {
        for (; ; )
        {
            //attribute
            switch (_ioBuffer[_bufferIndex])
            {
                case ' ':
                case '\t':
                case '\r':
                case '\n':
                    _bufferIndex++;
                    continue;
                case '>': //end of start element
                    _bufferIndex++;
                    _depth++;
                    goto _beyondElement;
                case '/': //empty element
                    _bufferIndex++;
                    if (_bufferIndex + 1 > _dataLen)
                        PrefetchIO(1);
                    if (_ioBuffer[_bufferIndex] == '>')
                    {
                        _bufferIndex++;
                        _tokenEndOffset = _bytesParsed + _bufferIndex;
                        _currentType = TokenType.END_ELEMENT; //EMPTY ELEMENT ! it starts and ends on the same token
                        if ((notifs & TokenType.END_ELEMENT) != 0)
                            return _currentType;
                        goto _beyondElement;
                    }
                    return DoExpected(">");
                case '\0':
                    if (!FillIOBuffer())
                        return TokenType.END_OF_STREAM;
                    continue;
                default:
                    _currentType = ProcessAttribute((notifs & TokenType.ATTRIBUTE) != 0);
                    if ((notifs & TokenType.ATTRIBUTE) != 0)
                        return _currentType;
                    continue;

            }
        }
    }
//
//beyond element: data, or comment, or pi, cdata
_beyondElement:
    for (; ; )
    {
    _start:
        switch (_ioBuffer[_bufferIndex])
        {
            case ' ':
            case '\t':
            case '\r':
            case '\n':
                ++_bufferIndex;
                continue;
            case '\0':
                if (!FillIOBuffer())
                    return (_currentType = DoEndOfStream());
                continue;
            case '<':
                _tokenStartOffset = _bytesParsed + _bufferIndex;
                ++_bufferIndex;
                for (; ; )
                {
                    switch (_ioBuffer[_bufferIndex])
                    {
                        case '!': //comment, cdata, doctype
                            _bufferIndex++;
                            if (_bufferIndex + 7 > _dataLen)
                                PrefetchIO(7);
                            if ((_ioBuffer[_bufferIndex] == '-') && (_ioBuffer[_bufferIndex + 1] == '-'))
                            {
                                _bufferIndex += 2;
                                ProcessComment();
                                goto _start;
                            }
                            if ((_ioBuffer[_bufferIndex] == '[') && (_ioBuffer[_bufferIndex + 1] == 'C') && (_ioBuffer[_bufferIndex + 2] == 'D') &&
                                (_ioBuffer[_bufferIndex + 3] == 'A') && (_ioBuffer[_bufferIndex + 4] == 'T') && (_ioBuffer[_bufferIndex + 5] == 'A') &&
                                (_ioBuffer[_bufferIndex + 6] == '['))
                            {
                                _bufferIndex += 7;
                                ProcessCData();
                                goto _start;
                            }
                            if ((_ioBuffer[_bufferIndex] == 'D') && (_ioBuffer[_bufferIndex + 1] == 'O') && (_ioBuffer[_bufferIndex + 2] == 'C') &&
                                (_ioBuffer[_bufferIndex + 3] == 'T') && (_ioBuffer[_bufferIndex + 4] == 'Y') && (_ioBuffer[_bufferIndex + 5] == 'P') &&
                                (_ioBuffer[_bufferIndex + 6] == 'E'))
                            {
                                _bufferIndex += 7;
                                ProcessDocType();
                                goto _start;
                            }
                            return DoUnexpected("-");
                        case '?': //pi
                            _bufferIndex++;
                            ProcessPI();
                            goto _start;
                        case '\0':
                            //--- fill data
                            if (!FillIOBuffer())
                                return (_currentType = TokenType.END_OF_STREAM);
                            continue;
                        case '/': //end element 
                            _bufferIndex++;
                            _currentType = ProcessEndElement((notifs & TokenType.END_ELEMENT) != 0);
                            if ((notifs & TokenType.END_ELEMENT) != 0)
                                return _currentType;
                            goto _start;
                        default:  //new element
                            //(notifs & TokenType.START_ELEMENT) != 0);
                            _currentType = ProcessStartElement(true);//always in full as element name may is extremely usefull to get
                            if ((notifs & TokenType.START_ELEMENT) != 0)
                                return _currentType;
                            goto _insideElement;
                    }

                }
            default: //element data. (probably.)
                _currentType = ProcessData((notifs & TokenType.DATA) != 0);
                if ((notifs & TokenType.DATA) != 0)
                    return _currentType;
                goto _start;
        }
    }
}

11 May 2010

GFE: Game FrontEnd (emu frontend)

Hey,

I spent the last months fighting/juggling/coding with WPF. Not the easiest hobby in town, should i mention, anyway: this is GFE.

The idea behind the project is to be able to... Play. No fuss, and more generally minimal settings, to be able to list and launch a vast collection of games: emulated, freeware, flash, choice is vast.

Basically GFE is listing a directory of... things (roms, zip, pics, you tell), probably thousands of them.

Background scanners are trying to match every entry with snapshots and movies (furtherly stats and extra infos from web). And apart from primary enumeration, everything else is asynchronous for a smoother experience.

All settings are guessed from a directory structure passed as application argument or preferably from a gameinfo.xml file.

It's using low level inputs, preferably a joystick, enabling some interaction with GFE running in the background. And it's a WPF project: so all the frontend rendering is done through DirectX and accelerated hardware if available.

OK now, this is the very first version of GFE, and, well: it's usable, but probably not versatile or cute enough. I'm working on that: you can help with suggestions, money, or simple greetings.

Binary:  GFE-0.5.0.zip (45 Ko)
Technical details:
  Windows - .NET 3.5, works on XP and Seven
  Joystick (XInput compatible) or Keyboard
How to run it:
  Modify included GameInfo.xml and use it as the application parameter.
Note: when in background, both LB+RB button (Ctrl+Back) is killing launched app and bring back GFE to front. LB+Y exits GFE.

alimbourg at gmail.com for any question.

This is GFE Emu FrontEnd (yes it's all animated, and video is playing from the selected game):