Declarative Language for SAX Handler Definition

St. Petersburg State University of Information Technologies, Mechanics and Optics

In this paper a declarative language for SAX handler definition is proposed. This language makes it possible to describe complex XML parsing algorithms in a simple manner. An algorithm is introduced for automatic transformation of such handler descriptions into finite state machines, and then into source code. This approach reduces the complexity of SAX handler development by eliminating the greater part of the error-prone manual work.

XML is widely used for a variety of purposes, from storing program configuration to transmitting data packets over the Internet. More and more applications require XML support, and programmers often have to develop code that extracts data from XML documents. Some XML documents are very large, up to hundreds of gigabytes. For example, a complete Wikipedia dump including all articles with their change history takes 148 GB (bzip2-compressed). How can one parse such a document and extract useful information from it? Well-known approaches to XML parsing such as the Document Object Model (DOM) and the Java Architecture for XML Binding (JAXB) need to load the whole document into RAM, which is unacceptable at this size. The only feasible approach for documents like the Wikipedia dump is the Simple API for XML (SAX), because a SAX parser uses computer resources very efficiently and passes XML content to the rest of the application in small portions. The major drawback of SAX is the complexity of crafting SAX handlers manually. This paper describes an approach which significantly simplifies the development of SAX handlers.

A declarative language for XML handler definition is introduced. A handler definition in this special language is automatically translated into a finite state machine, and then into source code in any programming language.

SAX is a low-level XML parsing technique. A SAX parser sequentially reads the input document and notifies a SAX handler about every start and end tag, as well as about character data between tags.

SAX parser implementations exist for many programming languages, either as part of the standard library or as a third-party library. A SAX handler, however, must be written by hand for each document type that needs to be processed. Crafting SAX handlers may be a very hard task when the document structure is complex. A handler class in Java extends the class DefaultHandler and typically overrides the following three methods with hand-written code:

void startElement(String uri, String localName, String qName, Attributes attrs)
void endElement(String uri, String localName, String qName)
void characters(char[] ch, int start, int length)

The parser calls the handler's startElement method when it encounters a start tag, endElement for an end tag, and characters for character data between tags. For other languages the SAX handler structure is essentially the same.

It is important to understand the reasons why writing SAX handlers is so complex, and to eliminate those reasons. A SAX handler receives notifications from the SAX parser and translates those notifications into commands or data structures that can be understood by the rest of the application. It has to track the current parser position within the document and, according to that position, handle notifications differently. This task can be accomplished with an automata-based approach.
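To make the position-tracking problem concrete, here is a minimal hand-written sketch of a SAX handler whose parser position is kept as the state of a small automaton. It is an illustration only, not the code generated by the proposed language. The class name `TitleHandler`, the helper `parse`, and the element names `feed`, `page` and `title` are assumptions introduced for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of every <title> found inside a <page>.
// A <title> outside of any <page> is ignored, which is why the position matters.
class TitleHandler extends DefaultHandler {
    // Parser position, modeled as the states of a small automaton.
    private enum State { OUTSIDE, IN_PAGE, IN_TITLE }

    private State state = State.OUTSIDE;
    private final StringBuilder text = new StringBuilder();
    private final List<String> titles = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // The default factory is not namespace-aware, so the tag name is in qName.
        if (state == State.OUTSIDE && qName.equals("page")) {
            state = State.IN_PAGE;
        } else if (state == State.IN_PAGE && qName.equals("title")) {
            state = State.IN_TITLE;
            text.setLength(0); // start collecting a new title
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Character data may arrive in several chunks, so it is accumulated.
        if (state == State.IN_TITLE) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (state == State.IN_TITLE && qName.equals("title")) {
            titles.add(text.toString());
            state = State.IN_PAGE;
        } else if (state == State.IN_PAGE && qName.equals("page")) {
            state = State.OUTSIDE;
        }
    }

    // Convenience helper (an assumption of this sketch): parse a string, return the titles.
    static List<String> parse(String xml) throws Exception {
        TitleHandler handler = new TitleHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<feed><title>ignored</title>"
                + "<page><title>A</title></page>"
                + "<page><title>B</title></page></feed>";
        System.out.println(parse(xml)); // prints [A, B]
    }
}
```

Even for this tiny document type, each of the three callbacks has to branch on the current state. With deeper nesting and more element types the number of state and tag combinations grows quickly, and this is the manual work that a generated finite state machine is meant to take over.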