Ideas for improving SP

Parser

When !internalCharsetIsDocCharset, need to check that every significant character is an SGML character.

Treat "ISO Registration Number NNN//" public ids specially. Warn if they use designating sequence inconsistently.

Pass non-declared attributes through to application.

Avoid expensive overflow test in stringToNumber when length of number is less than something guaranteed not to overflow.
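A minimal sketch of the short-length shortcut, assuming a simplified stringToNumber that takes a std::string and returns false on failure (this signature is hypothetical, not the one in the parser). Any decimal number with at most numeric_limits<unsigned long>::digits10 digits is guaranteed to fit, so only longer strings need the per-digit test.

```cpp
#include <cstddef>
#include <limits>
#include <string>

// Hypothetical stand-in for the parser's stringToNumber; not SP's actual code.
// Returns false on a non-digit or on overflow; otherwise stores the value.
bool stringToNumber(const std::string &s, unsigned long &result)
{
  // A decimal number with at most digits10 digits always fits in
  // unsigned long, so short strings can skip the overflow test.
  const std::size_t safeLength = std::numeric_limits<unsigned long>::digits10;
  const unsigned long maxValue = std::numeric_limits<unsigned long>::max();
  unsigned long n = 0;
  if (s.size() <= safeLength) {
    // Fast path: no overflow test needed.
    for (std::size_t i = 0; i < s.size(); i++) {
      if (s[i] < '0' || s[i] > '9')
        return false;
      n = n * 10 + static_cast<unsigned long>(s[i] - '0');
    }
  }
  else {
    // Slow path: test before every multiply-and-add.
    for (std::size_t i = 0; i < s.size(); i++) {
      if (s[i] < '0' || s[i] > '9')
        return false;
      const unsigned long digit = static_cast<unsigned long>(s[i] - '0');
      if (n > (maxValue - digit) / 10)
        return false;  // n * 10 + digit would overflow
      n = n * 10 + digit;
    }
  }
  result = n;
  return true;
}
```

Long strings with many leading zeros still take the slow path; that only costs time, not correctness.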

Allow external character set to be complete character set description.

Maybe distinguish non-SGML characters as separate event even when internalCharsetIsDocCharset.

Support caching across multiple runs of parser in single process.

Make Dtd copiable.

?Subdoc parser needs character set for system id (should be system character set).

Recover better from non-existent documents or subdocuments.

Think about entity declarations/references in inactive LPDs.

Don't allow name groups in parameter entity references in document type specifications in start-/end-tags.

With link, don't do a pass 2 unless we replace a referenced entity (what about default entity?).

Options to warn about things that HTML disallows: marked sections in instance, explicit subsets.

Option to warn about MDCs in comments in comment declarations.

Option to warn about omitted REFC.

Check that names of added functions are valid names in concrete syntax (both characters and lengths). Also need to do upper-case substitution on them?

Recover from nested doctype declaration intelligently.

Recover from missing doctype declaration intelligently.

Could optimize parsing of attribute literals using technique similar to extendData().

attributeValueLength error should give actual length of value.

Recover better from entity reference with name group in literal.

At start of pass 2 clear everything in pass1LPDs except entity sets.

Give an error if EXPLICIT > 1 and LPDs don't chain as required by 436:5-7 and 436:18-20.

Handle quantity errors by reporting at the end of the prolog and the end of the instance any quantities that need to be increased.

Make noSuchReservedName error more helpful.

Function characters should perform their function even when markup recognition is suppressed. (I think I've handled this.)

Give a warning for notation attribute that is #CONREF.

Try to separate out Parser::compileModes().

In CompiledModelGroup have vector that gives an index for each element type that occurs in the model group. Then in each leaf content token have a vector that maps this index to a LeafContentToken *, if there is a simple transition (no and groups involved) to that element type.

MatchState::minAndDepth and MatchState::AndInfo should be separated off into an info object pointed to from MatchState; pointer would be null for elements with no AND groups.

What to do if we encounter USELINK or USEMAP declaration after DTD in prolog? Should stop prolog and start DTD. If we have SCOPE INSTANCE then if we get an unknown declaration type in prolog, don't give error, but unget token and start instance.

?Have separate version of reportNonSgml() for case where datachar is allowed.

Implement CONCUR.

AttributeDefinition constructors should have Owner<DeclaredValue> & arguments to avoid storage leaks when exceptions are thrown.

Create a list like IList but which keeps track of length. Then combine tagLevel into openElement stack, and inputLevel into inputStack.

AttributeDefinition::makeValue should return ConstResourcePointer<AttributeValue>.

Syntax member functions should use reference for result.

Have a LocationKey data structure that can be used to determine the relative order of locations in possibly different concurrent instances. Contains: offset in document instance; is it a replacement of named character reference; for each entity and numeric character reference: location in entity and index of dtd in which instance is declared.

On systems with fixed stacks, avoid unlimited stack growth: hard limits on number of SUBDOCS and GRPLVL.

With extendData and extendS don't extend more than some fixed amount (eg 1024), otherwise could overrun InputSource buffer on 16-bit system.

Have a location in ElementType saying where the first mention of the element name was. Useful for giving warnings about undefined elements.

How to detect 310:8-10?

AttributeSemantics should return const pointers rather than ResourcePointers.

Rename Parser -> ParserImpl; SgmlParser -> Parser; Syntax::isB -> Syntax::isBlank.

What mode should be used for parsing other prolog after document element?

Flag out of context data.

Provide mechanism to allow character names to be mapped onto universal character numbers.

Provide mechanism to allow specification of what characters are control characters (for the purposes of SHUNCHAR controls).

With SCOPE INSTANCE, which syntax should be used for delimiters in bracketed text entities?

Better error messages for ambiguous delimiters.

Do we need both EndLpd and ComplexLink/SimpleLink events?

What to do about 457:19-21?

Rename lpd_ to activeLpd_; allLpd_ to lpd_.

Test for validity of character numbers in syn ref charset (perhaps unnecessary, because bad numbers won't be translateable into doc charset).

Option to read bootstrap character set from entity.

In AttributeDefinitionList have a flag that is true if any checking of unspecified values in attribute list is needed (ie CURRENT, REQUIRED, non-implied ENTITY, non-implied NOTATION). In this case can avoid running over attributes in AttributeList::finish, by computing value only when user calls Attribute::value().

Construct link attributes from definition if no applicable link rule. (RAST maybe doesn't want this. Make it a separate method in LinkProcess and use in SgmlsEventHandler. Very useful with ArcEngine.)

Shouldn't have OpenElementInfo in Message. Instead use RTTI.

noSuchAttribute: include gi in message; if element is undefined, don't give error at all

noSuchAttributeToken: say what element or entity

nonExistentEntityRef should say document/link type

Distinguish errors that are totally recoverable.

Find better way to unpack entity information in entity attribute.

Entity Manager

Build document<->internal translation tables once per document not once per entity.

Avoid document<->internal translations when one is the subset of the other (or something like that).

In cases where it won't cause problems, don't translate non-SGML/unrepresentable characters when doing document<->internal translations, so that user gets better error message.

Recover better from unknown document character sets (shouldn't report them as non-SGML characters).

Maybe need to keep track of set of SGML characters as numbers in document character set.

Optimize TranslateDecoder where underlying codingSystem is identity by using simple lookup table.

Make use of charset parameter in MIME header for HTTP. Also generate AcceptCharsets line in request.

Implement .mim files (if extension of file is same as environment variable SP_MIME_EXT assume it has a MIME header).

Avoid using TranslateCodingSystem when translation is a no-op.

Make SP_CONVERT when !SP_MULTI_BYTE.

Avoid requiring that BASE sysid exist.

When FSI has only a single storage manager and that is a literal, return an InternalInputSource.

Allow user of InputSource to specify what bit combinations they want to see for RS and RE.

Have environment variable SP_INPUT_BCTF that overrides SP_BCTF for input.

Avoid using numeric character references for all characters in storage object identifier of literal storage manager in effective system identifier.

Instead of registering coding system pass CodingSystemKit that can create coding systems.

Need BCTF entry in catalog that specifies default BCTF.

Allow encodings to be externally specified (eg in a catalog) as a combination of a BCTF and a character set.

An SOEntityCatalog should consist of a Vector<ConstPtr<EntityCatalog> > which can be shared between several catalogs. This would facilitate caching.

Maybe need to be able to specify two types of catalog entry file: one used for all documents; one used for this document alone.

Allow end-tags in FSIs. Support alternative SOSs.

Character sets in the catalog need rethinking. Also character set of ParsedSystemId::Map::publicId.

Allow for HTTP proxy.

Cache catalogs.

Use Microsoft ActiveX (formerly Sweeper) DLL on Win95 or NT.

Implement DTDDECL catalog entry.

Support FILE URLs.

Perhaps don't want to do searching for catalog files (and perhaps command line files).

Provide mechanism for specifying when (if at all) base dir is searched relative to other dirs.

Provide extension to catalog format to distinguish entities declared in non-base DTDs. Perhaps precede entity name by document type name surrounded by GRPO/GRPC delimiters.

URLStorageManager should use a DescriptorManager shared with PosixStorageManager.

URLStorageManager::resolveRelative should delete "xxx/../" and "./" components. Might also be a good idea to resolve host names.
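A rough sketch of the component cleanup, written as a free function on the path part of a URL only (scheme and host are assumed to have been split off already). The name removeDotComponents is hypothetical and this is not the code in URLStorageManager; a leading ".." with nothing to cancel is kept rather than dropped.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of SP: drop "./" components and cancel
// each "xxx/../" pair in the path part of a URL.
std::string removeDotComponents(const std::string &path)
{
  const bool absolute = !path.empty() && path[0] == '/';
  std::vector<std::string> kept;
  bool trailingSlash = false;
  std::size_t start = absolute ? 1 : 0;
  while (start <= path.size()) {
    std::size_t slash = path.find('/', start);
    if (slash == std::string::npos)
      slash = path.size();
    const bool isLast = (slash == path.size());
    const std::string comp = path.substr(start, slash - start);
    if (comp == ".")
      trailingSlash = isLast;        // "." names the directory itself
    else if (comp == ".." && !kept.empty() && kept.back() != "..") {
      kept.pop_back();               // cancel the preceding "xxx/"
      trailingSlash = isLast;
    }
    else if (comp.empty() && isLast)
      trailingSlash = true;          // path ended with '/'
    else {
      kept.push_back(comp);          // ordinary (or uncancellable "..") component
      trailingSlash = false;
    }
    start = slash + 1;
  }
  std::string result = absolute ? "/" : "";
  for (std::size_t i = 0; i < kept.size(); i++) {
    if (i > 0)
      result += '/';
    result += kept[i];
  }
  if (trailingSlash && !kept.empty())
    result += '/';
  return result;
}
```

For example, "/a/b/../c/./d.sgm" comes out as "/a/c/d.sgm", and "../x" is left alone.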

Implement JIS encoding system (what should be done with half-width yen and overbar in JIS-Roman? translate to Unicode).

ExternalInfoImpl::convertOffset: when the position is the character past the last character and the last character was a newline, line number should be number of lines + 1.
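To make the intended line numbering concrete, here is a small sketch that counts newlines before a position. The function lineNumberAt is hypothetical and works on a plain string; the real convertOffset works from stored line-start information, not by rescanning the text.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not SP's ExternalInfoImpl::convertOffset.
// Returns the 1-based line number of position offset, where offset may be
// text.size() (the position just past the last character).
unsigned long lineNumberAt(const std::string &text, std::size_t offset)
{
  unsigned long line = 1;
  // Every newline strictly before the position starts a new line.
  for (std::size_t i = 0; i < offset && i < text.size(); i++)
    if (text[i] == '\n')
      line++;
  return line;
}
```

With the two-line text "a\nb\n", position 4 (just past the final newline) is reported as line 3, that is, number of lines + 1.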

Try harder to rewind in StdioStorageObject.

Generic

Provide mechanism to access data entities using generated system id.

Support IMPLICIT/SIMPLE LINK.

Character set information.

Need to know space character that separates token. Alternatively provide broken down view of tokens.

Need to know IDREF (and other declared values)?

nsgmls

Problem with "\#n;" escape sequence is that it might get used other than in data. Probably should get rid of this feature, and give a warning when there's an unencodable character.

Internal

Make sure all files use #pragma i/i.

Get rid of assumption that Vector<T>::size_type, String<T>::size_type is size_t.

Maybe align Owner with auto_ptr.

Get rid of uses of string as identifier.

?Maybe support non-const copy constructors for NCVector/Owner.

Get rid of asEntityOrigin (as far as possible). Make InputSourceOrigin::defLocation virtual on origin. Avoid excessive use of asInputSourceOrigin.

Hash should define Hash(String<unsigned char>), Hash(String<unsigned short>) etc.

Invert sense of SP_HAVE_BOOL define.

Get rid of OutputCharStream::open. Instead have OutputCharStream::setEncoding. (Perhaps make a friend so we can use ostream if we're not interested in encodings.) Allow use of ostream instead of OutputCharStream. Change ParserToolkit::errorStream_'s coding system when we change the coding system.

Support 32-bit Char. Need to fix XcharMap and SubstTable. Detemplatize SubstTable. Then support UTF-16.

Have a common version of Ptr for things that have a virtual destructor.

Have a common version of Owner for all things that have a virtual destructor.

Inheritance in AttributeSemantics unnecessary.

Rename ISet -> RangeSet.

ISet and RangeMap should use binary search.
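A sketch of the binary search lookup, assuming the ranges are kept sorted and non-overlapping. The Range struct and rangeSetContains are simplified, hypothetical stand-ins; ISet and RangeMap have their own interfaces.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical simplified range; both ends are inclusive.
struct Range {
  unsigned min;
  unsigned max;
};

// ranges must be sorted by min and must not overlap.
bool rangeSetContains(const std::vector<Range> &ranges, unsigned c)
{
  // Binary search for the first range whose max is not less than c.
  std::vector<Range>::const_iterator it =
    std::lower_bound(ranges.begin(), ranges.end(), c,
                     [](const Range &r, unsigned value) {
                       return r.max < value;
                     });
  // c is a member only if that range also starts at or before c.
  return it != ranges.end() && it->min <= c;
}
```

Lookup is then logarithmic in the number of ranges instead of linear; insertion still has to keep the vector sorted.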

Better hash function for wide characters.

OutputCharStream should canonically use RS/RE and translate to system newline char with raw option that prevents this.

Avoid having Entity.h depend on ParserState, perhaps by double dispatching.

Add uses of explicit keyword.

When generating message.h file, if we don't have .cxx file and namespaces are supported, use anonymous namespace.

Application framework

Only use static programName for outOfMemory message.

Need to use AppChar *const * not AppChar ** in CmdLineApp.

When reporting message with MessageEventHandler need to be able to update error count.

Option argument names need to be internationalized.

Support response files for DOS.

Sort options in usage message.

StringMessageArg should be associated with a character set (in particular, need to distinguish parser character sets from StorageManager character sets).

Should translate StringMessageArg from document character set to system character set. Have MessageReporter::setDocumentCharacter function.

In MessageReporter, maybe distinguish messages coming from the parser.

Don't ever give a non-existent file as a location in an error message.

Text of messages should be able to specify that an open quote or close quote should be inserted at a particular point.

When outputting a StringMessageArg translate \r to \n.

Make sure wild cards work in VC++ and MS-DOS.

Win32

Remove path and extension from program name in error messages.
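One possible shape for this, as a sketch only: baseProgramName is a hypothetical helper operating on std::string, not on the AppChar strings the application framework uses. It treats backslash, slash and the drive colon as path separators.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of SP: strip directory (or drive) prefix
// and the final extension from a program name such as argv[0].
std::string baseProgramName(const std::string &argv0)
{
  // Everything up to the last '\\', '/' or ':' is path.
  std::size_t sep = argv0.find_last_of("\\/:");
  std::string name = argv0.substr(sep == std::string::npos ? 0 : sep + 1);
  // Drop the extension, but keep names that consist only of a leading dot part.
  std::size_t dot = name.rfind('.');
  if (dot != std::string::npos && dot > 0)
    name.erase(dot);
  return name;
}
```

So "C:\SP\BIN\NSGMLS.EXE" would be reported as "NSGMLS".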

Compilers can typically eliminate unused templates. Reengineer Vector to reduce code size with such compilers.

Store messages in resources; requires numeric tags for messages.

Should automatically register all available code pages.

Make use of IsTextUnicode() API.

Have StorageManager that uses Win32 API directly. Would avoid limits on number of open files. Also use flag that says file is being accessed sequentially.

Allow DTDs to be compiled into binary by having storage manager that uses resource ids.

Architecture engine

Should give an error with -A if the specified arch does not exist.

Interpret APPINFO parameter, and automatically enable architectural processing based on this.

Handle derived architecture support attributes.

When doing architectural processing in link type, not possible to have notation declaration, so need some other way to specify public identifier for architecture.

Allow DOCTYPE to be declared inline (as with CONCUR or EXPLICIT LINK).

Grok conventional comments.

Make work automatically with EventHandlers that process subdoc. Make references to subdocs architectural.

Support different SGML declaration for meta-DTD.

Maybe should map internal sdata/cdata entities to copies in meta-DTD.

Perhaps when getting open element info should indicate that gis are architectural.

Think about references to SDATA entities in default values in meta-DTD.

Add default entity from real DTD to meta-DTD.

Tokenize ArcForm attribute appropriately.

Make special case for parsing DTD when entity can't be accessed.

Try to provide extension that would allow architecture elements be asynchronous with actual elements? This would provide CONCUR functionality.

sgmlnorm

Avoid bogus newline from invalid empty document.

Avoid always escaping >.

Option to say whether to use character references for 8-bit characters.

Option to output implied attributes.

Option to output all non-implied attributes.

Option to omit attribute name with name tokens.

Protect against recognition of short references.

Option to preserve CDATA entity references.

Option to output general entity declarations in DTD subset (but what about data attributes)?

spam

Option to normalize names.

Add comments round expanded entities to prevent false delimiter recognition.

Add newline at the end if last thing was omitted tag.

Option to warn about changes in internal entities when not expanding.

Documentation

Error message format.

<catalog> FSI tag.

James Clark
jjc@jclark.com