Documentation generator - project proposal

Posted: Wed Oct 21, 2009 11:55 am
by alex.barylski
I have this thing about keeping methods really terse: rarely over 25 SLOC, and on average they probably fall around 10-15 lines. I find this keeps code self-describing, despite the increased indirection. For this reason, I absolutely despise writing docs inline using phpDocumentor; comments written for this purpose, I find, convolute source code which for the most part is already self-describing.

Just following method calls will quickly give you an understanding of the interactions of a sub-system.

I do comment, however I keep the comments very brief -- no more than a line.

1. Comment caveats
2. Comment side effects

The latter I try to reduce/minimize by keeping methods/objects as stateless as possible -- a practice I picked up from reading up on functional programming.

There are still situations where a few comments are needed though, just to help clarify a section of 3-4 lines which do something complex but do not warrant being refactored into a method.

Code:

 
// NOTE:
// 1. Prefix with request scheme and replace domain placeholder with request domain and trim trailing slash (if any)
// 2. Extract all named markers/placeholders
$format = trim(sprintf('%s://%s', REQUEST_SCHEME, str_replace('{*}', REQUEST_DOMAIN, $format)), '/');
preg_match_all('/\{([a-z]+)\}/', $format, $markers);
Without the comments I would find the two lines confusing, and although nothing is likely to change in them, having that high-level commentary quickly tells me what the lines are doing. Entirely subjective, I know, but documenting using comments is an art :P

Those are literally the only comments you will see in my code, no phpDocumentor, etc.

If I need a listing of classes, methods, etc I run Doxygen which extracts that info from the source code.

---

Here is what I am thinking (and proposing for a new project).

Ideally I would love to keep documentation external to the source (despite decades of others preaching that this is bad practice):

1. I find that coming back to code at a later time and documenting it then produces much more elaborate and accurate docs. Writing comments while trying to solve a problem tends to result in quick descriptions which make sense at the time of writing but don't translate well when read by a newcomer to a codebase. Whereas documenting code you have to step through to understand forces you to comment from the perspective of a naive developer.

What I am thinking is having the documentation written in a markup such as Markdown, which can be easily converted into HTML, maybe PDF, etc.

Each class would map to its own folder and each method to its own Markdown file. The file system structure would resemble a table of contents, so full PDF documentation can be easily built, or perhaps HTML output.

Having external documentation raises one interesting problem: synchronization between docs and interfaces. (Implementation docs I avoid like the plague. My reasoning is that if you're tinkering with the implementation, you will need to step through the code line by line and figure it out manually anyway, so a gentle sprinkling of line comments should suffice with self-describing code.)

What I am thinking is to implement a parser to scan the source tree of a project, and particularly its methods. If a method's signature or implementation changes, it would raise a red flag so the next person to visit the doc manager would be notified and could synchronize the docs with the new implementation. This solution raises a couple of interesting problems:

1. Whitespace. Changes made to whitespace should not raise flags
2. Variable naming should not raise red flags

Only interface, structural changes, re-factorings, etc should possibly notify the documentation writer that the docs do not accurately reflect the new implementation.

One might re-factor a method (factoring out a section of code) and introduce another private method, but not change the purpose of the method at all, in which case it would be up to the doc writer to determine whether changes were made.

The important thing is devising a system so that changes made to any method (outside of white space, variable names changes) raise a red flag, perhaps by comparing method interface-implementation to a previous MD5 (with whitespace removed and all variables renamed to a single name).
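A minimal sketch of that comparison, assuming PHP's tokenizer extension is available. The function name normalizedHash() is made up for illustration, and renaming every variable to $temp is the simplification suggested above, so it will miss a change that only swaps one variable for another.

```php
<?php
// Hypothetical helper: hash PHP source so that whitespace, comments and
// variable names do not influence the result.
function normalizedHash(string $source): string
{
    $normalized = '';
    foreach (token_get_all($source) as $token) {
        if (!is_array($token)) {
            // Single-character tokens such as "{", ";" or "+".
            $normalized .= $token;
            continue;
        }
        [$id, $text] = $token;
        if ($id === T_WHITESPACE || $id === T_COMMENT || $id === T_DOC_COMMENT) {
            continue; // formatting and comments never raise a flag
        }
        if ($id === T_OPEN_TAG) {
            $text = trim($text); // the open tag swallows one trailing whitespace character
        }
        // Every variable collapses to one common name; a separator keeps tokens apart.
        $normalized .= ($id === T_VARIABLE ? '$temp' : $text) . ' ';
    }
    return md5($normalized);
}
```

The stored hash for each method would be compared against a fresh one on every scan; only a mismatch flags the docs as possibly stale. The source passed in must start with an opening <?php tag, otherwise the tokenizer treats it as inline HTML.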

I find it so time consuming writing docs inside phpDoc blocks without any WYSIWYG or similar editor, not to mention the convoluted feeling I get when sifting through dozens and dozens of lines of comments -- which many times are equally unsynchronized with the intent of the source code.

I'm thinking of building the system as a series of CLI scripts (no framework, no MVC, just input-process-output) and later building a web-based interface, or possibly invoking a master CLI script through Eclipse to generate HTML or PDF documentation.

What do you think of this idea? Would you be interested in possibly collaborating a few hours a week to begin implementing something prototypical, re-factoring as we go until we have something concrete?

Cheers,
Alex

Re: Documentation generator - project proposal

Posted: Wed Oct 21, 2009 5:15 pm
by Christopher
I don't find inline documentation very useful except in cases where you are really doing something that is not obvious. And even then the comment should be more about why it is not obvious than how it is not obvious.

I do like the idea of a documentation system that keeps things separate from the source. I would like a system that does the tedious organizational work for you by guiding you through creating the documentation. Imagine a system that you just point at your source tree and it builds a list of files and starts asking you to document them. You tell it what/how you want to document and it gathers the information from you. You just fill out forms. Once you had everything you wanted documented (you could skip files), it would only ask again if the source file was newer than the documentation -- indicating that there may be changes that need to be documented.
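The "source newer than documentation" test could be as small as a modification-time comparison. This is only a sketch: needsDocumenting() is a hypothetical name, and it assumes one documentation file per source file.

```php
<?php
// Hypothetical helper: true when the doc file is missing or older than its source.
function needsDocumenting(string $sourceFile, string $docFile): bool
{
    if (!is_file($docFile)) {
        return true; // never documented
    }
    return filemtime($sourceFile) > filemtime($docFile);
}
```

Timestamps are coarse: a whitespace-only edit to the source would still prompt for a review, so this is the cheap first pass rather than a real change detector.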

It could also associate documentation with directories as well (if you want) as files so you can document components.

I usually have a limited amount of time for documentation, say a half an hour. I want something that will remember where I left off and make it easy to create/edit documentation.

Re: Documentation generator - project proposal

Posted: Wed Oct 21, 2009 6:20 pm
by alex.barylski
Imagine a system that you just point at your source tree and it builds a list of files and starts asking you to document them
Yes, exactly. I'm thinking it would scan the source tree, build a file structure with plain text files using markdown syntax, the file structure might mirror the source structure, in fact it would make more sense if classes followed a PEAR/Zend class name convention.

As the source changes or methods are added, text files are as well, etc.
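As a rough illustration of that mirroring, assuming PEAR/Zend-style class names where underscores stand for directory separators, the location of a method's Markdown file could be derived like this (docPathFor() and the docs/ root are invented for the example):

```php
<?php
// Hypothetical mapping: Zend_Db_Adapter::query -> docs/Zend/Db/Adapter/query.md
function docPathFor(string $docRoot, string $className, string $methodName): string
{
    $classDir = str_replace('_', '/', $className);
    return rtrim($docRoot, '/') . '/' . $classDir . '/' . $methodName . '.md';
}
```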

Code:

I usually have a limited amount of time for documentation, say a half an hour. I want something that will remember where I left off and make it easy to create/edit documentation
I hate writing docs...sometimes I'm in the mood...it takes a different train of thought than writing code, even clear, high-level code. I really like coming back to code a few weeks later and commenting or documenting at that time, as I have forgotten the caveats and must re-discover any gotchas, and in the process have a much clearer idea of how to describe those situations to another person.

The ideas are running wild as of now, but I would love a system that could tie all forms of documentation into a single package. Possibly generate flow charts from implementations, make maintaining specifications, timelines, etc all centralized and synchronized.

Cheers,
Alex

Re: Documentation generator - project proposal

Posted: Thu Oct 22, 2009 1:25 am
by Christopher
I think it needs to start really simple to test the idea. It needs to be able to scan a directory tree and identify: the project, directories, files, classes and methods/functions. Then it needs to be able to ask for docs for each thing and store them. If it could do that -- even just free-form text input for each (could be markdown) -- that would be a start. Once you have that you could think about how to output that content. And you could start thinking about asking for specific information (fields) for each type of thing, because the project, directories, files, classes and methods/functions each need a specific set of information.

Re: Documentation generator - project proposal

Posted: Thu Oct 22, 2009 2:07 am
by alex.barylski
Starting simple is always good :)

I was thinking or trying to decide, whether:

1. Provide external implementation docs
2. Provide external interface docs

When detecting whether the interface docs (API docs) need updating or not, is it enough to just check the signature of the method? If that changed, then obviously an update to the API docs would be required. However, the simple change of a type from an array to an object could have a huge impact on the implementation and overall focus of a method, so maybe some use of type hinting could be assumed.

I am leaning towards using reflection, which would work on a codebase that followed the PEAR or Zend file structure; however, tokenizing would be required to work universally, I believe. If we could use reflection to extract the body or implementation of a method from a source file, you could then easily tokenize that, strip whitespace and comments, rename all variable identifiers to some common name like $temp, and compare the MD5 hash of the result to a previous version.

Any change other than that would red flag the system to notify the doc writer to update the interface docs and implementation docs.
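Reflection does not hand back a method body directly, but it does report the file and line range, which is enough to cut the source out. A sketch, with methodSource() as a hypothetical helper name:

```php
<?php
// Hypothetical helper: return the source lines of one method as a single string.
function methodSource(string $className, string $methodName): string
{
    $method = new ReflectionMethod($className, $methodName);
    $lines  = file($method->getFileName());     // one array entry per source line
    $start  = $method->getStartLine();          // 1-based, inclusive
    $length = $method->getEndLine() - $start + 1;

    return implode('', array_slice($lines, $start - 1, $length));
}
```

The returned text begins at the line holding the method signature, so a signature change and a body change both show up in it. To tokenize the result it would need a <?php tag prepended first. This only works for user-defined classes, since getFileName() returns false for internal ones.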

Just looking at the php docs here: http://ca.php.net/manual/en/class.reflectionmethod.php

Using reflection would really simplify the implementation, only the last time I tried, I encountered several *cannot redeclare class* type errors from various classes having the same interface name, which is totally avoidable if you follow the Zend class to file path convention. :)

Cheers,
Alex

Re: Documentation generator - project proposal

Posted: Thu Oct 22, 2009 2:11 pm
by Christopher
I think having a distinction between API and other docs is a good idea. Should the other docs be user definable, so you could specify "Reference" or "Tutorial" or "Quick Start" external docs? It would prompt you to create/edit text for each type you specify. The default might just be "Reference" only.

Also in my list above (project, directories, files, classes and methods/functions) only classes and methods/functions need API docs, so that should be specific to them.

And I agree, using reflection would make it really simple. You can even get comment blocks that start with /** using reflection.
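For reference, a small example of pulling a /** block out through reflection. The Example class is made up; getDocComment() is the standard ReflectionMethod call and returns false when no doc block precedes the method.

```php
<?php
class Example
{
    /**
     * Adds two numbers.
     */
    public function add($a, $b)
    {
        return $a + $b;
    }

    // An ordinary comment like this one is not exposed by reflection.
    public function subtract($a, $b)
    {
        return $a - $b;
    }
}

$add      = new ReflectionMethod('Example', 'add');
$subtract = new ReflectionMethod('Example', 'subtract');

$addDoc      = $add->getDocComment();      // the whole /** ... */ block as a string
$subtractDoc = $subtract->getDocComment(); // false, because there is no /** block
```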

Re: Documentation generator - project proposal

Posted: Thu Oct 22, 2009 3:50 pm
by alex.barylski
I think having a distinction between API and other docs is a good idea. Should the other docs be user definable, so you could specify "Reference" or "Tutorial" or "Quick Start" external docs? It would prompt you to create/edit text for each type you specify. The default might just be "Reference" only.
Agreed.

I'm wondering if it would be a good idea to include tests in the documentation for a given class, as a way of showing how the code is supposed to work, or whether trivial examples are best.
And I agree, using reflection would make it really simple. You can even get comment blocks that start with /** using reflection.
So long as the classes follow a directory naming convention, reflection should work (assuming each file has only one class). Reflection is crazy powerful, would allow complete API documentation generation with minimal effort. :)

The problem as I see it goes something like:

1. Run CLI script to scan source tree
2. Check whether file is PHP extension and determine whether file contains functions or classes
3. Include the class and determine its name
4. Use reflection to reverse engineer class methods, etc

The hard part is distinguishing PHP template files from PHP source files, etc. My templates for instance, have a PHP extension, which could be renamed .php.xhtml but that would be cheating the system I think, as not every project will have templates that use that extension.

Might need to use some custom hacking to extract code within a <?php source blocks and run the tokenizer on the file to determine if any classes or functions exist, at which point, save blocks to a temp file, include it and execute reflection on the classes/functions???
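One way to tell a template from a source file without including it is to let the tokenizer look for named class, interface or function declarations. This is a sketch under that assumption; declaresCode() is a hypothetical name, and it deliberately ignores closures, anonymous classes and Foo::class constants.

```php
<?php
// Hypothetical helper: does this source declare a named class, interface or function?
function declaresCode(string $source): bool
{
    $tokens = token_get_all($source);
    $count  = count($tokens);

    for ($i = 0; $i < $count; $i++) {
        $token = $tokens[$i];
        if (!is_array($token) || !in_array($token[0], [T_CLASS, T_INTERFACE, T_FUNCTION], true)) {
            continue;
        }
        // Skip whitespace (and "&" for by-reference functions) after the keyword.
        $j = $i + 1;
        while ($j < $count) {
            $text = is_array($tokens[$j]) ? $tokens[$j][1] : $tokens[$j];
            if (trim($text) !== '' && $text !== '&') {
                break;
            }
            $j++;
        }
        // A declaration is followed by a name; closures and Foo::class are not.
        if ($j < $count && is_array($tokens[$j]) && $tokens[$j][0] === T_STRING) {
            return true;
        }
    }
    return false;
}
```

Because token_get_all() already separates inline HTML from PHP blocks, there is no need to extract the blocks by hand before asking this question.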

Namespace collisions with class names and methods/functions will be the greatest PITA when using reflection.

Extracting the PHP source blocks will be the most difficult challenge and hardest to get right (while efficient):

1. <? or <?= --- the latter we can probably ignore for generating API docs but would still be handy to extract in case we ever want that information
2. <?php or <% --- the latter can probably be ignored
3. <script>

The last one is the most difficult to extract because of the various attributes possible within the <script> tag, while we only concern ourselves with language="php".

Perhaps everyone interested should implement the best PHP block extractor they can, then we can compare notes and implementations and hopefully discover bugs, caveats, etc. Merge the implementations into one, so we at least have a solid base to start from. Using the reflection API would be trivial compared to implementing something like this.

I wonder if all these can be accomplished using regex? That would be stellar, although slower than PHP implementation and much harder for anyone to tweak or fix. So do we even bother with regex or stick with a standard scanner?

Cheers,
Alex

Re: Documentation generator - project proposal

Posted: Thu Oct 22, 2009 6:29 pm
by Christopher
PCSpectra wrote:
I think having a distinction between API and other docs is a good idea. Should the other docs be user definable, so you could specify "Reference" or "Tutorial" or "Quick Start" external docs? It would prompt you to create/edit text for each type you specify. The default might just be "Reference" only.
Agreed.
Put that in the spec! :)
PCSpectra wrote:I'm wondering if it would be a good idea to include tests in the documentation for a given class, as a way of showing how the code is supposed to work, or whether trivial examples are best.
I think the problem is finding the tests. If the test tree is a mirror of the source tree then maybe it would work. Sounds like Phase 2.
PCSpectra wrote:
And I agree, using reflection would make it really simple. You can even get comment blocks that start with /** using reflection.
So long as the classes follow a directory naming convention, reflection should work (assuming each file has only one class). Reflection is crazy powerful, would allow complete API documentation generation with minimal effort. :)
I don't think they have to follow naming rules. The program that extracts the information via reflection just needs to include the file and see what new functions and classes were defined because of the include.
PCSpectra wrote:The hard part is distinguishing PHP template files from PHP source files, etc. My templates for instance, have a PHP extension, which could be renamed .php.xhtml but that would be cheating the system I think, as not every project will have templates that use that extension.

Might need to use some custom hacking to extract code within a <?php source blocks and run the tokenizer on the file to determine if any classes or functions exist, at which point, save blocks to a temp file, include it and execute reflection on the classes/functions???

Namespace collisions with class names and methods/functions will be the greatest PITA when using reflection.

Extracting the PHP source blocks will be the most difficult challenge and hardest to get right (while efficient):

1. <? or <?= --- the latter we can probably ignore for generating API docs but would still be handy to extract in case we ever want that information
2. <?php or <% --- the latter can probably be ignored
3. <script>

The last one is the most difficult to extract because of the various attributes possible within the <script> tag, while we only concern ourselves with language="php".

Perhaps everyone interested should implement the best PHP block extractor they can, then we can compare notes and implementations and hopefully discover bugs, caveats, etc. Merge the implementations into one, so we at least have a solid base to start from. Using the reflection API would be trivial compared to implementing something like this.

I wonder if all these can be accomplished using regex? That would be stellar, although slower than PHP implementation and much harder for anyone to tweak or fix. So do we even bother with regex or stick with a standard scanner?
I think we can get all the information we need from Reflection. I don't think we need to parse at all because the PHP parser does such an excellent job at that. Nor do I think we need to extract code because all we are interested in is additional external documentation, not pulling source out.

As for templates, we could consider any file that is not made up solely of class or function definitions a template. Or we could allow you to specify the directory names of different types of files, i.e., classes, functions, templates, plain old PHP scripts such as index.php.

Re: Documentation generator - project proposal

Posted: Thu Oct 22, 2009 8:13 pm
by alex.barylski
I think we can get all the information we need from Reflection. I don't think we need to parse at all because the PHP parser does such an excellent job at that. Nor do I think we need to extract code because all we are interested in is additional external documentation, not pulling source out.
Good point. Although what happens if you include two different files but they both contain identical class names?

Something like:

Code:

Database.php = class Database
mysql/driver.php = class Driver
mssql/driver.php = class Driver
Because driver is only ever loaded once, some developers might make this assumption and only name their classes with limited identifier namespace. This is a non-issue for sources that follow a Zend convention because no two class names should be identical.

Cheers,
Alex

Re: Documentation generator - project proposal

Posted: Thu Oct 22, 2009 8:51 pm
by Christopher
PCSpectra wrote:Good point. Although what happens if you include two different files but they both contain identical class names?
Perhaps the scanner should run as a separate request called by the script that traverses the directory tree. Then it would always start with no classes included.
PCSpectra wrote:Because driver is only ever loaded once, some developers might make this assumption and only name their classes with limited identifier namespace. This is a non-issue for sources that follow a Zend convention because no two class names should be identical
I think (at least) for this first version it should be as dumb as possible. You point it at a directory, it scans it and then starts asking you questions. If it is your first time in, it asks for project configuration settings (like ...). Make the programmer tell you what the various files are. The first pass might seem overwhelming, but because you tell it what the files are--it does not have to be so smart.

Re: Documentation generator - project proposal

Posted: Thu Oct 22, 2009 9:43 pm
by alex.barylski
Perhaps the scanner should run as a separate request called by the script that traverses the directory tree. Then it would always start with no classes included.
That's what I was thinking, although...

Just so I understand, you're thinking of a master CLI script that invokes a source scanner with the filename of the file to be reflected (for lack of a better word). Once the file is reflected, the scanner returns the output (in some neutral format, JSON?), which the master CLI script would then process/parse and store in memory. Once all files were finished being reflected (how do we detect that in PHP with shell_exec or similar -- will those functions return or inform us when the last script is completed?), we would build up the final documentation as either XHTML or PDF, etc.
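On the completion question: shell_exec() blocks until the child process exits and then returns whatever it wrote to stdout, so a plain loop over the files knows it is done when the loop ends. Below is a sketch of the one-process-per-file idea, with JSON as the neutral format. The helper name classesDeclaredBy() and the inline child script are assumptions for illustration only.

```php
<?php
// Hypothetical helper: include $file in a fresh PHP process and return the
// names of the classes it declared, passed back as JSON on stdout.
function classesDeclaredBy(string $file): array
{
    // Code run by the child; $argv[1] is the file to include.
    $child = '$before = get_declared_classes();'
           . 'include $argv[1];'
           . 'echo json_encode(array_values(array_diff(get_declared_classes(), $before)));';

    $command = escapeshellarg(PHP_BINARY) . ' -r ' . escapeshellarg($child)
             . ' -- ' . escapeshellarg($file);

    // Blocks until the child exits, then returns its stdout (or null on failure).
    $output  = shell_exec($command);
    $classes = json_decode((string) $output, true);

    return is_array($classes) ? $classes : [];
}
```

Since every file gets its own process, mysql/driver.php and mssql/driver.php can both declare a class named Driver without a redeclaration error. A file that prints output when included would corrupt the JSON, in which case this sketch simply returns an empty array.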

Maybe we want to investigate the possibilities of using XSLT and transforming XML output into associated PDF, XHTML, etc??? I think that is what phpDocumentor does.

Re: Documentation generator - project proposal

Posted: Fri Oct 23, 2009 12:38 am
by Christopher
I think the first thing the generator side has to do is assemble all these chunks of text into "pages." Once it has done that, it can pass that output to a format-specific filter that creates the final output. For example, an HTML outputter might write an HTML file for every "page" and create a home page, whereas a PDF outputter might create a chapter for each "page" and create a table of contents.

Re: Documentation generator - project proposal

Posted: Tue Oct 27, 2009 9:57 pm
by alex.barylski
So arborint and I have been chewing on the idea of building a new documentation generator, one that uses reflection to simplify/remove parsing from the equation.

So far we have established the system should have at least 5 major components:

Code:

1. FileScanner
2. SourceIncluder
3. SourceReflector
4. SourceCollator
5. DocumentBuilder
These are essentially all I anticipate being required (from a high-level perspective) in order to build and generate documentation from source code.

In the process of this, we have also discovered several inputs, each of which is used by one of the above sub-systems:

Code:

- Source Path
- Destination Path
- Inclusion Wildcard Paths (*.php, *.php.class, etc)
- Exclusion Wildcard Paths (/core/libraries/SwiftMailer/*, /scripts/*)
- Output Format (PDF, XHTML)
- Associative array of DEFINEs
- Array of include files, which might be required but exist outside the Source Path
I believe arborint added a few to this list but I cannot find those in our original discussion (I'm in a rush right now).

In our last discussion he and I had decided it was probably best to flesh out more ideas or abstractions before proceeding with any implementation specific stuff, just yet. For the time, if you care to experiment, you could probably use a recursive globbing function like provided here: http://ca2.php.net/manual/en/function.glob.php#51582

I have begun working on a far more capable implementation of this class, which will essentially return only the files we are interested in scanning.
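For anyone experimenting in the meantime, here is one possible shape for the FileScanner step: SPL's recursive directory iterator for the traversal and fnmatch() for the inclusion/exclusion wildcards. The function names scanFiles() and matchesAny() are placeholders, not an agreed API.

```php
<?php
// Hypothetical FileScanner: returns paths (relative to $sourcePath) that match
// at least one inclusion pattern and no exclusion pattern.
function scanFiles(string $sourcePath, array $include, array $exclude): array
{
    $root     = rtrim($sourcePath, '/');
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );

    $matches = [];
    foreach ($iterator as $file) {
        if (!$file->isFile()) {
            continue;
        }
        // Path relative to the source root, e.g. /scripts/run.php
        $relative = str_replace('\\', '/', substr($file->getPathname(), strlen($root)));
        if (matchesAny($relative, $include) && !matchesAny($relative, $exclude)) {
            $matches[] = $relative;
        }
    }
    sort($matches);

    return $matches;
}

// True when $path matches any of the shell-style wildcard patterns.
function matchesAny(string $path, array $patterns): bool
{
    foreach ($patterns as $pattern) {
        if (fnmatch($pattern, $path)) {
            return true;
        }
    }
    return false;
}
```

Without the FNM_PATHNAME flag, fnmatch() lets * match across directory separators, which is what makes patterns like *.php and /scripts/* behave as in the input list above.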

SourceIncluder is the second object of the equation; its purpose is to 'include' the source file and reflect all its interesting details, properties, etc. I have not yet implemented anything substantial except to experiment with the side effects of doing this. Feel free to begin implementing something interesting, but here are some caveats:

1. Some scripts (my own included) test for a DEFINE'd value to prevent the file from ever being invoked directly. Usually something in the bootstrap or index.php provides the define, such as define('INCLUDE_ALLOW', true); and all included scripts thereafter check that this value is set, or else return nothing and exit. Arborint and I concluded it's probably best to allow users to provide their own custom includes as part of the configuration process, as well as for another reason soon to be noted.

2. Because you are including classes willy-nilly with no suggested order, it's very likely you will open a class for reflection and discover, while it's being parsed by the PHP parser, that it relies on a dependency not yet included, resulting in a fatal error. I can see two solutions for this problem, but the easiest is letting the user provide a custom include in which they might register an autoloader with spl_autoload_register() to include required dependencies when this situation arises.

The alternative is to iteratively try including every other file until the error goes away; once all files have been exhausted, raise a fatal error, as the source code is not complete. Dependencies which are explicitly included do not suffer this problem, but path resolution errors may exist, in which case the working directory might need to be set first with chdir() -- perhaps another input parameter???
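A user-supplied include of that kind might look roughly like the sketch below. It assumes Zend/PEAR-style names where underscores map to directories; registerAutoloader() and the library root are hypothetical.

```php
<?php
// Hypothetical user-supplied include: resolve Zend/PEAR-style class names to
// files under $libraryRoot, e.g. Db_Driver -> $libraryRoot/Db/Driver.php.
function registerAutoloader(string $libraryRoot): void
{
    spl_autoload_register(function ($className) use ($libraryRoot) {
        $path = rtrim($libraryRoot, '/') . '/' . str_replace('_', '/', $className) . '.php';
        if (is_file($path)) {
            require_once $path;
        }
    });
}
```

With this registered before the scan starts, a class that extends or implements something not yet included triggers the autoloader instead of a fatal error.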

OK so caveats out of the way, basically how the second class would work is to simply include the source file and determine which class interfaces had been added to the runtime, using code something like:

Code:

// NOTE: Determine all the classes which have been loaded into the PHP runtime environment
$classes = array();
foreach ($files as $file) {
    $classes_before = get_declared_classes();
    include_once(trim($file, '/'));
    $classes_after = get_declared_classes();

    $classes = array_merge($classes, array_diff($classes_after, $classes_before));
}
$classes would then be passed to SourceReflector.

SourceReflector would iterate over each included class and begin reflecting all interesting details. This class is essentially a facade around the existing Reflection classes, as it's a single interface for dealing with classes, methods, properties, arguments, etc.

SourceCollator would be the engine behind tying all references together, building a class hierarchy, etc. Basically building the reporting data.
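To make the facade idea concrete, here is one way it could flatten a class into plain arrays using only the built-in Reflection classes. The function name reflectClass() and the array layout are guesses at a starting point, not a settled design.

```php
<?php
// Hypothetical facade: collect what Reflection reports about a class into arrays.
function reflectClass(string $className): array
{
    $class   = new ReflectionClass($className);
    $methods = [];

    foreach ($class->getMethods() as $method) {
        $methods[$method->getName()] = [
            // e.g. array('public', 'static')
            'modifiers'  => Reflection::getModifierNames($method->getModifiers()),
            'parameters' => array_map(
                function (ReflectionParameter $parameter) {
                    return $parameter->getName();
                },
                $method->getParameters()
            ),
            'docComment' => $method->getDocComment(), // false when absent
        ];
    }

    $parent = $class->getParentClass();

    return [
        'name'       => $class->getName(),
        'parent'     => $parent ? $parent->getName() : null,
        'interfaces' => $class->getInterfaceNames(),
        'methods'    => $methods,
    ];
}
```

The parent and interface names are the raw material a collator would need to build the class hierarchy later.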

DocumentBuilder would be the engine for actually generating the docs, whether that be PDF or XHTML, etc.

During our conversation, arborint and I also discussed the possibility of using XML, which at first I disliked, until it occurred to me that we could use the XML as an intermediate format for containing reflected class details. Basically, once a class had been reflected and its internal AST-like structure converted to XML, the XML file would serve as a cache, similar to OBJ files in a C/C++ program being cached to speed up compilation next time. This idea we both really liked, and it was thus decided to use XML for persisting the reflected data from individual classes, although I doubt we would bother much with persisting cross-referencing, such as dependencies, inheritance, etc.

This XML output, it seems to me, would probably be best situated in between SourceReflector and SourceCollator, in a class named SourceCaching or similar??? Actually, the fact that the class name SourceCaching doesn't make sense, and that any other name such as CachedReflection() breaks the consistency of the API naming scheme, leads me to believe it's simple enough to probably keep the caching inside the SourceReflector() object itself???

That being said, I think it needs to be determined what this XML output would actually look like. Do we use attributes of a tag to represent modifiers, such as private, etc? If memory serves me correctly, working with attributes was a PITA -- but I could be wrong.
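As a point of comparison for the attribute question, this is what modifiers-as-attributes could look like with SimpleXML. The element and attribute names (class, method, parameter, visibility, static) are invented here; nothing about the format has been decided.

```php
<?php
// Hypothetical serializer: one <class> element with <method> children whose
// modifiers are stored as attributes.
function classToXml(string $className): string
{
    $class = new ReflectionClass($className);
    $xml   = new SimpleXMLElement('<class/>');
    $xml->addAttribute('name', $class->getName());

    foreach ($class->getMethods() as $method) {
        $node = $xml->addChild('method');
        $node->addAttribute('name', $method->getName());
        $node->addAttribute(
            'visibility',
            $method->isPublic() ? 'public' : ($method->isProtected() ? 'protected' : 'private')
        );
        $node->addAttribute('static', $method->isStatic() ? 'true' : 'false');

        foreach ($method->getParameters() as $parameter) {
            $node->addChild('parameter')->addAttribute('name', $parameter->getName());
        }
    }

    return $xml->asXML();
}
```

Reading an attribute back is array-style access on the loaded element, for example (string) $doc->method[0]['visibility'], which may be less painful than working with attributes through the DOM API.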

Other than that I think we should probably start to implement an automated SourceIncluder, using the second technique I suggested above. Although letting a user provide a custom autoload() makes things easier for us, it makes more work for users. Not sure how an automated SourceIncluder is going to work, but there's a solution out there somewhere; in fact I might be on to it already. :p

What I would like however is for someone else to start an implementation and begin designing an API that we can establish, so we can move forward.

SourceIncluder
SourceReflector

The latter is not that important yet, however feel free to start on a prototype API.

Just figured I'd post this here as arborint and I discussed keeping this thread going publicly in hopes of drumming up interest from others. :)

Cheers,
Alex

Re: Documentation generator - project proposal

Posted: Wed Oct 28, 2009 12:41 am
by Christopher
Good writeup. I think one of the first things we need to decide is what data fields we want associated with the basic things in the system. The list I came up with earlier was: functions, methods, classes, files, directories, projects. I am guessing that this is the information that we will extract from the files and output as XML.

In addition, we discussed allowing this system to collect multiple chunks of information about each of these things (e.g., Reference, Tutorial, Quick Start, etc.).

Re: Documentation generator - project proposal

Posted: Wed Oct 28, 2009 2:02 am
by alex.barylski
The list I came up with earlier was: functions, methods, classes, files, directories, projects
We could probably just take everything reflection gives us as it's all good for docs IMHO (is method public, static, final, etc) even all those little details are interesting when running over documentation I think.

Not sure how you wanted to get project? A name perhaps supplied by an input parameter? There is another input parameter we need :P
In addition, we discussed allowing this system to collect multiple chunks of information about each of these things (e.g., Reference, Tutorial, Quick Start, etc.).
Yup...I'd personally like to see examples, like MSDN documentation (which IMHO is hard to beat) or the Win32 API programming bible format, which is absolutely amazing. Chapters are logically organized and give a detailed intro into the topic, such as "memory management", then each function is listed alphabetically along with a description, syntax (signature), parameters, return, includes (if any), see also, and finally an example, along with icons indicating which environment/version is required, etc.

Version information is about the only thing I think might make more sense being stored inline as meta detail for the function. Maybe a working example. I've always thought (then talked myself out of it) that if there was a way to implement unit tests inside comments, that would really help me in writing and maintaining them. At some point they could be extracted and run to validate behavior.

I never practice TDD because when I initially build a class, the interface changes so frequently for about the first week, at which point it settles right down and then I feel writing tests is not a waste of time. If however, the code for unit tests were inline, I would much more likely practice TDD strictly, as changes to either could be applied easily. Also, reading the code it becomes easy to see how the system is supposed to work. Probably a bad idea, I'm just throwing it out there.

How or where you would implement the test environment is beyond me. I suppose every package (ie: database) would need a test environment setup, etc and each method and class would have its unit testing code extracted, saved and executed upon documentation generation...I dunno just saying :)

Cheers,
Alex