Archive | Learn Programming RSS feed for this section

HOWTO Decompose your code into functions

28 Mar

For the sake of keeping code readable and as comprehensible as possible, good functions comport to a few simple rules:

  1. Give it one purpose. And the counting of the purposes shall be one. Not two purposes, not three purposes. Five is right out.
  2. Continuing with the theme of having one purpose, subtasks should be split off into their own functions. How do you identify subtasks? A subtask is a contiguous section of code which uses its own set of variables (give or take a few) and for which you can come up with a name for what that section does. An even better hint that a chunk of code is a subtask is if it is found repeated in multiple places in code (either within the same function or found in different functions).Of course, going too far with splitting off subtasks into their own functions would produce infinite regression, so we have to draw the line at some point: as long as a subtask can be confined to a few simple enough lines at a single place in code, its presence inside the function may be just fine.
  3. Give it a solid name—one which is strictly a verb phrase (or at least an ‘action’ phrase, e.g. ‘fileToString’)—that describes clearly the one thing that the function does. Too many programmers shy away from long names, which is silly with today’s pervasive code completion editors. Better a name be long than cryptic. (The exception to this is when you know the function will be used often in complex expressions, as say, math functions often are; in this case, you may want to keep the name pretty short.)
  4. Keep the logic simple. Whenever you see about 4 or 5 levels of nested logic, you should consider splitting some of the deeper logic off into its own function. (Rather than just counting nesting, a more accurate measure is cyclomatic complexity.)
  5. Keep it short. Once a function gets longer than will fit entirely on your screen, it becomes much harder to understand and work with. The optimum function length for code comprehend-ability is probably around one-third screen height (shorter than that and you get diminishing returns as you get a greater preponderance of functions in return for the greater function simplicity).
  6. Keep down the number of local variables. Good functions rarely have more than several local variables. Now, exceeding this number is rarely a sufficient reason by itself to split a function up into multiple functions, but this symptom rarely arises in isolation, so it is something to watch for. Besides splitting the function up into multiple functions, there’s not much else you can do to treat this symptom.
  7. Keep down the number of parameters. Functions taking more than several parameters are ugly. To avoid parameter preponderance, you can do a few things: a) package the information the function needs into one single package (in the form of an array or record of some kind, depending upon your language); b) put the information in variables external to the function; c) the function simply may need to be split into multiple functions.

These last two principles are the hardest to follow: in our structured languages, decomposing a problem into functions means not just dividing up the labor but also segregating the data, so the trick becomes to maintain access to variables where you need them (without resorting to making every variable global: the whole point of structured programming is for data access to be highly regimented).

The practical solution to this problem—and the problem of decomposition in general—is pretty obvious, but too embarrassing for many programmers to admit: rather than write perfectly decomposed code on the first attempt, it’s much simpler (and more common) to just go with whatever first occurs to you and work from there. Indeed, the pseudocode decomposition process commonly taught in schools is unrealistic, for in practice, code is very rarely any good when you first type it out, no matter how well you plan; exceptions certainly occur, but they typically only do so when you’ve solved essentially  the same problem(s) before; when it comes to tackling a problem you haven’t solved before, you’re destined to get things quite wrong on your first attempt 98% of the time.

So it’s important to embrace refinement as your most basic coding activity. Rather than writing good functions, the real important skill is to rewrite bad functions.

A brief explanation of Java versions

3 Mar

How does one make sense of Java’s version history for learners? The full story is at http://en.wikipedia.org/wiki/Java_version_history, but here’s the brief version:

First be clear that that there is only one Java language—one set of syntactical rules for how to write Java code. This set of rules has grown a few times since Java’s introduction in 1996, but valid Java code written in 1996 will still compile with today’s Java. However, the standard Java libraries have also evolved over the years, and some parts of the library have been deprecated (meaning you shouldn’t use them in new code you write because they are flawed and may be completely removed in future versions), so old code may need to be reworked to use modern libraries in place of deprecated ones.

While there is only one Java language, there are several implementations of the language, meaning there are different compilers, different JVM’s (Java Virtual Machines), and different implementations of the libraries. These varying implementations (mostly) all conform to the Java standards, but some have extra features and some have performance advantages over others. The most widely used Java implementations are from Sun, the originator of Java. Recently, Sun has begun releasing its Java implementation under the GPL (Gnu Public License), a free / open source license.

Java is, in truth, a standard and not just a particular line of software downloads released from Sun. For simplicity, though, we’ll only consider the Sun releases and their terminology, as Sun’s Java is the most popular implementation and the one learners should use.

Standard Edition (SE) vs. Enterprise Edition (EE) vs. Micro Edition (ME)

Java comes in three “editions”, which, strictly speaking, are specifications, not actual implementations of a JVM (Java Virtual Machine) and Java libraries. In practice, though, the terms are most commonly used to distinguish between the three Java implementations freely downloadable from Sun.

  • SE is the baseline JVM and libraries. This is the edition used by most PC end-users.
  • The JVM you get with EE is one and the same with the one in SE, but EE adds a large number of libraries for server-oriented programming. It generally never hurts to download and run Sun’s EE Java distribution, but as a learner, you’ll likely never use the EE libraries. I prefer to browse the SE documentation so I don’t have to wade through EE stuff I never use.
  • ME is Java adapted for resource-constrained (i.e. low memory, low power) devices, e.g. cellphones and PDA’s. Because of their typically low storage and memory capacities, such devices can’t afford to have all the libraries found in SE (and certainly not in EE). Small devices differ significantly from one another and so Sun leaves it up to device makers to offer proper implementations of ME for their particular platform. Sun has an ME reference implementation that runs on PC’s which is used for software development (just because you’re writing software for a cellphone doesn’t mean you want to write it on a cellphone!).

Java Development Kit (JDK) vs. Java Runtime Environment (JRE)

On PC’s, you have the choice of using Sun’s JDK or the JRE. The JRE, meant for end-users, comes in only the SE flavor and contains the JVM (Java Virtual Machine) and the SE standard libraries. The JDK comes in the three editions and includes not just the JVM and libraries, but also development tools, including javac (Sun’s java compiler), so this is what you’ll want as a programmer. If you have the JDK installed, you don’t need to install the JRE separately to run any Java software (though it’s likely your browser will have an embedded JRE of its own). (The term ‘JDK’ is an echo of the more general term ‘SDK’ (Software Development Kit)).

Version numbers

Without getting into the features added by the versions, here at least is the naming history:

  • The original release of Java was 1.0, released in 1996.
  • Then came 1.1 in 1997.
  • 1.2 in 1998 is when Sun split Java into the three editions. Sun began calling this version and subsequent versions Java 2, and so you will see references to J2SE, J2EE, and J2ME (Java 2 Standard Edition, Enterprise Edition, and Micro Edition, respectively). Sun began giving the releases codenames while they were in development, calling 1.2 “Playground“.
  • 1.3 in 2000 (confusingly still called Java 2). Codenamed Kestrel.
  • 1.4 in 2002 (confusingly still cased Java 2). Codenamed Merlin.
  • Sun then decided the version numbers weren’t getting big enough fast enough, so they decided to call the next version, released in 2004, variously 1.5, 5.0, Java 5, or (most egregiously) J2SE 5.0, J2EE 5.0, or J2ME 5.0. Codenamed Tiger.
  • Sun finally drops the Java 2 business, deciding that, from now on, the public name of releases will be Java n while the internal development name will be 1.n.0. In late 2006, Java 6 is released, known internally as 1.6.0. Codenamed Mustang.
  • Planned for release in 2008 is Java 7 (1.7.0), codenamed Dolphin.

So, in summary, the sequence of most used designations goes: 1.0, 1.1, 1.2, 1.3, 1.4, Java 5, Java 6, Java 7.

As for the feature improvements these versions represent, in general, each release saw bug fixes and performance improvements for the development tools and JVM along with additions to and refinement of the libraries. Aside from these behind-the-scenes changes, Java 5 is the one release which significantly added new features to the language itself, including generics, annotations, autoboxing, enumerations, “varargs”, and the “enhanced for” loop. (None of these features have analogs in PygeonJava, as they are all in one way or another inessential conveniences.)

More PygeonJava

28 Feb

Continuation of previous post…

1) No inner classes: no classes as static members, no classes as instance members, no local classes, no anonymous classes. The advantages of these constructs are:

  • Documents that the class is used only by its enclosing class.
  • Allows access to enclosing members in certain cases.
  • More convenient than writing the class elsewhere.

My reasons for omitting inner classes anyway:

  • The concept of a class as an instance member of another class is tricky to explain.
  • The rules about outer member access are strange because of the peculiar way in which inner classes were added to the language.
  • Anonymous inner classes are particularly useful, but fitting a whitespace-sensitive construct within expressions requires strange special allowances.
  • Most importantly, inner classes are for the most part syntactical conveniences.

2) ‘break’ and ‘continue’ labels are in. The labels are declared with ‘label’ on the line preceding the ‘while’ or ‘if’ to which they apply (heh, a bit of experimenting just now told me labels can actually be used not just with loops but with ‘switch’es and ‘if’s; seems labels can’t precede an ‘else’, but they can precede an ‘if’ immediately after an ‘else’).

3) Explicit casting everywhere: no automatic widening casts, no autoboxing, no nothin’. Enforcing this will require inspection of the imported types, for otherwise, automatic casts would be accepted by the compiler. (With all this casting going on, I’m considering abbreviating the casting operator from ‘cast’ to just ‘c’).

4) No Java 5.0 features: no autoboxing, no enums, no generics, no annotations, no covariant returns, no varargs methods. Aside from autoboxing, I find the syntax and subtleties of these features strange enough myself that I can’t see learners faring well with them. Besides, they are for the most part just conveniences. Covariant returns are not allowed because this would break the explicit casting requirement. (On the other hand, learners will shortly encounter some 5.0 features in the library if they look around the docs; of course, PygeonJava has always been designed to allow learners to pick it up ASAP and then drop it ASAP, so learners should be on to real Java before they give the docs a serious look.)

5) Abstract classes are in. Even though an interface is arguably just a limited kind of abstract class, interfaces are in too because of their important role as a kind of pseudo multiple-inheritance. In interfaces, the public and static modifiers on each method must be present as a reminder to learners.

6) No assertions. On one hand, assertions are really useful because they can be toggled on and off by a runtime option, which is very useful, but on the other hand, assertions are essentially just a shortcut for if (!bla) {throw exception;}.

7) As with PygeonC, no for, enhanced for, do-while, switch.

Some PygeonJava rules

25 Feb

In previous posts, I introduced the idea of an alternative syntax for Java designed for educational purposes, which I’m calling PygeonJava. The design is still in progress, but here are some tentative ideas, most of them attempts at reducing learner confusion about the resolving of named things (i.e. identifying what identifiers refer to).

  • Each part of a package name ends in /, e.g. java.lang is written java/lang/ . This makes packages easy to distinguish from object references and type names.
  • No static imports.
  • No default package. All files must declare their allegiance to a package.
  • No default access.
  • Importing multiple types of the same name causes a conversion error (the program that translates PygeonJava to Java I’m calling a ‘converter’). To get around this problem, fully qualify the types you need in the code with the full package name, e.g. java/io/File.
  • Similarly, it is not allowed to import a type of the same name as a type in the current package.
  • Types in java.lang are not automatically imported, even Object. (Is this extreme? Certainly, but it means learners can use a consistent process when they see a capitalized process and want to know where it comes from: if it’s not qualified by its package name and its not in the import list, then it must be in the current file or at least the same package.)
  • In the spirit of explicitness, all calls to the super constructor should be required to be explicit, so you’ll always see super() (or some overloaded variant thereof) as the first line in a constructor, even for a class extending Object.
  • Also in the spirit of explicitness, a class should be required to include at least one constructor, even if it’s just a ‘default’ constructor with nothing but a call to super().
  • All uses of instance fields and methods of the current class must be qualified by ‘this’—except instead of using the word ‘this’, ‘me’ is used instead (mostly because it is shorter but also because I remember finding the semantics of ‘this’ strangely confusing). The problem I mean for this to solve is confusion in distinguishing locals from members and rules about name collisions. Arguably, methods need not be qualified explicitly by ‘me’, but I think it’s more consistent to require it.
  • A concern is that static members are not really part of an object, so referencing statics via object references rather than explicitly via the type name seems wrong because the object is not really being used except as a representative of its type. The stringent thing to do would be to require all references to statics to be done via their type name, not via instances or with the ‘me’ reference. This comes very close to covering all cases, but it’s possible in foo.bar() that foo could refer to a thing of type X or a descendant of type X, and static method bar could have been overridden in the descendant of X. I don’t really see a solution, and I’ll just have to accept this Javaism. I think it’s sensible to advise users to reference statics via the type name as much as possible, especially in favor of referencing them via ‘me’. Seems I was wrong: statics are resolved always at compile time using the type of the reference, not at runtime using the actual type of the object referenced. So it actually is a good idea to require all statics to be referenced via the type name. Unless I’m overlooking something, I guess the reason Java doesn’t have this rule is because it was considered inconvenient.
  • Similarly, inherited members are accessed only via ‘super’, not ‘me’. This eliminates another arbitrary choice for learners, reinforces the concept, and reminds learners where the member comes from.
  • Initialization sections can be simply omitted from PygeonJava, and I’m thinking the same about field initializations. It’s generally a bad idea to write code that relies upon subtleties of the language that few of its practitioners understand, and I feel the execution order rules concerning field initializations, initialization sections, and constructors fall into this category. Better to just eliminate the whole question.
  • A class file has this enforced order: static fields, then instance fields, then constructors, then instance methods, then static methods (or maybe static methods before instance methods). This is essentially arbitrary, but it’s the most typical style I’ve encountered. (I haven’t decided whether to include named inner classes, but if so, they would go after everything else.)
  • Allowing interfaces to include static final fields is quite confusing as it distracts from the main purpose of interfaces.

I’m beginning to reconsider what I said about Java not being as much in need of an alternative syntax as C. Sure, nothing is quite as glaringly flawed as C’s declaration syntax, but things add up, and I’m beginning to remember how confusing I found parts of the language when learning it.

I’m also beginning to realize that, rather than making my converter merely produce workable Java using one-to-one translations, it may need to do some semi-sophisticated analysis to enforce some of PygeonJava’s restrictions. Additionally, it’s quite important for the converter to give good error messages of its own because learners won’t be able to comprehend javac’s messages—professional Java programmers have a hard enough time understanding compiler messages, so I doubt students only familiar with an alternative syntax of the language would fare well.

PygeonJava: is it worth bothering?

21 Feb

The need for an educational alternative syntax for Java is less compelling than for C, assuming, at least, that Java is not the first language learners are being exposed to. Not to say that Java is particularly easy—it’s gotten rather baroque rules, in particular, surrounding inheritance, access, and generics—but it’s not the syntax of these features that is really the problem.

In fact, most of the real danger areas of Java syntax are where the similarities with C’s syntax are misleading. In my own experience of learning C before learning Java, this is the case with Java’s array syntax (and it certainly didn’t help that I wasn’t really straight about C’s array and pointer syntax). Conceptually, Java’s arrays are conceptually quite different from C’s arrays, and the fact that they aren’t allocated on the stack has a number of important consequences, such as allowing array size to be specified dynamically.

Here’s my current idea for array reference syntax:

foo:a-char // declare reference to array of chars

foo:a-char (3-char) // same as above, but initialize to an actual array (note the size has to be supplied)

foo:a-char (bar-char) // size of array is given by expression (the variable bar)

foo:a-char1 // leave first dimension size inferred from number of elements

foo:aaa-Dog // declare reference to 3-dimensional array of Dogs

In the end, Java arrays are still a rather complicated matter and so real understanding is probably only gained by a thorough discussion of what’s really going on in memory.

[...]

  1. add bar 3)-char) // size of array is given by expression (sum of bar and 3)

    foo:a-char (a-char ‘a’ ‘b’ ‘c’) // actual values are supplied and so size of array is inferred from number of elements

    foo:aa-int // two-dimensional array of ints

    foo:aa-int (4-a-int) // create two-dimensional array of ints, specifying the first dimension

    foo:aa-int (aa-int (a-int 3 5) null (4-int []

PygeonC function declarations/definitions and function pointers

11 Feb

I considered many candidates for function declaration syntax. Imagine we want to declare a function named ‘foo’ that returns int and has three parameters, x, y, and z. Here are some of the candidates I considered:

func foo:int x:int y:short z:char

func foo:int params x:int y:short z:char

func foo returntype int params x:int y:short z:char

func foo rtype int params x:int y:short z:char

func foo int params x:int y:short z:char

func foo int x:int y:short z:char

func foo x:int y:short z:char rtype int

func foo x:int y:short z:char returntype int

[hey, just found out you can preserve coloring and basic formatting when copying from OpenOffice into the WordPress text box. Neat trick.]

My current thinking is to use the last form, for a few reasons:

  • Return types are a new concept, as are statically typed parameters. While experienced C/Java programmers should have no trouble thinking of the return type as the type of the function and hence understand the foo:int syntax, I ultimately concluded it’s important to keep the concepts of return type and parameter type syntactically distinct.
  • I considered abbreviating Pygeon’s ‘parameters’ keyword to ‘params’, but because it’s assumed students have learned Pygeon before attempting PygeonC, it makes sense to just get rid of it at this point, as it’s unnecessary extra typing. Besides, I don’t like how it makes functions with parameters look too different from functions without; such non-uniformity makes it harder to identify functions when quickly scrolling past them.
  • On the other hand, return type, being a new concept, could stand to stand out, hence announcing it with a keyword and placing it at the end.
  • I’m thinking that ‘returntype’ is too much to type, but it may still be the best option. I considered ‘returns’, but that’s confusingly close to ‘return’, and I always find imitating English sentences in syntax to be misguided, not a virtue. Abbreviating to ‘rt’ is too cryptic. My concern with ‘rtype’ is that “ARR-TYPE” might enter into the learner’s vocabulary; I don’t like the idea of polluting the already confused lexicon of programming.
  • I’ve argued elsewhere that superfluous syntax is usually just confusing to learners, which suggests ‘returntype’ should be omitted, leaving the return type to just be inferred as the last type given. Like with having the ‘parameters’ keyword in Pygeon, though, ‘returntype’ may just help hammer the vocabulary term into learners’ heads. As long as it’s pointed out to students that the syntax could easily ditch ‘returntype’, I think this is a fair trade off.

(Note the type keywords are highlighted in a different color than the normal blue color. In fact, I’m thinking all type specifications should be highlighted uniformly whether they are built-in type keywords or not.)

Whatever the chosen function definition syntax, the prototype syntax is basically the same but with ‘proto’ in place of ‘func’ and with just parameter types, not names:

proto foo int short char returntype int

In fact, PygeonC doesn’t let you specify the parameter names in a prototype, for having optional syntax that doesn’t even do anything is frankly just confusing, and I consider it a core concept of C prototypes the fact that the compiler disregards any parameter names present. Besides, the utility of including parameter names in C prototypes is having self-documented code, something which is not really a concern for learners at this stage.

You might argue that prototypes could simply start with ‘func’, just like functions, but I think this is precisely the mistake C makes. Sure, function declaration and function definition are conceptually related, but giving them only subtly distinct syntaxes, as C does, makes them initially hard to discern and blurs the conceptual differences in the minds of learners. Generally speaking, subtlety is not a virtue for making things grokable.

The definition/declaration blur is another example of a misguided attempt at conceptual unity in C. I think what’s going on with such misguided conceptual unities is that designers spend their time juggling many parts in their head, mentally banging them together to see which parts fit with which and which overlay the others; the best part of designing comes when you have those ‘A ha!’ moments, where you see how parts you were thinking of as separate can be neatly overlayed, interlocked, or even dissolved into one, greatly simplifying the design. A small minority of the time, these revelatory moments really pan out just like they seem they will in that initial flash of recognition. However, most of these moments come to nothing when it later turns out, upon further reflection, that the idea doesn’t really make sense or fit consistently with the rest of the design. Other times, refactoring the rest of the design to fit the revelatory idea actually makes the whole more complicated. Other times, the idea can be accommodated just fine, but the gain is just an illusion: the designer, unhappy with some trade off, finds a solution that seems to dissolve that trade off, but euphoria blinds the designer to some side-effect introduced by his solution; on any other day, this new problem would displease the designer just as much as the problem he’s just solved, but he just isn’t thinking about it at the moment—maybe he’ll notice in a week or two.

The aspects of C that displease me so, the misguided attempts at conceptual unity, I actually believe these are partly examples of success: after all, the conceptual unities of C ‘work’ in the sense that (obviously) they comprise a real working language and the conceptual unities do in fact reduce the syntax (e.g. pointer and array declaration syntax mirrors the syntax for dereferencing and array indexing). However, these design choices also exhibit the designer ‘euphoria blindness’ I described: aspects of the design have been simplified, but only by incurring disregarded costs elsewhere. In this case, the costs are to language transparency. The design of a language is a kind of design where the transparency of the design is almost as important as its functionality, but the misguided conceptual unities in C are difficult to convey to outsiders because they really only make sense to people who already understand them. Having read many accounts of the C language, I’ve come to the conclusion that many of the traditional stories and vocabulary which C programmers use to talk about the language to each other simply fail to account for what is really going on in the language. Really, this is an unfortunate fact of any area of expertise: the experts are already cognizant of what’s really going on, so it’s fine for communication amongst themselves if their explanations and vocabulary abridge or misrepresent to untrained ears what they’re actually saying.

Anyway, on to function pointers. Let’s say we wish to declare a pointer named ‘bar’ for the function ‘foo’, above. Here’s the syntax I currently have in mind:

bar:[pfunc int short char returntype int]

(Notice the whole type declaration is in a single-color highlighting. Again, ‘pfunc’ and ‘returntype’ could be omitted, leaving the return type to be inferred as the last type specified in the brackets, but I think it’s best to leave those training wheels on.)

Any occurrence of [] always denotes a function pointer type. You can create derived types from function pointers using the usual scheme, e.g. if you wanted a 3-element array of pointers to a function pointer, it would look like so:

bar:3-p-[pfunc int short char returntype int]

Function pointer syntax requires start and end delimiters of course because you might need a pointer to a function that has a function pointer parameter:

bar:[pfunc int [pfunc char returntype p-char] returntype int]

(Above, ‘bar’ is a pointer to a function that takes a function pointer as its second argument.)

I’ve yet to read an introductory C book other than K&R that even mentions function pointers, and I think for good reason considering the crazy way C denotes them. Hopefully the above syntax can correct this.

UPDATE: Upon further consideration, ‘rt’ is the right choice. Also, ‘pfunc’ should just be omitted, as it’s enough to say that every pair of brackets is a function pointer signature. So now function definitions, prototypes, and function pointer declarations should look like, respectively:

func foo x:int y:short z:char rt int

proto foo int short char rt int

bar:[int short char rt int]

Another issue is casting. My first notion was to reuse the declaration syntax, like so:

foo:p-void // cast foo to pointer-to-a-void

…but then I realized this looks odd when applied to a parenthetical expression:

(bar a b):p-void

I also dislike how it gave colons two subtley related purposes. If casting is an operation, best treat it consistently with other operations and have a casting operator:

(cast p-void foo) // cast foo to pointer-to-a-void

What’s the matter with C?

6 Feb

First off, I’ve made a couple more decisions about Pygeon syntax:

  • The assignment operator is the keyword a (standing for ‘assign’), not the symbol = . So an assignment looks like (a foo 3) // assign 3 to variable ‘foo’ . The = symbol in math is not actually an operator, and so it is one part of mathematical notation that confuses the whole issue of expressions. It’s better to sidestep the whole distinction by never associating assignment, the making of equality, with what students are used to thinking of as the declaration of equality. Besides, assignment expressions already bring with them a confusing concept: assignment is a unique operator in that the target operand has to be a so-called ‘lvalue’ (‘left value’, a term coined in C), and in this sense the target operand doesn’t get evaluated as operands normally do.
  • The membership operator (the Pygeon equivalent of both . and [ ]) is the keyword m (standing for ‘member’), e.g. (m foo 3) // return the 4th index of ‘foo’ .

Continuing with the discussion of PygeonC syntax (see previous post):

The biggest pieces of C missing from Pygeon are syntaxes for variable declarations and structure definitions, and these happen to be areas where C could greatly stand to improve, for the sake of both learners and proficient users of C.

The worst decision made in the whole of C syntax is the way that the [ ] and * operators have opposite—but not precisely opposite—meanings in the context of a declaration as they do in an expression, e.g. in a declaration, [] makes something which is not an array into an array, while, in an expression, [ ] makes something which is an array into a constituent value of that array. Worse, the fact that * is overloaded to be both the binary multiplication operator and the unary * dereference operator is just confusing. To begin with, the unary/binary operator distinction is hard for most students to get used to because the only unary operators familiar to them are the – and + signs, which seem like special cases because their meaning as unary operators is logically related to their meaning as binary operators: students generally think of such signs as part of the number, not operators; this is not the case with unary * in C, which actually does something entirely differently from the binary * operator.

Quickly determining whether a * is a dereference or a multiplication is hard enough, but then * might also be indicating a pointer in a declaration, which raises another problem: learning to distinguish C’s declaration statements from expression statements is hard at first. The simple declaration statements—the ones beginning with one of the type keywords—are generally easy enough to quickly identify, but things get ugly in more realistic examples of C code, where typedefs, *, and [ ] are used frequently. Beyond just recognizing declarations, writing and parsing them is tricky: the base type can be modified with a panoply of keywords before or after the base type and the *, (), and [ ] operators using a totally unintuitive precedence. It’s in these complicated declarations that the decision to mix a prefix operator (*) with two postfix operators ( [ ] () ) is shown to be just egregious. In fact, I think it demonstrates that the whole idea of using the idiom of expressions in declarations makes absolutely no sense. I’ve read Dennis Ritchie justify this choice as an attempt at conceptual unity—complex types should be declared like how they are used—but this doesn’t really hold water: in a declaration, nothing is being evaluated. Instructors of C have long noted the trouble students have with C’s complex types, but I suspect the primary fault here lies with the strange syntax which obscures the underlying simplicity. (As much as I dislike this syntax, it severely annoys me how few C tutorials and books fail to emphasize—or point out at all, in some cases—that these are in fact operators with the base type as their operand. K&R is the only book I know that emphasizes this at all, which I think is one key reason it’s still so highly valued.)

So what does PygeonC do about declarations and the reference/dereference syntax?

  • & is replaced with ref (or maybe just r), and dereference * is replaced with dref (or maybe just d). E.g. (dref foo) is equivalent to C’s *foo. Verbose, surely, but clear and consistent with the established syntax.
  • There is no [ ] indexing operator as it is really just a convenience: x[3] is equivalent to *(x + 3). E.g. PygeonC would express x[3] as (dref (add x 3)). Again, this is verbose, but better to lay bare what is really going on. (On second thought, perhaps dref can optionally take a third operand to specify an offset before dereferencing, e.g. (dref x 3) .)
  • Rather than having signed and unsigned modifiers, there should simply be keywords for the unsigned variants, e.g. ‘ulong’, ‘ushort, ‘uint’, etc.
  • My current thinking for the declaration syntax looks like name:type . (I’m not sure about doing the reverse from C, putting the name before the type, but it just appeals to me at the moment.) Allocation modifiers go before the type separated by another colon, e.g. name:static:type .
  • An array modifier is simply the number of elements in the array. A pointer modifier is simply the letter p. The array and pointer modifiers go before the type, separated from it by hyphens, e.g. foo:static:3-p-p-int //foo is a 3-element array of pointers to pointers to ints . (Hyphens must separate all p’s and numbers; it would be perfectly understandable to omit hyphens between adjacent p’s, but for simplicity and consistency, it is not allowed.)
  • Notice that the declaration syntax makes a declaration all one token without whitespace. This means there is no separator necessary between the listed parameters in a function definition and it makes the casting syntax simpler. A cast is simply done by using the type as an operator, e.g. (p-int foo) // cast value of foo to a pointer to ints

(I haven’t even really discussed the problems with function pointer syntax; that will have to wait, as a better solution is not something I’ve cracked yet).

Here are some additional thoughts:

  • The membership operator is used just with structs, not dynamic associative arrays, so its argument is always known either to exist or not exist at compile-time. Because these names are always resolved at compile-time, the member is specified not as a string but as the name directly, e.g. (m foo bar), not (m foo “bar”). Using (m foo “bar”) would imply there is some string “bar” somewhere in the program text, but this is just a name used by the compiler.
  • It occurs to me that students should be made aware of an important conceptual difference difference between operators and functions in PygeonC that doesn’t exist in Pygeon. In C, some operators take a range of types for their operand(s) rather than just one and only one type; this is aside from the implicit casting of the core types, e.g. there is no implicit casting between integers and pointers, yet the + operator works with both integer and pointer operands. In other words, operators can be made to work with a range of types that you just can’t do with functions. Just as important, the type returned by an operator may vary whereas the type returned from a function never can.

Syntax does/doesn’t matter

31 Jan

Syntax doesn’t matter: any good programmer works with multiple languages over their lifetime, most of these languages expressing basically the same ideas in mostly arbitrarily different ways; any serious student of programming will come to the same conclusion once they learn their third or fourth language.

I’ve seen this in another context, trying to learn Arabic. This Slate article explains the difficulties of this undertaking very well, and it points out what I experienced myself: reading the script is the easy part, and actually not all that hard. Excluding the elaborate calligraphies, English-speaking learners of Arabic should only need a month or two to feel reasonably comfortable quickly identifying the characters and reading them chained together. Additionally, it is surprisingly easy to adjust to reading right-to-left: after a short time, the brain simply makes the switch. (Remembering to open books from what seems like the back takes a lot longer.) Programmers develop this same ‘switch’ when working with different syntaxes.

But syntax, of course, does matter on two ends: programmers have to write in a syntax, and learners have to learn a syntax. Really, when people say ‘syntax doesn’t matter’, they mean that programmers—and those learning to program—shouldn’t think in syntax. This is true, for whatever the syntax, the syntax is not going to change the semantics of what the programmer codes, nor is it going to change the core concepts of a language which the student must learn.

My main purpose in proposing a new educational programming language is to give learners a language that expresses the core features common to modern languages in the syntactically most direct and obvious way possible. This principle should be applied to the libraries of the language as well, so for instance, abbreviations that are non-obvious to a non-programmer must be excised from the language, including the libraries, e.g. the language can’t have anything like C’s “stdio” (“standard input/output”) or “sprintf” (“string print formatted”). (Such abbreviations would be more forgivable if full-names were given in learning materials, but out of the dozens of tutorials and references of the C language I have read, only a couple do so.) Taken individually, shortcuts like these may seem to the initiated like small hurdles, but only because the initiated can no longer see how non-obvious such shortcuts are. The small hurdles quickly add up. Every quirk, every historical legacy, every shortcut is one more thing which at some point is going to cause the learner frustration, possibly halting their progress.

To transition learners from Pygeon (my educational language) into C and Java, it therefore makes a lot of sense to give learners intermediary languages, ones which are as free as possible of the quirks, historical legacies, and shortcuts found in C and Java. Call these intermediary languages PygeonC and PygeonJava (calling them CPygeon and JPygeon would imply we are talking about particular implementations of Pygeon, like with CPython and JPython). Syntactically starting from a base of Pygeon, these languages would be directly translatable into valid C and Java (in fact, this would be the simplest and best implementation method).

To give you an idea, I’ll discuss a few of the syntactical foibles of C and how the semantic content behind the foibles might be more obviously (though likely more verbosely) expressed in a Pygeon-derived syntax:

The first question is whether PygeonC would need to introduce the control flow features of C not found in Pygeon (this includes for, switch, do-while). You can program C just as well without these constructs as you can with them, so it seems OK to omit them. On the downside, not including these constructs delays introducing them to students until they encounter real C, but I feel this is the right choice, as PygeonC is meant to introduce learners to the concepts of C, and these constructs arguably are just syntactical conveniences.

Goto and labels, however, aren’t quite just conveniences, as they often open up significantly different ways of expressing some particular logic, so I feel their inclusion is warranted. More importantly, the fine-grained control offered by goto and labels are in the spirit of C—even though their use is best avoided in almost all cases even in real C. PygeonC’s goto statement will look just like C’s, but label’s will be declared with the label keyword, not followed by a colon e.g. label foo. Labels must go by themselves on the line preceding the statement they label.

I’ll discuss more C/PygeonC syntax in my next post, What’s the matter with C?.

Expressions, expressions, expressions

27 Jan

The most common oversight in beginning-programming education is that most instructors and most books fail to emphasize the concept of expressions. From my own learning experience and from talking to classmates, it’s very rare, for instance, for learners to think of the of the function name in a call as the operator and its arguments its operands. Students also fail to grasp the basic idea that all operands are of a type and that an expression, no matter how complex, evaluates into a single typed value.

The benefit to giving learners a thorough understanding of expressions up front is that it then becomes easy in most languages to explain a good number of other rules. For example, when explaining conditional expressions, all you have to say is that it’s an expression whose evaluated value is then interpreted as true or false according to such-and-such rules. You shouldn’t then have to explicitly tell learners that they can include function calls in the expression or (in some languages) assignments. Most importantly, learners can begin to see a conceptual unity in the language.

With this in mind, it occurs to me that Pygeon (my educational language [see previous post]) should use prefix notation rather than the usual infix notation. Pygeon’s prefix notation would differ in a few key ways from that found in Lisp:

  • Pygeon is still statement-based, so control flow (‘if’, ‘while’, etc.) is kept conceptually separate.
  • The outer expression of a statement need not be surrounded in parentheses.
  • Operators will all be designated by name rather than symbol, i.e. the addition operator will be add. Even operators like . and [] are expressed as what appear to be built-in functions, e.g. indexing the fourth member of an array would be written: (member arr 3) .

It follows that, for consistency, function calls should adopt the Lisp prefix style, including the lack of commas between arguments.

The basic benefit to these decisions is that it frees students from having to think about syntax: using named operators and prefix notation may require more parentheses and verbosity than most experienced programmers would want to deal with on an extensive basis, but the advantage for learners are considerable—at the very least, prefix notation frees students from the distraction of an order of precedence. Perhaps more importantly, I think learners can too easily fail to see the conceptual parallel between an operator and its operands and a function and its arguments; the difference between the two really is just that the built-in operators of a language are built-in, use a special syntax for convenience/aesthetic reasons, and are typically implemented differently from function calls to avoid the same overhead.

Of course, learners eventually will move on to languages with the more typical infix notation, so you might object that students shouldn’t be deprived of infix. While I do believe prefix notation lays bare the concepts of expressions in a way that infix does not, I actually think the optimum approach is to present them side-by-side, no matter what the language being taught, for this is an excellent example of where learning more material makes learning easier rather than harder because the concepts are best understood presented in contrast with near-neighbors and alternatives. Prefix notation offers the purer syntax for expressions while infix is the more familiar.

Pygeon: a new educational programming language

14 Jan

One thing I’ve had on the drawing board for a while is a new educational programming language, which I’m calling ‘Pygeon’—pronounced ‘pigeon’, but spelled with a ‘y’ in honor of its Python heritage. The design of Pygeon reflects one key principle: conveniences are confusing for learners, as they cloud the ‘essential design’ of a language with distracting ‘incidental design’. Not only are conveniences unnecessary mental clutter the learner has to deal with, they are like the Kool-aid guy bursting through the walls of the language’s form: they make the form more complex, turning it into a confusing Swiss cheese from which the initiated may be able to still discern the essential form but from which newbies usually can’t.

Now, in a sense, programming languages are themselves ‘conveniences’, so how are inessential conveniences different from essential ones? Well, programming languages are always a form of information compression, e.g. in C, you needn’t write:

int x; int y; int z;

…as you can simply write:

int x, y, z;

Similarly, writing a function is a convenience in that it saves programmers from having to write the machine instructions themselves. The key difference here is that the multiple-declaration syntax convenience just saves you some typing and compacts your code; in contrast, having functions as a built-in abstraction to the language saves the programmer from having to think or even know about machine instructions, not just from having to type machine instructions.

Speaking of functions, Pygeon’s function declaration syntax makes a good demonstration of Pygeon’s adherence to the ‘no conveniences’ rule. A function declaration has this form:

function  name [parameters parameter1 parameter2 ... ]

The body of statements comes indented on the following lines, a la Python.

The optional ‘parameters’ keyword is followed by the (non-comma-separated!) list of parameter variables. Commas are not used to separate the parameter names, for two reasons: first, while C-influenced languages choose to emphasize the symmetry between function-call argument lists and function-declaration parameter lists, I find it’s actually more important to emphasize the distinctions, as one is a list of expressions and the other a list of declarations; second, without type declarations, the commas become unnecessary, and unnecessary syntax confuses learners because they want to believe the syntax serves a purpose they just can’t discern yet.

You may have noticed that the function declaration header does not end with a colon, as Pygeon disallows the body of a block to go on the same line as the header. Nor can multiple statements go on the same line with semi-colons. So whereas, in Python, you can write:

if x > 3: doSomething(); x = 5

…in Pygeon, the same must be expressed as:

| if x > 3
|   doSomething()
|   x = 5

I made these decisions because, in my own learning experience, formatting choices are distracting. Pygeon takes this very strictly and so lacks any mechanism to spread statements across multiple lines. Certainly this is very limiting and not acceptable in a real-use language, but occasional overly long lines are tolerable in an education language and act as a good indicator to the learner that maybe they should split their statement(s) up or that maybe something is non-ideal about a function that takes so many parameters; this also ensures that all Pygeon code only ever uses indentation to introduce a new block, which greatly reduces confusion when reading code for eyes not used to doing so.

Here’s a quick run down of some other design choices in Pygeon:

  • If Python didn’t have ‘elif’, if-else ladders would be a large mess of indentation; reading and writing if-else ladders at one level of indentation is a basic coding skill, so Pygeon includes ‘elseif’, even though it is, strictly speaking, unnecessary and just a convenience. (The C solution, introducing compound statements, is overly clever for something of such specific use, and it brings along strange edge cases that have to be explained, e.g. if I have a single non-compound statement after ‘if’, does that constitute a unique scope?)
  • There is no ‘for’, ‘foreach’, or ‘do while’, only ‘while’. I personally hardly ever find a use for ‘do while’, and though ‘foreach’ is particularly useful, it is just a convenience. Not only do these constructs represent more syntax to explain, they represent distracting choices learners will have to deal with when writing code. Not including them in the language hammers home the idea that only ‘if’, ‘else’, ‘elseif’, ‘while’, ‘break’, and ‘continue’ are needed to construct any possible flow of execution (though ‘elseif’ and, I believe, ‘continue’ aren’t really needed either). For similar reasons, there are no ‘switch’, ‘goto’, or break-to-label constructs.
  • A module’s top-level of code includes, in this order, import statements, functions, classes, and then the setup section (a section of arbitrary code). This reflects the general order in which things happen when a module is loaded: first the imports are executed, then the function and class definitions are parsed and compiled (putting functions first is an arbitrary decision), then the setup code is run. Using this strict ordering is less powerful than with Python style modules in some circumstances, but it cuts off questions about odd edge cases.
  • An ‘import ‘ statement specifies the full name of a file as a relative path (in a slash-direction agnostic manner). The equivalent of Python’s plain ‘import’ is expressed by adding a ‘to’ clause after the file name, e.g. ‘import foo.pyg to bar’; the target specified in the ‘to’ clause is a module-level reference to an associative array object (like a Python dictionary), not a special module object. To achieve the same effect as Python’s ‘from import *’, use the keyword ‘here’, i.e. ‘import foo.pyg here’. (The ‘here’ clause feature will probably just be omitted, as it encourages namespace pollution, and it’s really just another typing saver.)
  • Pygeon has no concept of packages.
  • Pygeon has no print statement, as it provides nothing a function couldn’t do with the simple addition of parentheses.
  • Each file has a module-level scope, but there is no built-in scope. The most basic parts of the standard library are always implicitly imported into every module with ‘import here’, and warnings are issued by the code checker when you reassign these members of the module scope.
  • Functions and classes are always written at the top-level of code (not including constructors and methods which of course go inside their class definitions).
  • Classes in Pygeon have no concept of member access restrictions. Nor, in fact, is there a concept of inheritance or type hierarchy supported in the language—all objects are simply associative arrays that either have some member or they don’t. If you want a class that inherits from another, you’ll have to cut and paste all the parent class stuff into the child. (Obviously this is not acceptable in a real-use language, but it dodges all sorts of otherwise needed bothersome rules, and it emphasizes the true nature of polymorphism: an object can be used in the place of another type of object when it has the necessary members. Besides, it’s a classic beginner mistake to overuse inheritance.)
  • All methods of a class are declared in a ‘class’ block. First goes the optional class setup code (for making static members); second goes the (mandatory) constructor, which is written just like a function but declared with the keyword ‘constructor’ (and given no name); then come the methods, declared with the ‘method’ keyword.
    | class Cat
    |   foo = 3 // static member foo
    |   constructor // an empty constructor
    |   method meow
    |     // do meow stuff
  • Instance members are created via assignment of members to the ‘this’ variable inside the constructor and methods. The instance must always be explicitly specified with the keyword ‘this’, so there is no name collision between local variables of a method and the instance. A method can call another only via this. The class object itself can be accessed in a method by name, and this is how static members are accessed.
  • Pygeon includes exceptions, but there is no Exception type from which all exceptions must derive. The exception mechanism throws an object in which the first element (usually a string) is what the catch blocks look for, e.g. ‘throw ['hi', moreInfo]‘; all other elements of the thrown object are optional and can be used for more information about the error. A catch block header specifies the object to match to (Pygeon first tries an identity match and then tries an equivalence test). The thrown exception object is accessed in a catch block as the keyword ‘ex’ just like it were a variable local to the catch block (obviously ‘ex’ has no meaning outside a catch block). The exception can be repropagated simply with ‘throw ex’. Pygeon has no analog of ‘finally’.
    | try
    |   //...
    |   catch 'hi'  // catches exceptions thrown with 'hi'
    |     print(ex[1]) // print the second element of the exception object
  • When you use a variable in a class, function, constructor, or method, it is taken to be in module scope unless that scope somewhere has an assignment to the same name. The only way to assign to the module scope from within a local scope is via the keyword ‘mod’, a special reference to the current module, e.g. ‘mod.bla = 5′.If you create a local foo, you can still access module-level foo as ‘mod.foo’.
  • After the function and class definitions in the module, you can an arbitrary sequence of statements which is executed when the module is first loaded. Statements in this section execute in the module scope, so they reference the functions, classes, and imported modules of the class directly, and when assigning to a variable here, it is a variable in the module scope.
  • Associative arrays (dictionaries) and indexed arrays (lists) are combined into one kind of object, using [] for the literal syntax, e.g.:
    ['hi', foo:42, 'yo'] // index 0 is 'hi', index 1 is 'yo', foo is 42
  • Unlike in PHP, there’s no weird concept of order given to the non-numeric members of an array, and the indexed members are always accessed via their originally assigned index. Members are accessed with the dot operator, e.g. ‘foo.bla’ (at bla of foo) or ‘foo.3′ (at fourth index of foo); where the index or name of the member is not known at write-time, use a string or integer expression in [], e.g. ‘foo[x]‘ (where x evaluates into a string or integer).
  • Instances are simply arrays, so unlike in Python, Pygeon objects don’t have ‘internal dictionaries’ because they are their own dictionaries.
  • As there is no inheritance, there is no hierarchy of built-in types. The built-in types are: associative/indexed arrays; strings; booleans; and numbers (the language masks the integer/float distinction in the manner of Javascript but with no size constraint). The built-in types do not have any methods because that introduces conceptual difficulties that are hard to explain. Standard library functions provide the missing functionality.
  • There is no casting operator, only conversion functions. Implicit conversions are kept very few.
  • Pygeon dispenses with Python’s operator overloading, as it’s an overly complicated feature. The principle built-in types still retain their most essential operators, but the operators are just built-in shortcuts to standard functions (not methods of the type). There’s no way to use the operators with your own types.

That covers the major highlights. My next post will be about why programming languages are hard to learn, followed by a look at particular languages and what prevents them from being good educational languages.