Saturday, 30 July 2011





Tokens in Java Programs







Tokens and Java Programs









Introduction
In this lecture we will learn about the lowest level of the Java language:
its tokens.
We will learn how to recognize and classify every category of token
(which is like classifying English words into their parts of speech).
Towards this end, we will employ ourly new learned EBNF skilss to write and
analyze descriptions for each category of token.
In later lectures we will learn about a programming language's higher level
structures: phrases (expressions), sentences (statements),
paragraphs (blocks/methods), chapters (classes), and books (packages).







The Family History of Java
Before going on to study Java, let's take a brief look, through quotes,
at the languages on which Java was based, traveling back over 30 years
to do so.



Where it starts: C
The earliest precursor of Java is C: a language developed by Ken Thompson
at Bell Labs in the early 1970s.
C was used as a system programming language for the DEC PDP-7.
C began achieving its widespread popularity when Bell's Unix operating
system was rewritten in C.
Unix was the first operating system written in a high-level language;
it was distributed to universities for free, where it became popular.
Linux is currently a popular (it is still free!) variant of Unix.
"C is a general-purpose programming language which features economy of
expression, modern control flow and data structures, and a rich set of
operators.
C is not a "very high level" language, nor a "big" one, and is not
specialized to any particular area of application."



- B. Kernighan/D. Ritchie: The C Programming Language

(Kernighan & Ritchie designed and implemented C)





From C to C++
"A programming language serves two related purposes: it provides a vehicle
for the programmer to specify actions to be executed, and it provides a
set of concepts for the programmer to use when thinking about what can
be done.
The first aspect ideally requires a language that is "close to the
machine," so that all important aspects of a machine are handled simply
and efficiently in a way that is reasonably obvious to the programmer.
The C language was primarily designed with this in mind.
The second aspect ideally requires a language that is "close to the
problem to be solved" so that the concepts of a solution can be expressed
directly and concisely.
The facilities added to C to create C++ were primarily designed with this
in mind"
- B. Stroustrup: The C++ Programming Language (2nd Ed)

(Stroustrup designed and implemented C++)





Java as a Successor to C++
"The Java programming language is a general-purpose, concurrent, class-based,
object-oriented language.
It is designed to be simple enough that many programmer can achieve fluency
in the language.
The Java programming language is related to C and C++ but it is organized
rather differently, with a number of aspects of C and C++ omitted and a
few ideas from other languages included.
It is intended to be a production language, not a research language, and
so, as C.A.R. Hoare suggested in his classic paper on language design,
the design has avoided including new and untested features.


...


The Java programming language is a relatively high-level language, in that
details of the machine representation are not available through the
language.
It includes automatic storage management, typically using a garbage
collector, to avoid the safety problems of explicit deallocation (as in
C's free or C++'s delete).
High-performance garbage-collected implementations can have bounded pauses
to support systems programming and real-time applications.
The language does not include any unsafe constructs, such as array accesses
without index checking, since such unsafe constructs would cause a
program to behave in an unspecified way."
- J. Gosling, B. Joy, G. Steele, G. Bracha: The Java Language Specification






Overview of Tokens in Java: The Big 6
In a Java program, all characters are grouped into symbols called
tokens.
Larger language features are built from the first five categories of tokens
(the sixth kind of token is recognized, but is then discarded by the Java
compiler from further processing).
We must learn how to identify all six kind of tokens that can appear in
Java programs.
In EBNF we write one simple rule that captures this structure:
token <= identifier | keyword | separator | operator | literal | comment

We will examine each of these kinds of tokens in more detail below, again
using EBNF.
For now, we briefly describe in English each token type.

  1. Identifiers: names the programmer chooses
  2. Keywords: names already in the programming language
  3. Separators (also known as punctuators): punctuation characters and
    paired-delimiters
  4. Operators: symbols that operate on arguments and produce results
  5. Literals (specified by their type)
    • Numeric: int and double
    • Logical: boolean
    • Textual: char and String
    • Reference: null
  6. Comments
    • Line
    • Block
Finally, we will also examine the concept of white space which is crucial to understanding how the Java compiler separates the characters in a program into a list of tokens; it sometimes helps decide where one token ends and where the next token starts.

The Java Character Set
The full Java character set includes all the
Unicode
characters; there are 216 = 65,536 unicode characters.
Since this character set is very large and its structure very complex, in
this class we will use only the subset of unicode that includes all the
ASCII (pronounced "Ask E") characters; there are 28 = 256
ASCII characters, of which we will still use a small subset containing
alphabetic, numeric, and some special characters.
We can describe the structure of this character set quite simply in EBNF,
using only alternatives in the right hand sides.

lower-case <= a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z

upper-case <= A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z

alphabetic <= lower-case | upper-case

numeric     <= 0|1|2|3|4|5|6|7|8|9

alphanumeric <= alphabetic | numeric

special       <= !|%|^|&|*|(|)|-|+|=|{|}|||~|[|]|\|;|'|:|"|<|>|?|,|.|/|#|@|`|_

graphic     <= alphanumeric | special

In the special rule, the bracket/brace characters stand for themselves
(not EBNF options nor repetitions) and one instance of the vertical bar
stands for itself too: this is the problem one has when the character set
of the language includes special characters that also have meanings in
EBNF.

White space consists of spaces (from the space bar), horizontal and vertical
tabs, line terminators (newlines and formfeeds): all are non-printing
characters, so we must describe them in English.
White space and tokens are closely related: we can use white space to force
the end of one token and the start of another token (i.e., white space is
used to separate tokens).
For example XY is considered to be a single token, while X Y
is considered to be two tokens.
The "white space separates tokens" rule is inoperative inside
String/char literals, and comments (which are all discussed
later).

Adding extra white space (e.g., blank lines, spaces in a line -often for
indenting) to a program changes its appearance but not its meaning to Java:
it still comprises exactly the same tokens in the same order.
Programmers mostly use white space for purely stylistic purposes: to
isolate/emphasize parts of programs and to make them easier to read and
understand.
Just as a good comedian know where to pause when telling a joke; a good
programmer knows where to put white space when writing code.

Identifiers
The first category of token is an Identifier.
Identifiers are used by programmers to name things in Java: things such as
variables, methods, fields, classes, interfaces, exceptions, packages, etc.
The rules for recognizing/forming legal identifiers can be easily stated in
EBNF.
id-start    <= alphabetic | $ | _

identifier <= id-start{id-start | numeric }


Although identifiers can start with and contain the $ character,
we should never include a $ in identifiers that we write;
such identifiers are reserved for use by the compiler, when it needs to
name a special symbol that will not conflict with the names we write.

Semantically, all characters in an identifier are significant, including the
case (upper/lower) of the alphabetic characters.
For example, the identifier Count and count denote different
names in Java; likewise, the identifier R2D2 and R2_D2
denote different names.

When you read programs that I have written, and write your own program,
think carefully about the choices made to create identifiers.

  • Choose descriptive identifiers (mostly starting with lower-case
    letters).
  • Separate different words in an identfier with a case change:
    e.g., idCount; this is called "camel style", with each
    capital letter representing a hump.
  • Apply the "Goldilocks Principle": not too short, not too long, just
    right.
During our later discussions of programming style, we will examine the standard naming conventions that are recommend for use in Java code.
Carefully avoid identifiers that contain dollar signs; avoid
  • homophones (sound alike): aToDConvertor   a2DConvertor
  • homoglyphs (look alike): allOs vs. all0s and
    Allls vs All1s
      which contain the letter (capital) O, number 0, letter (small) l, letter (capital) I, and number 1
  • mirrors: xCount   countX

Keywords
The second category of token is a Keyword, sometimes called a
reserved word.
Keywords are identifiers that Java reserves for its own use.
These identifiers have built-in meanings that cannot change.
Thus, programmers cannot use these identifiers for anything other than their
built-in meanings.
Technically, Java classifies identifiers and keywords as separate categories
of tokens.
The following is a list of all 49 Java keywords we will learn the meaning
of many, but not all,of them in this course.
It would be an excellent idea to print this table, and then check off the
meaning of each keyword when we learn it; some keywords have multiple
meanings, determined by the context in which they are used.

abstractcontinuegotopackageswitch
assertdefaultifprivatethis
booleandoimplementsprotectedthrow
breakdoubleimportpublicthrows
byteelseinstanceofreturntransient
caseextendsintshorttry
catchfinalinterfacestaticvoid
charfinallylongstrictfpvolatile
classfloatnativesuperwhile
constfornewsynchronized
Notice that all Java keywords contain only lower-case letters and are at
least 2 characters long; therefore, if we choose identifiers that are very
short (one character) or that have at least one upper-case letter in them,
we will never have to worry about them clashing with (accidentally being
mistaken for) a keyword.
Also note that in the Metrowerks IDE (if you use my color preferences),
keywords always appear in yellow (while identifiers, and many other tokens,
appear in white).

We could state this same tabular information as a very long (and thus harder
to read) EBNF rule of choices (and we really would have to specify
each of these keywords, and not use "...") looking like

keyword <= abstract | boolean | ... | while

Finally, assert was recently added (in Java 1.4) to the original 48
keywords in Java.

Separators
The third category of token is a Separator (also known as a
punctuator).
There are exactly nine, single character separators in Java, shown in the
following simple EBNF rule.
separator <= ; | , | . | ( | ) | { | } | [ | ]

In the separator rule, the bracket/brace characters stand for
themselves (not EBNF options or repetitions).

Note that the first three separators are tokens that separate/punctuate
other tokens.
The last six separators (3 pairs of 2 each) are also known as delimiters:
wherever a left delimiter appears in a correct Java program, its matching
right delimiter appears soon afterwards (they always come in matched
pairs).
Together, these each pair delimits some other entity.

For example the Java code Math.max(count,limit); contains nine
tokens

  1. an identifier (Math), followed by
  2. a separator (a period), followed by
  3. another identifier (max), followed by
  4. a separator (the left parenthesis delimiter), followed by
  5. an identfier (count), followed by
  6. a separator (a comma), followed by
  7. another identifier(limit), followed by
  8. a separator (the right parenthesis delimiter), followed by
  9. a separator (a semicolon)

Operators
The fourth category of token is an Operator.
Java includes 37 operators that are listed in the table below;
each of these operators consist of 1, 2, or at most 3 special
characters.
=><!~?:
==<=>=!=&&||++--
+-*/&|^% <<>>>>>
+=-=*=/=&=|=^=%= <<=>>=>>=
The keywords instanceof and new are also considered operators
in Java.
This double classification can be a bit confusing; but by the time we
discuss these operators, you'll know enough about programmig to take them
in stride.

It is important to understand that Java always tries to construct the
longest token from the characters that it is reading.
So, >>= is read as one token, not as the three tokens >
and > and =, nor as the two tokens >> and
=, nor even as the two tokens > and >=.

Of course, we can always use white space to force Java to recognize separate
tokens of any combination of these characters:
writing >   >= is the two tokens > and >=.

We could state this same tabular information as a very long (and thus harder
to read) EBNF rule of choices (and we really would have to specify each of
these operators, and not use "...") looking like

operator <=   = | > | ... | >>= | instanceof | new


Types and Literals
The fifth, and most complicated category of tokens is the Literal.
All values that we write in a program are literals: each belongs to one of
Java's four primitive types (int, double, boolean,
char) or belongs to the special reference type String.
All primitive type names are keywords in Java; the String reference
type names a class in the standard Java library, which we will learn much
more about soon.
A value (of any type) written in a Java program is called a literal;
and, each written literal belongs in (or is said to have) exactly one type.
literal <= integer-literal | floating-point-literal | boolean-literal

               
| character-literal
| string-literal
| null-literal

Here are some examples of literals of each of these types.

Literaltype
1int
3.14double (1. is a double too)
trueboolean
'3'char ('P' and '+' are char too)
"CMU ID"String
nullany reference type
The next six sections discuss each of these types of literals, in more detail.

int Literals
Literals of the primitive type int represent countable, discrete
quantities (values with no fractions nor decimal places
possible/necessary).
We can specify the EBNF for an int literal in Java as
non-zero-digit     <= 1|2|3|4|5|6|7|8|9

digit                     <= 0 | non-zero-digit

digits                   <= digit{digit}

decimal-numeral <= 0 | non-zero-digit[digits]

integer-literal      <= decimal-numeral

                              | octal-numeral

                              | hexidecimal-numeral

This EBNF specifies only decimal (base 10) literals.
In Java literals can also be written in ocal (base 8) and hexidecimal
(base 16).
I have omitted the EBNF rules for forming these kinds of numbers, because we
will use base 10 exclusively.
Thus, the rules shown above are correct, but not complete.

By the EBNF rules above, note that the symbol 015 does not look like a
legal integer-literal; it is certainly not a
decimal-numeral, because it starts with a zero.
But, in fact, it is an octal-numeral (whose EBNF is not shown).
Never start an integer-literal with a 0 (unless its value is
zero), because starting with a 0 in Java signifies the literal is
being written as an octal (base 8) number: e.g., writing 015 refers
to an octal value, whose decimal (base 10) value is 13!
So writing a leading zero in an integer can get you very confused about what
you said to the computer.

Finally, note that there are no negative literals: we will see soon how to
compute such values from the negate arithmetic operator and a positive
literal (writing -1 is exactly such a construct).
This is a detail: a distinction without much difference.

double Literals
Literals of the primtive type double represent measureable quantities.
Like real numbers in mathematics, they can represent fractions and numbers
with decimal places.
We can specify the EBNF for a double literal in Java as
exponent-indicator   <= e | E

exponent-part           <= exponent-indicator [+|-]digits

floating-point-literal <= digits exponent-part

                       
       
| digits.[digits][exponent-part]

                       
       
| .digits[exponent-part]

This EBNF specifies a floating-point-literal to contain various
combinations of a decimal point and exponent (so long as one -or both- are
present); if neither is present then the literal must be classified as an
int-literal.
The exponent-indicator (E or e) should be read to mean
"times 10 raised to the power of".

Like literals of the type int, all double literals are
non-negative (although they may contain negative exponents).
Using E or e means that we can specify very large or small
values easily
(3.518E+15 is equivalent to 3.518 times 10 raised to the
power of 15, or 3518000000000000.; and 3.518E-15 is
equivalent to 3.518 times 10 raised to the power of -15, or
.000000000000003518)
In fact, any literal with an exponent-part is a double: so even
writing 1E3 is equivalent to writing 1.E3, which are both
equivalent to writing 1000.
Note this does not mean the int literal 1000!

Finally, all double literals must be written in base 10 (unlike
int literals, which can be written in octal or hexadecimal)

boolean Literals
The type name boolean honors George Boole, a 19th century English
mathematician who revolutionized the study of logic by making it more
like arithmetic.
He invented a method for calculating with truth values and an algebra for
reasoning about these calculations.
Boole's methods are used extensively today in the engineering of hardware
and software systems.
Literals of the primitive type boolean represent on/off, yes/no,
present/absent, ... data.
There are only two values of this primtive type, so its ENBF rule is
trivially written as

boolean-literal <= true | false


In Java, although these values look like identifiers, they are classified as
literal tokens (just as all the keywords also look like identifiers, but
are classified differently).
Therefore, 100 and true are both literal tokens in Java (of
type int and boolean respectively).

Students who are familiar with numbers sometimes have a hard time accepting
true as a value; but that is exactly what it is in Java.
We will soon learn logical operators that compute with these values of the
type boolean just as arithmetic operators compute with values of
the type int.

char Literals
The first type of text literal is a char.
This word can be pronounced in many ways: care, car, or as in
charcoal
(I'll use this last pronunciation).
Literals of this primitive type represent exactly one character inside
single quotes.
Its EBNF rule is written
character-literal <= 'graphic' | 'space' | 'escape-sequence'

where the middle option is a space between single quotes.
Examples are 'X', or 'x', or '?', or ' ', or
'\n', etc. (see below for a list of some useful escape sequences).

Note that 'X' is classified just as a literal token (of the primitive
type char); it is NOT classified as an identifier token inside two
separator tokens!

String Literals
The second type of text literal is a String.
Literals of this reference type (the only one in this bunch; it is not a
primitive type) represent zero, one, or more characters:
Its EBNF is written
string-literal <= "{graphic | space | escape-sequence}"

Examples are: "\n\nEnter your SSN:", or
"" (the empty String), or
"X" (a one character String, which is different from a
char).

Note that "CMU" is classified just as a literal token (of the
reference type String); it is NOT classified as an identifier token
inside two separator tokens!

Escape Sequences
Sometimes you will see an escape-sequence inside the single-quotes
for a character-literal or one or more inside double-quotes for a
string-literal (see above);
each escape sequence is translated into a character that prints in some
"special" way.
Some commonly used escape sequences are
Escape SequenceMeaning
\nnew line
\thorizontal tab
\vvertical tab
\bbackspace
\rcarriage return
\fform feed
\abell
\\\ (needed to denote \ in a text literal)
\'' (does not act as the right ' of a char literal)
\"" (does not act as the right " of a String literal)
So, in the String literal "He said, \"Hi.\"" neither escape
sequence \" acts to end the String literal: each represents
a double-quote that is part of the String literal, which displays
as He said, "Hi."

If we output "Pack\nage", Java would print on the console
Pack
age
with the escape sequence \n causing Java to immediately terminate the
current line and start at the beginning of a new line.
There are other ways in Java to write escape sequences (dealing with unicode
represented by octal numbers) that we will not cover here, nor need in the
course.
The only escape sequence that we wil use with any frequency is \n.

The null Reference Literal
There is a very simple, special kind of literal that is used to represent a
special value with every reference type in Java (so far we know only one,
the type String).
For completeness we will list it here, and learn about its use a bit later.
Its trivial EBNF rule is written
null-literal <= null

So, as we learned with boolean literals, null is a literal in
Java, not an identifier.

Bounded Numeric Types
Although there are an infinite number of integers in mathematics, values in
the int type are limited to the range from -2,147,483,648 to
2,147,483,647.
We will explore this limitation later in the course, but for now we will not
worry about it.
Likewise, although there are an infinite number of reals in mathematics,
values in the double type are limited to the range from

-1.79769313486231570x10308
to
1.79769313486231570x10308; the smallest non-zero, positive value
is
4.94065645841246544x10-324.
Values in this type can have up to about 15 significant digits.
For most engineering and science calculations, this range and precision are
adequate.

In fact, there are other primitive numeric types (which are also keywords):
short, long, and float.
These types are variants of int and double and are not as
widely useful as these more standard types, so we will not cover them in
this course.

Finally, there is a reference type named BigInteger, which can
represent any number of digits in an integer (up to the memory capacity of
the machine).
Such a type is very powerful (because it can represent any integer), but
costly to use (in execution time and computer space) compared to int.
Most programs can live with the "small" integer values specified above; but,
we will also study this reference type soon, and write programs using it.



Comments
The sixth and final category of tokens is the Comment.
Comments allow us to place any form of documentation inside our Java code.
They can contain anything that we can type on the keyboard: English,
mathematics, even low-resolution pictures.
In general, Java recognizes comments as tokens, but then excludes these
tokens from further processing; technically, it treats them as white space
when it is forming tokens.
Comments help us capture aspects of our programs that cannot be expressed as
Java code.
Things like goals, specification, design structures, time/space tradeoffs,
historical information, advice for using/modifying this code, etc.
Programmers intensely study their own code (or the code of others) when
maintaining it (testing, debugging or modifying it).
Good comments in code make all these tasks much easier.

Java includes two style for comments.

  • Line-Oriented: begins with // and continues until the end of
    the line.
  • Block-Oriented: begins with /* and continues (possibly over
    many lines) until */ is reached.
    • So, we can use block-oriented comments to create multiple comments
      within a line

          display(/*Value*/ x, /*on device*/ d);

      In contrast, once a line-oriented comment starts, everything
      afterward on its line is included in the comment.
    • We can also use block-oriented comments to span multiple lines

      /*

          This is a multi-line comment.

          No matter home many lines

          it includes, only one pair

          of delimiters are needed.

      */


      In contrast, a line-oriented comment stops at the end of the line
      it starts on.
Technically, both kinds of comments are treated as white space, so writing
X/*comment*/Y has the same meaning in Java as writing the tokens
X and Y, not the single token XY.
Typically Java comments are line-oriented; we will save block-oriented
comments for a special debugging purpose (discussed later).

The EBNF rule for comments is more complicated than insightful, so we will
not study here.
This happens once in a while.





Program are a Sequence of Tokens built from Characters
The first phase, a Java compiler tokenizes a program by scanning its
characters left to right, top to bottom (there is a hidden end-of-line
character at the end of each line; recall that it is equivalent to white
space), and combining selected characters into tokens.
It works by repeating the following algorithm (an algorithm is a precise set
of instructions):
  • Skip any white space...
  • ...if the next character is an underscore, dollar, or alphabetic
    character, it builds an identifier token.
    • Except for recognizing keywords and certain literals (true,
      false, null) which all share the form of identifiers,
      but are not themselves identifiers
  • ...if the next character is a numeric character, ' or ", it builds a
    literal token.
  • ...if the next character is a period, that is a seperator unless the
    character after it is a numeric character (in which case it builds a
    double literal).
  • ...if the next two characters are a // or /* starting a comment, it
    builds a comment token.
  • ...if the next character is anything else, it builds a separator or
    operator token (trying to build the longest token, given that white
    space separates tokens, except in a char or String
    literal).
Recall that white space (except when inside a textual literal or comment) separates tokens.
Also, the Java compiler uses the "longest token rule": it includes characters in a token until it reaches a character that cannot be included.
Finally, after building and recognizing each token, the Java compiler passes all tokens (except for comments, which are ignored after being tokenized) on to the next phase of the compiler.

Common Mistakes
I have seen the following mistakes made repeatedly by beginning students
trying to tokenize programs.
Try to understand each of these subtle points.
  • Tokenizing x as a char literal: it is an identifier.
  • Tokenizing 10.5 as two int literals separated by a
    period: it is a double literal.
  • Tokenizing int as a literal: it is a keyword, that happens to
    name a type in Java.
    Tokens like 1 are literals whose type is int; the token
    int is a keyword.
  • Tokenizing "Hi" as two separators with the identifier
    Hi in between: it is a single String literal.
  • Tokenizing something like }; as one separator token: it is
    really two separate separators.
  • Tokenizing something like += as two separate operator tokens
    (because + and = are operators): it is really one
    large token (because += is an operator).
  • Forgetting to tokenize parentheses, semicolons, and other separators:
    everything except white space belongs in some token.
  • Creaing tokens inside comments: each comment is one big token that
    includes all the characters in the comment.

A Simple Program
The following program will serve as a model of Input/Calculate/Output
programs in Java.
Here are some highlights
  • A large, multi-line (oriented) comment appears at the top of the
    program.
    Line-oriented comments appear at various other locations in the
    program.
  • The Prompt class is imported from the
    edu.cmu.cs.pattis.cs151xx package.
  • The Application class is declared.
  • Its main method is declared; its body (the statements it
    executes) is placed between the { and } delimiters.
  • Each simple statement in the body is ended by a semicolon (;)
    separator.
  • Three variables storing double values are declared.
  • The user is prompted for the value to store in the first two variables.
  • The third variable's value is computed and stored.
  • The third variable's value is printed (after printing a blank line).
Besides just reading this program, practice tokenzing it.
//////////////////////////////////////////////////////////////////////////////
//////////////////////////////////////////////////////////////////////////////
//
// Description:
//
//   This program computes the time it take to drop an object (in a vacuum)
// form an arbitrary height in an arbitrary gravitational field (so it can
// be used to calculate drop times on other planets). It models a straight
// input/calculate/output program: the user enters the gravitation field
// and then the height; it calculates thd drop time and then prints in on
// the console.
//
//////////////////////////////////////////////////////////////////////////////
//////////////////////////////////////////////////////////////////////////////


import edu.cmu.cs.pattis.cs151xx.Prompt;


public class Application {


  public static void main(String[] args)
    {
      try {

        double gravity;        //meter/sec/sec
        double height;         //meters
        double time;           //sec
		  
		  
        //Input
		  
        gravity = Prompt.forDouble("Enter gravitational acceleration (in meters/sec/sec)");
        height  = Prompt.forDouble("Enter height of drop (in meters)");
		  
		  
        //Calculate
		  
        time = Math.sqrt(2.*height/gravity);
		  
		  
        //Output
		  
        System.out.println("\nDrop time = " + time + " secs");

		  
      }catch (Exception e) {
        e.printStackTrace();
        System.out.println("main method in Application class terminating");
        System.exit(0);  
   }

}

How Experts See Programs
In the 1940s, a Dutch psychologist named DeGroot was doing research on chess
experts.
He performed the following experiment: He sat chess experts down in front of
an empty chessboard, all the chess pieces, and a curtain.
Behind the curtain was a chessboard with its pieces arranged about 35 moves
into a game.
The curtain was raised for one minute and then lowered.
The chess experts were asked to reconstruct what they remembered from seeing
the chessboard behind the curtain.
In most cases, the chess experts were able to completely reconstruct the
board that they saw.
The same experiment was conducted with chess novices, but most were able to
remember only a few locations of the pieces.
These results could be interpreted as, "Chess experts have much better
memories than novices."

So, DeGroot performed a second (similar) experiment.
In the second experiment, the board behind the curtain had the same number
of chess pieces, but they were randomly placed on the board; they did not
represent an ongoing game.
In this modified experiment, the chess experts did only marginally better
than the novices.
DeGroot's conclusion was that chess experts saw the board differently than
novices: they saw not only pieces, but attacking and defending structures,
board control, etc.

In this class, I am trying to teach you how to see programs as a programmer
sees them: not as a sequence of characters, but at a higher structural
level.
Tokens is where we start this process.

Problem Set
To ensure that you understand all the material in this lecture, please solve
the the announced problems after you read the lecture.
If you get stumped on any problem, go back and read the relevant part of the
lecture.
If you still have questions, please get help from the Instructor, a CA, a Tutor,
or any other student.

  1. Classify each of the following as a legal or illegal identifier.
    If it is illegal, propose a legal identifier that can take its place
    (a homophone or homoglyph)
    packAge x12 2Lips
    xOrY sum of squares %Raise
    termInAte u235 $Bill
    x_1 x&Y 1derBoys

  2. What tokens does Java build from the characters ab=c+++d==e.
    Be sure that you know your Operators.

  3. Classify each of the following numeric literals as int, or double, or
    illegal (neither); write the equivalent value of each double without using
    E notation; for each illegal literal, write a legal one with the "same" value.
    5. 3.1415 17
    17.0 1E3 1.E3
    .5E-3 5.4x103 50E-1
    1,024 0.087 .087

  4. What is the difference between 5, 5., five,
    '5', and "5"?
    What is the difference between true and "true"?

  5. Write a String literal that includes the characters
    I've said a million times, "Do not exaggerate!"

  6. How does Java classify each of the following lines

        "//To be or not to be"

        //"To be or not to be"

  7. Does the following line contain one comment or two?

       //A comment //Another comment?

  8. Explain whether X/**/Y is equivalent to XY or X   Y.

  9. Tokenize the following Java Code (be careful): -15

  10. Tokenize the following line of Java code: identify every Java token as either an
    Identifier, Keyword, Separator, Operator, Literal (for any literal, also specify its type), or Comment.
    Which (if any) identifiers are keywords?

    int X = Prompt.forInt("SSN",0,999999999); //Filter && use

  11. Choose an appropriate type to represent each of the following pieces of information
    • the number of characters in a file
    • a time of day (accurate to 1 second)
    • the middle initial in a name
    • whether the left mouse button is currently pushed
    • the position of a rotary switch (with 5 positions)
    • the temperature of a blast furnace
    • an indication of whether one quantity is less than, equal to or greater than another
    • the name of a company
  12. This problem (it is tricky, so do it carefully) shows a difficulty with using
    Block-Oriented comments.
    Tokenize the following two lines of Java code: identify every token as either an
    Identifier, Keyword, Separator, Operator, Literal, or Comment. What problem arises?

    x = 0;  /* Initialize x and y to
      y = 1;     their starting values */
    Rewrite the code shown above with Line-Oriented comments instead, to avoid this problem.
    How can our use of my Java preferences help us avoid this error?

  13. This problem (it is tricky, so do it carefully) shows another difficulty with using
    Block-Oriented comments.
    Tokenize the following Java code: identify every token as either an
    Identifier, Keyword, Separator, Operator, Literal, or Comment. What problem arises?

    /*
        here is an outer
        comment with an
        /* inner comment inside */
        and the finish of the outer
        comment at the end
      */
    Rewrite the code shown above with Line-Oriented comments instead, to avoid this problem.
    How can our use of my Java preferences help us avoid this error?


  14. Explain why language designers are very reluctant to add new keywords
    to a programming language.
    Hint: what problem might this cause in already-written programs?

No comments: