Wow. Firstly, thanks Gerard for successfully trying this out. You
read them ;). Nonetheless your work will be an excellent source for
auditing the correctness of mine.
- publishing "blead", Change 1 .. Change 30371 - converted by my
scripts. Change 1 is the first commit.
- publishing "Gerard's Perl7" based on recent blead.
- Changes 17649 .. 30364 of maint as head "5.8"
- Change 17639 .. 29402 of blead as "5.9"
- publishing "restorical" (perl 1 .. perl 5.003_07) - converted by
translation of the Perforce depot path. Change 1 is based atop the
perl 5.003 commit in the restorical branch.
Right now none of these contain the filelog metadata.
This still isn't ready for general consumption - early adopters only.
cleaned up one. It should end up a little smaller than the 200MB it is
currently, as well. Until this clean-up is done, don't try to merge
between the branches; cherry-pick instead.
Post by Gerard Goossen
I previously posted a Unicode-handling patch and suggested some
development changes. Since then I have made more
changes. Because posting 1MB patches isn't
very convenient, I have made a publicly accessible repository
with my branch.
- C<use strict qw"refs subs"> is default active
- removal of indirect object syntax (e.g. C<new Foobar>)
- The old Unicode handling change
- removal of C<format> keyword (C<formline> still works)
Except for the Unicode change, the others are obvious, in the sense
that they were already best practices. They are probably good examples
of where I would like Perl development to go.
Because I think there was still some misconception about what my Unicode
patch does, I have attached some documentation explaining the new
git clone git://dev.tty.nl/perl/
This branch corresponds to bleadperl. I have to run a
script to update it, so it might be a few changes behind;
otherwise it should be identical to bleadperl.
- gerard (the default branch)
Very experimental branch. Think of it as a suggestion for
Perl7 development. Contains changes which are often not
backward compatible. I use the name Perl kurila in the branch
to refer to the latest version. Referring to it means you
refer to a moving target.
I try to backport to bleadperl the changes which are backwards
compatible with Perl 5.
Gerard Goossen, tty.nl
ps. I use git because I have experience
with it and Sam Vilain wrote some very nice scripts which can keep
track of bleadperl in git.
See http://www.kernel.org/pub/software/scm/git/docs/tutorial.html for an
introduction to git.
The Debian git package is called git-core.
pps. Subject by Harm
=head2 Characters represented by Code points represented by Bytes
According to the Unicode Standard: I<"Characters are the abstract
representations of the smallest components of written language that
have semantic value.">
The Unicode meaning of character is often B<not> the meaning used by
Perl. Characters used to be represented by bytes: for example, the C type
C<char> is a byte, and in Perl 5 the function C<chr> always returned a
byte for values below 256. The Unicode definition of character is also
B<not> the idea of character that most people have.
Code points are numeric values representing Unicode characters. Because
this is normally the only thing represented, and the term is less ambiguous
than "character", this is the preferred term.
Perl kurila uses the UTF-8 (or UTF-EBCDIC) encoding to store code
points in bytes (8-bit code units, in Unicode terms). Multiple
bytes may be needed to store a single code point.
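
As a rough illustration of the point above, the following sketch shows how many bytes UTF-8 needs for a few code points. It is written for plain Perl 5 with the core C<Encode> module, not for Perl kurila, and the helper name C<utf8_bytes_hex> is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of any branch): return the UTF-8 bytes
# of a single code point as space-separated uppercase hex.
sub utf8_bytes_hex {
    my ($code_point) = @_;
    my $bytes = encode('UTF-8', chr($code_point));
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

# U+0041 (A), U+00E9 (e with acute accent), U+20AC (euro sign)
for my $cp (0x41, 0xE9, 0x20AC) {
    printf "U+%04X -> %s\n", $cp, utf8_bytes_hex($cp);
}
```

On a UTF-8 (non-EBCDIC) platform the three code points take one, two and three bytes respectively.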
=head2 Byte and Code Point Semantics
Perl kurila uses the lexical scope to determine whether to use byte semantics
or code point semantics. The C<bytes> pragma forces byte semantics; the
C<utf8> pragma forces code point semantics. For compatibility reasons
the default behaviour is byte semantics, but this might change.
The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope. See L<bytes>.
The C<utf8> pragma will force code point semantics in a particular
lexical scope. See L<utf8>. The C<utf8> pragma also declares that the
input of the parser is UTF-8. The pragma to force code point semantics
might be changed to C<codepoints>.
Under code point semantics, many operations that formerly operated on
bytes now operate on code points. A code point in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
code points may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
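
To make the difference between the two semantics concrete, here is a small sketch in plain Perl 5, where the C<bytes> pragma already switches C<length> to byte counting within a lexical scope. This is not Perl kurila code: there, per the text above, the default is byte semantics and C<use utf8> would select code point semantics, so the pragmas would sit the other way round. The two sub names are invented for the example.

```perl
use strict;
use warnings;

# Perl 5 default: length() counts code points.
sub length_in_code_points {
    my ($string) = @_;
    return length($string);
}

# The bytes pragma is lexically scoped to this sub body,
# so length() counts the bytes of the internal encoding here.
sub length_in_bytes {
    my ($string) = @_;
    use bytes;
    return length($string);
}

my $euro = "\x{20AC}";    # a single code point, EURO SIGN
printf "code points: %d, bytes: %d\n",
    length_in_code_points($euro), length_in_bytes($euro);
```

The same string is one code point long but three bytes long, which is the "longer sequences of bytes internally" mentioned above.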
=head2 Porting code from perl-5.8.x
Deciding between code point and byte semantics
Perl kurila uses the lexical scope to determine whether to use code point or
byte semantics. Decide which should be used and add the C<utf8> or
C<bytes> pragma accordingly. Currently the default is byte semantics,
but this might change, so it is strongly advised to make the choice
explicit using the C<utf8> or C<bytes> pragmas.
Handling latin1 text
Latin1 text should be decoded upon reading and encoded upon
writing. Using either C<Encode::decode> or using an
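
A minimal sketch of the decode-on-read, encode-on-write round trip with C<Encode::decode> and C<Encode::encode>, as they exist in Perl 5's core C<Encode> module. The sample string and the variable names are assumptions for illustration, and whether Perl kurila needs anything beyond this is not shown here.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Pretend these four bytes were just read from a Latin1 file: "caf" + 0xE9.
my $latin1_bytes = "caf\xE9";

# Decode on input: bytes become code points.
my $text = decode('iso-8859-1', $latin1_bytes);

# Encode on output, either as UTF-8 or back to Latin1.
my $utf8_bytes = encode('UTF-8', $text);
my $round_trip = encode('iso-8859-1', $text);

printf "code points: %d, UTF-8 bytes: %d, Latin1 bytes: %d\n",
    length($text), length($utf8_bytes), length($round_trip);
```

The text is four code points long; it needs five bytes as UTF-8 and four bytes as Latin1, and the Latin1 output matches the original input.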
Not using the UTF-8 flag
There is no UTF-8 flag anymore, so there is no need to preserve or set
it. Functions to modify the flag have been removed or give an error.