I am about to port Texy! to Java.
That is a long-term plan, though, so for now this page is a placeholder.
The Texy! author, David Grudl, created a test suite of several Texy!-formatted files, which can be used to validate an implementation.
It is available at http://download.texy.info/refs.zip .
I am looking for a Java regexp library with support for recursion, like:
<a+(?0)>
See http://www.php.net/…ecursive.php.
I need it for this expression:
(?:mUi)^/--++ *+(.*)(?: *(?<= |^)\\.((?:\\([^)\\n]+\\)|\\[[^\\]\\n]+\\]|\\{[^}\\n]+\\}|<>|>|=|<){1,4}?))?$((?:\\n.*+)*)(?:\\n(?0)|\\n\\\\--.*$|\\z)
The JDK does not support it, and neither does ORO or Stevesoft Pat.
dev.java.net is sooooo slooow… Gone to Google Code: http://code.google.com/p/jtexy/
ORO development stalled in 2004? Is Jakarta Regexp its successor? Is Jakarta Regexp Perl-compatible? ORO claims to be; Regexp does not.
    // parse loop
    $matches = array();
    $priority = 0;
    foreach ($this->patterns as $name => $pattern) {
        preg_match_all(
            $pattern['pattern'],
            $text,
            $ms,
            PREG_OFFSET_CAPTURE | PREG_SET_ORDER
        );
        foreach ($ms as $m) {
            $offset = $m[0][1];
            foreach ($m as $k => $v) $m[$k] = $v[0];
            $matches[] = array($offset, $name, $m, $priority);
        }
        $priority++;
    }
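PREG_OFFSET_CAPTURE – makes the returned multidimensional array even more multidimensional: every entry becomes a pair with the matched string at [0] and its byte offset at [1] (well, PHP is really that shitty).
PREG_SET_ORDER – groups the results by match: $ms[0] holds the whole first match followed by its sub-groups, $ms[1] the second match, and so on. The default, PREG_PATTERN_ORDER, groups them by sub-group instead ($ms[0] holds all whole matches, $ms[1] all first sub-groups).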
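A minimal sketch of how the parse loop above might look in Java. All names here (ParseLoopSketch, Found, collect) are made up for illustration and are not taken from Texy or JTexy. It assumes the patterns are already compiled and kept in registration order. Note that Matcher.start() is a character offset, while PHP reports byte offsets.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical names throughout; this is not JTexy's actual code.
final class ParseLoopSketch {

    /** One match: its start offset, the pattern name, its groups, and the pattern's priority. */
    static final class Found {
        final int offset;
        final String name;
        final List<String> groups;
        final int priority;

        Found(int offset, String name, List<String> groups, int priority) {
            this.offset = offset;
            this.name = name;
            this.groups = groups;
            this.priority = priority;
        }
    }

    /** Runs every pattern over the text, in registration order. */
    static List<Found> collect(Map<String, Pattern> patterns, String text) {
        List<Found> matches = new ArrayList<>();
        int priority = 0;
        for (Map.Entry<String, Pattern> e : patterns.entrySet()) {
            Matcher m = e.getValue().matcher(text);
            while (m.find()) { // one iteration per match, like PREG_SET_ORDER
                List<String> groups = new ArrayList<>();
                for (int g = 0; g <= m.groupCount(); g++) {
                    groups.add(m.group(g)); // group 0 is the whole match; null if a group did not participate
                }
                matches.add(new Found(m.start(), e.getKey(), groups, priority));
            }
            priority++;
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        patterns.put("strong", Pattern.compile("\\*\\*(.+?)\\*\\*"));
        patterns.put("code", Pattern.compile("`(.+?)`"));
        for (Found f : collect(patterns, "a **b** and `c` and **d**")) {
            System.out.println(f.offset + " " + f.name + " " + f.groups + " " + f.priority);
        }
    }
}
```

The offset comes straight from the Matcher, so the extra [string, offset] nesting of PREG_OFFSET_CAPTURE is not needed.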
|...|U makes all quantifiers ungreedy:
You could also make ALL the quantifiers in a regular expression „ungreedy“ by using the U modifier. http://www.skdevelopment.com/…ressions.php
No Java library supports global ungreediness! :-((
* StackOverflow post:
* Java forums post:
* Java forums archive:
* Ales Novak – ales A netbeans.com
* java.util.regex JavaDoc: http://java.sun.com/…Pattern.html
* java.util.regex tester: http://www.regexplanet.com/…e/index.html
* OpenJDK should support LAZY
* There's a PHP port to Java: http://www.java2s.com/…java-doc.htm
* I could hack the JDK code: http://hg.openjdk.java.net/…Pattern.java
* http://hg.openjdk.java.net/…Pattern.java
* 2721 private void addFlag(), 2760 private void subFlag()
preg_match_all(): Is this the same as multiple calls to preg_match()?
→ preg_match() returns the number of times the pattern matches. That will be either 0 (no match) or 1, because preg_match() stops searching after the first match. preg_match_all(), on the contrary, continues until it reaches the end of the subject.
Is there something similar in Java, or do I have to iterate?
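As far as I know, java.util.regex has no single call that fills an array the way preg_match_all() does, so iterating with Matcher.find() seems to be the way. A small sketch; the helper names pregMatch and pregMatchAll are made up here to mirror the PHP functions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helpers mirroring the two PHP functions.
final class PregMatchSketch {

    /** Like preg_match(): stops after the first match, so the result is 0 or 1. */
    static int pregMatch(Pattern pattern, String subject) {
        return pattern.matcher(subject).find() ? 1 : 0;
    }

    /** Like preg_match_all(): keeps searching until the end of the subject. */
    static List<MatchResult> pregMatchAll(Pattern pattern, String subject) {
        List<MatchResult> all = new ArrayList<>();
        Matcher m = pattern.matcher(subject);
        while (m.find()) {
            all.add(m.toMatchResult()); // snapshot of the current match, unaffected by later find() calls
        }
        return all;
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("\\d+");
        System.out.println(pregMatch(digits, "a1b22c333"));            // 1
        System.out.println(pregMatchAll(digits, "a1b22c333").size());  // 3
    }
}
```

On Java 9 and later, Matcher.results() returns the same matches as a stream, which saves writing the loop.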
The original design is, with all due credit to the author, … not perfect:
* the way modules are registered
* and the whole callback system
RegExp.Patterns.php – slash hell – do the slashes belong to the RE or to PHP?
protect() – will I preserve it? Yes.
Is dom4j or jDOM enough for #JTexy? So far it seems so.
modifier.decorate( elmRet ) doesn't need a texy reference – it should keep one from its creation.
Why doesn't Texy keep $handler, $pattern, and $name together in a class? function registerLinePattern($handler, $pattern, $name)
WTF, list(, $mParam, $mMod, $mContent) = $matches; everywhere? Again, why not put it in a class?
$pattern . 'Am', // anchored & multiline
→ #JTexy: Is it the same as a prepended ^ ?
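Probably not. As I read the PCRE docs, the A modifier forces the match to start exactly where the search starts, while a ^ under the m modifier matches at the start of any line. In Java the closest thing I know of is Matcher.lookingAt(), combined with region() when the search starts at an offset. A sketch of the difference; the helper name matchesAt is made up:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; contrasts an anchored match with a multiline ^.
final class AnchoredSketch {

    /** Emulates PCRE's A modifier: the match must begin exactly at the given offset. */
    static boolean matchesAt(Pattern pattern, String subject, int offset) {
        Matcher m = pattern.matcher(subject);
        m.region(offset, subject.length());
        return m.lookingAt(); // anchored at the region start, need not reach the end
    }

    public static void main(String[] args) {
        String text = "first\nsecond";
        Pattern plain = Pattern.compile("second", Pattern.MULTILINE);
        Pattern caret = Pattern.compile("^second", Pattern.MULTILINE);

        System.out.println(caret.matcher(text).find()); // true: ^ matches at the start of line 2
        System.out.println(matchesAt(plain, text, 0));  // false: "second" does not start at offset 0
        System.out.println(matchesAt(plain, text, 6));  // true: it starts exactly at offset 6
    }
}
```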
TexyParser.php @ 169: $priority++; – later patterns have higher priority? No, the other way around.
What is TexyLineParser.again good for? Never set to true?
Hello,
I have a few suggestions for a slight refactoring of the parser:
1) Introduce a class for patterns; store the handlers with the patterns directly, as function references.
    public class RegexpInfo {
        public String name;
        public String perlRegexp;
        public String regexp;
        public String flags;
        public String mode;
        public String htmlElement;

        // Handler of this pattern - a "callback".
        public PatternHandler handler;

        public enum Type { LINE, BLOCK }
        public Type type = Type.LINE;
    }

2) In the parser: convert the results of preg_match_all() to an array of objects instead of an array of arrays (… of arrays of arrays, as is customary in PHP).

    class ParserMatchInfo implements Comparable<ParserMatchInfo> {
        public final RegexpInfo pattern;
        public final List<MatchWithOffset> groups;
        public final int offset;
        public final int priority;
        // constructor and compareTo() omitted
    }

Such a class is also a natural place for the cmp() function.
3) When creating the object that holds the match information, store a reference to the pattern that was used rather than its name.
First, this saves one lookup; second, it makes the code clearer.
4) In variable names, use the term "group" for the groups of a regular-expression match, and keep "match" for the whole match.
That is:

    preg_match_all( $pattern['pattern'], $text, $matches, PREG_SET_ORDER );
    $matches[0];    // --> First match
    $matches[0][0]; // --> Group - matched text of the whole first match.
    $matches[0][1]; // --> First sub-group of the first match.

Then there will be no need to name variables $ms, $matches, $m, $mMatches, …
5) Move TexyParagraphModule.process() into TexyBlockParser.
6) Introduce a ParserEvent class, inherit from it, and use the subclasses for the event model.
These are all internal matters, so they do not bother anyone much, but since it is open source and someone might want to hack on it someday… ;-)
Ondra
PS: The code is in Java, because the refactoring ideas come from the emerging JTexy.
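Points 2 and 3 of the letter above could be sketched roughly like this. The class and method names are made up, and the ordering is an assumption on my side: earlier offset first, and for equal offsets the lower priority number first (earlier registered patterns win, as noted above for TexyParser.php).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; the ordering rule is an assumption, not taken from Texy's cmp().
final class MatchOrderSketch {

    /** Stand-in for the RegexpInfo class from the letter; only the name is kept here. */
    static final class PatternRef {
        final String name;

        PatternRef(String name) {
            this.name = name;
        }
    }

    /** A match that keeps a reference to its pattern instead of the pattern's name. */
    static final class MatchInfo implements Comparable<MatchInfo> {
        final PatternRef pattern;
        final int offset;
        final int priority;

        MatchInfo(PatternRef pattern, int offset, int priority) {
            this.pattern = pattern;
            this.offset = offset;
            this.priority = priority;
        }

        @Override
        public int compareTo(MatchInfo other) {
            int byOffset = Integer.compare(offset, other.offset);
            return byOffset != 0 ? byOffset : Integer.compare(priority, other.priority);
        }
    }

    /** Sorts a copy of the matches and returns the pattern names in the resulting order. */
    static List<String> sortedNames(List<MatchInfo> matches) {
        List<MatchInfo> sorted = new ArrayList<>(matches);
        Collections.sort(sorted);
        List<String> names = new ArrayList<>();
        for (MatchInfo m : sorted) {
            names.add(m.pattern.name); // no lookup by name needed
        }
        return names;
    }

    public static void main(String[] args) {
        List<MatchInfo> found = new ArrayList<>();
        found.add(new MatchInfo(new PatternRef("b"), 5, 0));
        found.add(new MatchInfo(new PatternRef("a"), 2, 1));
        found.add(new MatchInfo(new PatternRef("c"), 2, 0));
        System.out.println(sortedNames(found)); // [c, a, b]
    }
}
```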
Regex: Support for UNGREEDY or (?U) flag
PCRE regular expressions (as used by PHP) have „/…/U“ and „(?U)“. Many text-processing tools based on regexps use this extensively (e.g. Texy).
These tools are written in a way that makes it quite hard to add the ungreedy (reluctant) mode, using .*?, to all quantifiers: they have many regular expressions (a few dozen to hundreds), and many of them change with each release.
Regex: „.*foo“ with UNGREEDY flag, or „(?U).*foo“
String: „AAAfooBBBfoo“
Desired result (successive matches):
match 1: AAAfoo
match 2: BBBfoo
Current result:
match 1: AAAfooBBBfoo
    String inputStr = "AAAfooBBBfoo";
    String patternStr = ".*foo";
    Pattern pattern = Pattern.compile(patternStr);
    Matcher matcher = pattern.matcher(inputStr);
    … etc.
Well, changing all quantifiers to be reluctant. Sigh.
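For the record, this is what the manual workaround looks like on the example from the report: the same input matched once with the greedy quantifier and once with the reluctant one written out by hand. The helper name findAll is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of greedy vs. hand-written reluctant quantifiers.
final class ReluctantSketch {

    /** Collects the text of every successive match of regex in input. */
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "AAAfooBBBfoo";
        System.out.println(findAll(".*foo", input));  // greedy:    [AAAfooBBBfoo]
        System.out.println(findAll(".*?foo", input)); // reluctant: [AAAfoo, BBBfoo]
    }
}
```

This only shows the behaviour for one pattern. It does not solve the real problem, which is rewriting dozens of changing expressions.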
Texy is without doubt one of the treasures of Czech open source. It is a pity that so far it exists only for PHP.
Hello, would anyone by any chance like to implement a Texy translator in Java?
It could be a good topic for a project / bachelor's thesis… There is no formal specification, only the syntax description and usage examples…
The Java community would surely celebrate you :-)
Ondra
Texy's author, David Grudl, has put together a set of test files that can be used to verify a translator implementation. Available at http://download.texy.info/refs.zip
The alpha and omega of the original PHP implementation is PCRE. A Java library that handles Perl 5 compatible regular expressions is at http://jakarta.apache.org/oro/ .
Ondra Žižka