What Lucene Is
Lucene is a Java library that adds text indexing and searching capabilities to an application. It is not a complete application that one can just download, install, and run. It offers a simple, yet powerful core API. To start using it, one needs to know only a few Lucene classes and methods.
Lucene offers two main services: text indexing and text searching. These two activities are relatively independent of each other, although indexing naturally affects searching. In this article I will focus on text indexing, and we will look at some of the core Lucene classes that provide text indexing capabilities.
Lucene Background
Lucene was originally written by Doug Cutting and was available for download from SourceForge. It joined the Apache Software Foundation's Jakarta family of open source server-side Java products in September of 2001. With each release since then, the project has enjoyed more visibility, attracting more users and developers. As of November 2002, Lucene version 1.2 has been released, with version 1.3 in the works. In addition to those organizations mentioned on the "Powered by Lucene" page, I have heard of FedEx, Overture, Mayo Clinic, Hewlett Packard, New Scientist magazine, Epiphany, and others using, or at least evaluating, Lucene.
Installing Lucene
Like most other Jakarta projects, Lucene is distributed as pre-compiled binaries or in source form. You can download the latest official release from Lucene's release page. There are also nightly builds, if you'd like to use the newest features. To demonstrate Lucene usage, I will assume that you will use the pre-compiled distribution. Simply download the Lucene .jar file and add its path to your CLASSPATH environment variable. If you choose to get the source distribution and build it yourself, you will need Jakarta Ant and JavaCC, which is available as a free download. Although the company that created JavaCC no longer exists, you can still get JavaCC from the URL listed in the References section of this article.
Indexing with Lucene
Before we jump into code, let's look at some of the fundamental Lucene classes for indexing text. They are IndexWriter, Analyzer, Document, and Field.

IndexWriter is used to create a new index and to add Documents to an existing index.

Before text is indexed, it is passed through an Analyzer. Analyzers are in charge of extracting indexable tokens out of text to be indexed, and eliminating the rest. Lucene comes with a few different Analyzer implementations. Some of them deal with skipping stop words (frequently used words that don't help distinguish one document from another, such as "a," "an," "the," "in," "on," etc.), some deal with converting all tokens to lowercase letters, so that searches are not case-sensitive, and so on.
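To make this concrete, here is a small, hypothetical demo class (not part of the article's example code) that prints the tokens StandardAnalyzer produces for a sample string. It uses the tokenStream(Reader) method of the Lucene 1.x Analyzer API, the same method our custom Analyzer will implement later in this article:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

/**
 * AnalyzerDemo is a hypothetical example (not from the
 * original article) that prints the tokens produced by
 * StandardAnalyzer for a sample string.
 */
public class AnalyzerDemo
{
    public static void main(String[] args) throws Exception
    {
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream stream =
            analyzer.tokenStream(new StringReader("The Quick Brown Fox"));

        // next() returns null once the stream is exhausted
        for (Token token = stream.next();
             token != null;
             token = stream.next())
        {
            System.out.println(token.termText());
        }
        // Prints "quick", "brown", "fox": the stop word "The" is
        // removed and the remaining tokens are lower-cased.
    }
}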
An index consists of a set of Documents, and each Document consists of one or more Fields. Each Field has a name and a value. Think of a Document as a row in an RDBMS table, and of Fields as the columns in that row.

Now, let's consider the simplest scenario, where you have a piece of text to index, stored in an instance of String. Here is how you could do it, using the classes described above:
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/**
 * LuceneIndexExample class provides a simple
 * example of indexing with Lucene. It creates a fresh
 * index called "index-1" in a temporary directory every
 * time it is invoked and adds a single document with a
 * single field to it.
 */
public class LuceneIndexExample
{
    public static void main(String[] args) throws Exception
    {
        String text = "This is the text to index with Lucene";

        String indexDir =
            System.getProperty("java.io.tmpdir", "tmp") +
            System.getProperty("file.separator") + "index-1";

        Analyzer analyzer = new StandardAnalyzer();
        boolean createFlag = true;

        IndexWriter writer =
            new IndexWriter(indexDir, analyzer, createFlag);

        Document document = new Document();
        document.add(Field.Text("fieldname", text));
        writer.addDocument(document);
        writer.close();
    }
}
Let's step through the code. Lucene stores its indices in directories on the file system. Each index is contained within a single directory, and multiple indices should not share a directory. The first parameter in IndexWriter's constructor specifies the directory where the index should be stored. The second parameter provides the implementation of Analyzer that should be used for pre-processing the text before it is indexed. This particular implementation of Analyzer eliminates stop words, converts tokens to lower case, and performs a few other small input modifications, such as eliminating periods from acronyms. The last parameter is a boolean flag that, when true, tells IndexWriter to create a new index in the specified directory, or to overwrite an index in that directory if it already exists. A value of false instructs IndexWriter to instead add Documents to an existing index.

We then create a blank Document, and add a Field called fieldname to it, with a value of the String that we want to index. Once the Document is populated, we add it to the index via the instance of IndexWriter. Finally, we close the index. This is important, as it ensures that all index changes are flushed to the disk.
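For instance, to add more Documents to the index we just built, we could open it again with the create flag set to false. The following companion class is a minimal sketch, not part of the original article, and it assumes the "index-1" directory created by LuceneIndexExample already exists:

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/**
 * LuceneAppendExample is a hypothetical companion to
 * LuceneIndexExample. It opens the existing "index-1"
 * index with createFlag set to false and appends one
 * more document to it.
 */
public class LuceneAppendExample
{
    public static void main(String[] args) throws Exception
    {
        String indexDir =
            System.getProperty("java.io.tmpdir", "tmp") +
            System.getProperty("file.separator") + "index-1";

        Analyzer analyzer = new StandardAnalyzer();

        // false: add to the existing index instead of overwriting it
        boolean createFlag = false;
        IndexWriter writer =
            new IndexWriter(indexDir, analyzer, createFlag);

        Document document = new Document();
        document.add(Field.Text("fieldname", "More text to index"));
        writer.addDocument(document);
        writer.close();
    }
}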
Analyzers
As I already mentioned, Analyzers are components that pre-process input text. They are also used when searching. Because the search string has to be processed the same way that the indexed text was processed, it is crucial to use the same Analyzer for both indexing and searching. Not using the same Analyzer will result in invalid search results.
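To illustrate, here is a minimal, hypothetical search sketch. Searching is the subject of the next article in this series, and this class is not part of the original article's code; the detail to notice is that the same StandardAnalyzer used by LuceneIndexExample is handed to Lucene's query parser:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

/**
 * LuceneSearchExample is a hypothetical sketch (not from
 * the original article) that searches the "index-1" index
 * created by LuceneIndexExample.
 */
public class LuceneSearchExample
{
    public static void main(String[] args) throws Exception
    {
        String indexDir =
            System.getProperty("java.io.tmpdir", "tmp") +
            System.getProperty("file.separator") + "index-1";

        // must match the Analyzer used when the index was built
        Analyzer analyzer = new StandardAnalyzer();

        Query query = QueryParser.parse("lucene", "fieldname", analyzer);
        IndexSearcher searcher = new IndexSearcher(indexDir);
        Hits hits = searcher.search(query);
        // should find the document indexed by LuceneIndexExample
        System.out.println("Hits: " + hits.length());
        searcher.close();
    }
}

Had we indexed with a different Analyzer, such as the PorterStemAnalyzer developed below, that same Analyzer would have to be used here as well.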
The Analyzer class is an abstract class, but Lucene comes with a few concrete Analyzers that pre-process their input in different ways. Should you need to pre-process input text and queries in a way that is not provided by any of Lucene's Analyzers, you will need to implement a custom Analyzer. If you are indexing text with non-Latin characters, for instance, you will most definitely need to do this.
In this example of a custom Analyzer, we will assume we are indexing text in English. Our PorterStemAnalyzer will perform Porter stemming on its input. As stated by its creator, the Porter stemming algorithm (or "Porter stemmer") is a process for removing the more common morphological and inflexional endings from words in English. Its main function is to be part of a term normalization process that is usually done when setting up Information Retrieval systems.

This Analyzer will use the implementation of the Porter stemming algorithm provided by Lucene's PorterStemFilter class.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;

import java.io.Reader;
import java.util.Hashtable;

/**
 * PorterStemAnalyzer processes input
 * text by stemming English words to their roots.
 * This Analyzer also converts the input to lower case
 * and removes stop words. A small set of default stop
 * words is defined in the STOP_WORDS
 * array, but a caller can specify an alternative set
 * of stop words by calling the non-default constructor.
 */
public class PorterStemAnalyzer extends Analyzer
{
    // an instance field, so each analyzer keeps its own stop table
    private Hashtable _stopTable;

    /**
     * An array containing some common English words
     * that are usually not useful for searching.
     */
    public static final String[] STOP_WORDS =
    {
        "0", "1", "2", "3", "4", "5", "6", "7", "8",
        "9", "000", "$",
        "about", "after", "all", "also", "an", "and",
        "another", "any", "are", "as", "at", "be",
        "because", "been", "before", "being", "between",
        "both", "but", "by", "came", "can", "come",
        "could", "did", "do", "does", "each", "else",
        "for", "from", "get", "got", "has", "had",
        "he", "have", "her", "here", "him", "himself",
        "his", "how", "if", "in", "into", "is", "it",
        "its", "just", "like", "make", "many", "me",
        "might", "more", "most", "much", "must", "my",
        "never", "now", "of", "on", "only", "or",
        "other", "our", "out", "over", "re", "said",
        "same", "see", "should", "since", "so", "some",
        "still", "such", "take", "than", "that", "the",
        "their", "them", "then", "there", "these",
        "they", "this", "those", "through", "to", "too",
        "under", "up", "use", "very", "want", "was",
        "way", "we", "well", "were", "what", "when",
        "where", "which", "while", "who", "will",
        "with", "would", "you", "your",
        "a", "b", "c", "d", "e", "f", "g", "h", "i",
        "j", "k", "l", "m", "n", "o", "p", "q", "r",
        "s", "t", "u", "v", "w", "x", "y", "z"
    };

    /**
     * Builds an analyzer that uses the default stop words.
     */
    public PorterStemAnalyzer()
    {
        this(STOP_WORDS);
    }

    /**
     * Builds an analyzer with the given stop words.
     *
     * @param stopWords a String array of stop words
     */
    public PorterStemAnalyzer(String[] stopWords)
    {
        _stopTable = StopFilter.makeStopTable(stopWords);
    }

    /**
     * Processes the input by first converting it to
     * lower case, then by eliminating stop words, and
     * finally by performing Porter stemming on it.
     *
     * @param reader the Reader that
     *               provides access to the input text
     * @return an instance of TokenStream
     */
    public final TokenStream tokenStream(Reader reader)
    {
        return new PorterStemFilter(
            new StopFilter(new LowerCaseTokenizer(reader),
                           _stopTable));
    }
}
The tokenStream(Reader) method is the core of the PorterStemAnalyzer. It lower-cases the input, eliminates stop words, and uses the PorterStemFilter to remove common morphological and inflexional endings. This class includes only a small set of stop words for English. When using Lucene in a production system for indexing and searching text in English, I suggest that you use a more complete list of stop words than the small set shown here.
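To see the stemmer's effect, we can reuse the token-printing loop from the AnalyzerDemo sketch above, this time with PorterStemAnalyzer. Again, this is a hypothetical snippet, not part of the original article, and it assumes the same imports as AnalyzerDemo:

PorterStemAnalyzer analyzer = new PorterStemAnalyzer();
TokenStream stream = analyzer.tokenStream(
    new StringReader("The books were indexed quickly"));

for (Token token = stream.next(); token != null; token = stream.next())
{
    System.out.println(token.termText());
}
// Prints "book", "index", and "quickli": "The" and "were" are
// dropped as stop words, and the Porter stemmer reduces each
// remaining token to its stem.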
To use our new PorterStemAnalyzer class, we need to modify a single line of our LuceneIndexExample class shown above, to instantiate PorterStemAnalyzer instead of StandardAnalyzer:
Old line:
Analyzer analyzer = new StandardAnalyzer();
New line:
Analyzer analyzer = new PorterStemAnalyzer();
The rest of the code remains unchanged. Anything indexed after this change will pass through the Porter stemmer. The process of text indexing with PorterStemAnalyzer is depicted in Figure 1.

Figure 1: The indexing process with PorterStemAnalyzer.
Because different Analyzers process their text input differently, note again that changing the Analyzer for an existing index is dangerous. It will result in erroneous search results later, in the same way that using different Analyzers for indexing and searching will produce invalid results.
Fields
Lucene offers four different types of fields from which a developer can choose: Keyword, UnIndexed, UnStored, and Text. Which field type you should use depends on how you want to use that field and its values.

Keyword fields are not tokenized, but are indexed and stored in the index verbatim. This field type is suitable for fields whose original value should be preserved in its entirety, such as URLs, dates, personal names, Social Security numbers, telephone numbers, etc.

UnIndexed fields are neither tokenized nor indexed, but their value is stored in the index word for word. This field type is suitable for fields that you need to display with search results, but whose values you will never search directly; because fields of this type are not indexed, they cannot be searched. Since the original value of a field of this type is stored in the index, this type is not suitable for storing fields with very large values, if index size is an issue.

UnStored fields are the opposite of UnIndexed fields. Fields of this type are tokenized and indexed, but are not stored in the index. This field type is suitable for indexing large amounts of text that does not need to be retrieved in its original form, such as the bodies of Web pages, or any other type of text document.

Text fields are tokenized, indexed, and stored in the index. This implies that fields of this type can be searched, but be cautious about the size of anything stored as a Text field.
If you look back at the LuceneIndexExample class, you will see that I used a Text field:

document.add(Field.Text("fieldname", text));

If we wanted to change the type of the field "fieldname," we would call one of the other methods of class Field:

document.add(Field.Keyword("fieldname", text));

or

document.add(Field.UnIndexed("fieldname", text));

or

document.add(Field.UnStored("fieldname", text));
Although the Field.Text, Field.Keyword, Field.UnIndexed, and Field.UnStored calls may at first look like calls to constructors, they are really just calls to different Field class methods. Table 1 summarizes the different field types.
Table 1: An overview of different field types.
Field method/type                Tokenized   Indexed   Stored
Field.Keyword(String, String)    No          Yes       Yes
Field.UnIndexed(String, String)  No          No        Yes
Field.UnStored(String, String)   Yes         Yes       No
Field.Text(String, String)       Yes         Yes       Yes
Field.Text(String, Reader)       Yes         Yes       No
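Putting it all together, a Document describing a Web page might mix several field types. The field names and values in this hypothetical snippet are made up for illustration (it assumes the same imports as LuceneIndexExample); the type choices follow Table 1:

Document document = new Document();

// URL: preserved verbatim and searchable as a single token
document.add(Field.Keyword("url", "http://www.example.com/"));

// last-modified date: displayed with results, never searched
document.add(Field.UnIndexed("modified", "2003-01-15"));

// title: searchable and also retrievable for display
document.add(Field.Text("title", "An example Web page"));

// body: searchable, but too large to store in the index
document.add(Field.UnStored("body", "The full text of the page..."));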
In this article, we have learned about adding basic text indexing capabilities to your applications using IndexWriter and its associated classes. We have also developed a custom Analyzer that can perform Porter stemming on its input. Finally, we have looked at different field types and learned what each of them can be used for. In the next article of this Lucene series, we shall look at indexing in more depth, and address issues such as performance and multi-threading.
Source:
http://onjava.com/pub/a/onjava/2003/01/15/lucene.html