Sunday, February 19, 2012

Documentation about noise words (ignored words)

As my customers are mainly non-English-speaking, I usually prepare specific
versions of the "noise.dat" file for all indexing done with the neutral language.
However, even though I am not including the paragraph symbol (§), I have
realized that this character is automatically treated as an ignored word.
My questions are:
- Is there a way to control these ignored characters, possibly by excluding
them from or including them in the full-text catalog?
- Where can I find detailed documentation of all the
language-independent characters that are ignored?
Thanks in advance,
Angelo
There is no real description of the noise word lists or the treatment of
special characters. In general all punctuation marks are ignored with some
exceptions.
Here is a good description of how noise words are indexed.
http://msdn.microsoft.com/library/de...nario_8k4z.asp
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html
|||Hi Hilary,
Your answer addressed my question but did not resolve the issue.
The issue is that our customers are government authorities, which make
heavy use of § (the paragraph character) in the documents MSSearch indexes.
Because this character is a prefix for law numbers, as in "the article expressed in
§ 12 of law # 2340-124", they expect that typing "§ 12" in the search
form returns only the documents containing "§ 12", not every document
containing "12"; otherwise they
get thousands of documents that have nothing to do with "paragraph 12".
I wonder why MS has chosen not to index these characters. If they exist,
they are also used in documents, and they should be searchable.
The noise.dat file should make it possible to choose whether these characters
are treated as ignored words/characters or not.
It's a serious issue...
Are you sure that it's not possible to include/exclude these characters? Is
there a key in the Registry, maybe?
Thanks in advance,
Angelo
|||I realize I was not of much help.
Someone approached me some time ago about doing something similar. What I
recommended was to replace every unindexable token with another
token that is indexed, e.g. xxxx. Then, when a user searched for the
unindexable character, the client would replace it with xxxx in the query, and
they would get the results they were looking for.
They had two columns storing the content: one in which the unindexable
character was replaced by a searchable token (xxxx), and another which
contained the actual content. The column with the replacement was the one
that was full-text indexed; the other was the one returned in searches.
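As a rough illustration of the replacement approach described above, here is a minimal Python sketch. The names `PLACEHOLDER`, `to_indexable`, `make_row`, and `rewrite_query` are hypothetical and are not part of MSSearch or SQL Server. The sketch assumes the placeholder never occurs in real content and that the word breaker indexes it as an ordinary word.

```python
import re

SECTION_CHAR = "\u00a7"  # the paragraph/section character
PLACEHOLDER = "xxxx"     # hypothetical token; assumed to be indexable and unused elsewhere


def to_indexable(text: str) -> str:
    """Return a copy of text with each section character replaced by the placeholder."""
    # Pad with spaces so the placeholder becomes a separate word,
    # then collapse whitespace runs (only the indexed copy is altered).
    replaced = text.replace(SECTION_CHAR, f" {PLACEHOLDER} ")
    return re.sub(r"\s+", " ", replaced).strip()


def make_row(content: str) -> dict:
    """Build the two stored values: the original and the copy to be full-text indexed."""
    return {"content": content, "content_indexed": to_indexable(content)}


def rewrite_query(user_query: str) -> str:
    """Apply the same substitution to the search phrase typed by the user."""
    return to_indexable(user_query)
```

The client would pass the result of `rewrite_query` as the phrase to search in the indexed column, and display the untouched `content` value from the matching rows.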
HTH
Hilary Cotter
