Friday, February 17, 2012

Document Search for CMS

I have a database which stores documents (e.g. policies and guidelines)
in our content management system. The follow relationships exist:

Documents (1-to-Many) LinkTableKeywordLink (Many-to-1) Keywords

Documents (1-to-Many) LinkTableAttachments (Many-to-1) Attachments

Documents (1-to-Many) LookUpSubjects

Documents (1-to-Many) LookUpDocType

When the user conducts a search I want the SQL to check if the string
they enter is present in:

* the Title, Author, Summary fields of Documents OR
* the Title or Path of the Attachments
* the Keywords that are links OR
* the Subject that is linked OR
* the DocType that is linked.

Is this possible in one SQL query. I have tried the following:

SELECT DISTINCT Documents.Document, Documents.Title,Priority

FROM Documents,
Attachments,Keywords,LinkTableAttachments,LinkTabl eKeywordLink

WHERE
((Documents.Document=LinkTableAttachments.Document ID and
LinkTableAttachments.AttachmentID=Attachments.Id) OR
(Documents.Document=LinkTableKeywordLink.DocumentI D and
LinkTableKeywordLink.KeywordID=Keywords.Id))

AND (Documents.Title Like '%SEARCHSTRING%' OR Documents.Author Like
'%SEARCHSTRING%' OR Documents.Summary Like '%SEARCHSTRING%' OR
Keywords.Keyword Like '%SEARCHSTRING%' OR DocType.DocType Like
'%SEARCHSTRING%' OR Attachments.AttachmentTitle Like '%SEARCHSTRING%'
OR Attachments.Path like '%SEARCHSTRING%' or Subjects.Subject Like
'%SEARCHSTRING%')

AND StartDate<=getDate() and ReviewDate>getDate() order by Title

but this causes, perhaps understandably, a timeout error.

Any thoughts?

Thanks!
Steve(StevePBurgess@.gmail.com) writes:

Quote:

Originally Posted by

Is this possible in one SQL query. I have tried the following:
>
SELECT DISTINCT Documents.Document, Documents.Title,Priority
>
FROM Documents,
Attachments,Keywords,LinkTableAttachments,LinkTabl eKeywordLink
>
WHERE
((Documents.Document=LinkTableAttachments.Document ID and
LinkTableAttachments.AttachmentID=Attachments.Id) OR
(Documents.Document=LinkTableKeywordLink.DocumentI D and
LinkTableKeywordLink.KeywordID=Keywords.Id))
>
AND (Documents.Title Like '%SEARCHSTRING%' OR Documents.Author Like
'%SEARCHSTRING%' OR Documents.Summary Like '%SEARCHSTRING%' OR
Keywords.Keyword Like '%SEARCHSTRING%' OR DocType.DocType Like
'%SEARCHSTRING%' OR Attachments.AttachmentTitle Like '%SEARCHSTRING%'
OR Attachments.Path like '%SEARCHSTRING%' or Subjects.Subject Like
'%SEARCHSTRING%')
>
AND StartDate<=getDate() and ReviewDate>getDate() order by Title
>
>
but this causes, perhaps understandably, a timeout error.


The most crucial is probably the date conditions. This is a difficult
one there are two dates. But a start at least is make sure that the
query is parameterised, so the last line:

AND StartDate <= @.now and ReviewDate @.now order by Title

And with parameter I don't mean that you do "SELECT @.now = getdate()"
just before the query. You need it to pass @.now as a parameter to a
stored procedure or a parameterised query. Then the optimzer will
look at the value to determine how selective an index might be and
use it. If you use an unknown value, the optimizer will assume that 30%
of the rows will match, which is far too many,

If the search columns should are indexed, you should distinguish between
"starts with" and "contains" for the indexes to be useful. That is, only
include % first the search string, if the user actually wants "conatins",
because indexes are only good for starting values.

Adjust the application to have no timeout (= 0) or a very long timeout,
because in some cases the user may need to do searches that take a very long
time.

Finally, drop the DISTINCT, and use the EXISTS predicate to get from
the subtables. That is more efficient,since there no need to sort
the result to weed out the duplicates.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se
Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx|||Hi - and thanks for the feedback.

If I run the query with against just one of the 1-many-many-1 table
relationships it works perfectly. The issue, I think, is having two
OR-ed together. When running against one set (e.g. just keywords and
not attachments) the search is very quick.

Best wishes,
Steve

Erland Sommarskog wrote:

Quote:

Originally Posted by

>
The most crucial is probably the date conditions. This is a difficult
one there are two dates. But a start at least is make sure that the
query is parameterised, so the last line:
>
AND StartDate <= @.now and ReviewDate @.now order by Title
>
And with parameter I don't mean that you do "SELECT @.now = getdate()"
just before the query. You need it to pass @.now as a parameter to a
stored procedure or a parameterised query. Then the optimzer will
look at the value to determine how selective an index might be and
use it. If you use an unknown value, the optimizer will assume that 30%
of the rows will match, which is far too many,
>
>
If the search columns should are indexed, you should distinguish between
"starts with" and "contains" for the indexes to be useful. That is, only
include % first the search string, if the user actually wants "conatins",
because indexes are only good for starting values.
>
Adjust the application to have no timeout (= 0) or a very long timeout,
because in some cases the user may need to do searches that take a very long
time.
>
Finally, drop the DISTINCT, and use the EXISTS predicate to get from
the subtables. That is more efficient,since there no need to sort
the result to weed out the duplicates.
>
>
--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se
>
Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx

|||(StevePBurgess@.gmail.com) writes:

Quote:

Originally Posted by

Hi - and thanks for the feedback.
>
If I run the query with against just one of the 1-many-many-1 table
relationships it works perfectly. The issue, I think, is having two
OR-ed together. When running against one set (e.g. just keywords and
not attachments) the search is very quick.


The more tables you throw into the mix, the more tables there are to search.

If you want me or anyone else to say somehting more useful, we would need to
see:

o CREATE TABLE and INDEX statements for the tables.
o The fast query (and repeat the slow while you are it).
o The output from SET STATISTICS PROFILE ON when you run the two from
a query tool. This output is very wide, so put in an attachment if
possible.
o Which version of SQL Server you are running.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se
Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx|||

Quote:

Originally Posted by

If you want me or anyone else to say somehting more useful, we would need to
see:
>
o CREATE TABLE and INDEX statements for the tables.
o The fast query (and repeat the slow while you are it).
o The output from SET STATISTICS PROFILE ON when you run the two from
a query tool. This output is very wide, so put in an attachment if
possible.
o Which version of SQL Server you are running.
>


Hi. Thanks. The question was really about whether the SQL is valid
(given the relationships that are defined) and/or whether it could be
written more efficiently.

No comments:

Post a Comment