Friday, February 24, 2012

Documents in VLDB

Hello All,

I have been experimenting with SQL Server 2005 partitions. I loaded a terabyte of data into two tables: the first holds the document information and the second holds the actual binary document (in this case, PDF). Most of the documents are about 1 MB in size, but the largest is 212 MB.

SQL Server has no problem storing the blobs. The problem occurs when I attempt to get the data.

I ran some quick tests to see how fast I could pull the documents out. The largest took about 24 seconds; the 1 MB documents are sub-second.

Here is how the 212 meg doc breaks down:

Time to load datatable: 18.79 seconds

Time to load byte array: 3.84 seconds

Time to Write and open document: 0.01 seconds

If I access the file from a file server, the time is 0.04 seconds to begin showing the document.

As you can see, the longest time period is retrieving the data from SQL Server, and it is much slower than launching the file from disk across the network. (Note: the SQL Server and the file server used in the test sit next to each other.)

My question is, how can I speed up the access from SQL Server? I believe the keys are "partition aligned". Any suggestions would be appreciated.

I will add the table definitions and partition information as a reply since only 5000 chars are allowed in the post.

Here are the table definitions. tblDocInfo has DocInfoID, which is the foreign key in tblDocs; the partitions are defined on this key.

Info table:

CREATE TABLE [dbo].[tblDocInfo](

[DocInfoID] [decimal](18, 0) IDENTITY(1,1) NOT NULL,

[DocName] [nvarchar](150) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

[DocSize] [decimal](18, 0) NULL,

[FKID] [decimal](18, 0) NULL,

[FKTableName] [nvarchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

[FKDBName] [nvarchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

[FKDBServer] [nvarchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

[OriginalLocationPath] [nvarchar](500) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

[CreationDate] [datetime] NOT NULL CONSTRAINT [DF_tblDocInfo_CreationDate] DEFAULT (getdate()),

[UserID] [nvarchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

[Computer] [nvarchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

[LastModifiedDate] [datetime] NULL CONSTRAINT [DF_tblDocInfo_LastModifiedDate] DEFAULT (getdate()),

CONSTRAINT [PK_tblDocInfo] PRIMARY KEY CLUSTERED

(

[DocInfoID] ASC

)WITH (PAD_INDEX = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]

) ON [PRIMARY]

GO

ALTER TABLE [dbo].[tblDocInfo] WITH CHECK ADD CONSTRAINT [FK_tblDocInfo_tblDocInfo] FOREIGN KEY([DocInfoID])

REFERENCES [dbo].[tblDocInfo] ([DocInfoID])

GO

ALTER TABLE [dbo].[tblDocInfo] CHECK CONSTRAINT [FK_tblDocInfo_tblDocInfo]

Binary Table:

CREATE TABLE [dbo].[tblDocs](

[DocID] [decimal](18, 0) IDENTITY(1,1) NOT NULL,

[DocInfoID] [decimal](18, 0) NOT NULL,

[Doc] [image] NULL,

CONSTRAINT [PK_tblDocs] PRIMARY KEY CLUSTERED

(

[DocInfoID] ASC

)WITH (PAD_INDEX = OFF, IGNORE_DUP_KEY = OFF) ON [partscheme]([DocInfoID])

) ON [partscheme]([DocInfoID])

GO

ALTER TABLE [dbo].[tblDocs] WITH CHECK ADD CONSTRAINT [FK_tblDocs_tblDocInfo] FOREIGN KEY([DocInfoID])

REFERENCES [dbo].[tblDocInfo] ([DocInfoID])

GO

ALTER TABLE [dbo].[tblDocs] CHECK CONSTRAINT [FK_tblDocs_tblDocInfo]

|||

Here are the filegroups:
Name FileID FileName FileGroup Size Maxsize Growth Usage
Doc_main 1 e:\mssqldata\Doc.mdf PRIMARY 116736 KB Unlimited 1024 KB data only
Doc_Log 2 f:\MSSQLDataLog\Doc_log.ndf NULL 2376256 KB 2147483648 KB 10% log only
FG_default 3 e:\mssqldata\fg.ndf FG 5120 KB Unlimited 1024 KB data only
FG1_dat 4 e:\mssqldata\fg1.ndf FG1 136236032 KB Unlimited 1024 KB data only
FG2_dat 5 e:\mssqldata\fg2.ndf FG2 110420992 KB Unlimited 1024 KB data only
FG3_dat 6 e:\mssqldata\fg3.ndf FG3 110420992 KB Unlimited 1024 KB data only
FG4_dat 7 e:\mssqldata\fg4.ndf FG4 110420992 KB Unlimited 1024 KB data only
FG5_dat 8 e:\mssqldata\fg5.ndf FG5 110420992 KB Unlimited 1024 KB data only
FG6_dat 9 e:\mssqldata\fg6.ndf FG6 110420992 KB Unlimited 1024 KB data only
FG7_dat 10 e:\mssqldata\fg7.ndf FG7 110420992 KB Unlimited 1024 KB data only
FG8_dat 11 e:\mssqldata\fg8.ndf FG8 103505920 KB Unlimited 1024 KB data only
FG9_dat 12 e:\mssqldata\fg9.ndf FG9 25600 KB Unlimited 1024 KB data only
FG10_dat 13 e:\mssqldata\fg10.ndf FG10 25600 KB Unlimited 1024 KB data only
FG11_dat 14 e:\mssqldata\fg11.ndf FG11 25600 KB Unlimited 1024 KB data only
FG12_dat 15 e:\mssqldata\fg12.ndf FG12 25600 KB Unlimited 1024 KB data only
FG13_dat 16 e:\mssqldata\fg13.ndf FG13 25600 KB Unlimited 1024 KB data only
FG14_dat 17 e:\mssqldata\fg14.ndf FG14 25600 KB Unlimited 1024 KB data only
FG15_dat 18 e:\mssqldata\fg15.ndf FG15 25600 KB Unlimited 1024 KB data only
FG16_dat 19 e:\mssqldata\fg16.ndf FG16 25600 KB Unlimited 1024 KB data only
FG17_dat 20 e:\mssqldata\fg17.ndf FG17 25600 KB Unlimited 1024 KB data only
FG18_dat 21 e:\mssqldata\fg18.ndf FG18 25600 KB Unlimited 1024 KB data only
FG19_dat 22 e:\mssqldata\fg19.ndf FG19 25600 KB Unlimited 1024 KB data only
FG20_dat 23 e:\mssqldata\fg20.ndf FG20 25600 KB Unlimited 1024 KB data only
FG21_dat 24 e:\mssqldata\fg21.ndf FG21 25600 KB Unlimited 1024 KB data only
FG22_dat 25 e:\mssqldata\fg22.ndf FG22 25600 KB Unlimited 1024 KB data only
FG23_dat 26 e:\mssqldata\fg23.ndf FG23 25600 KB Unlimited 1024 KB data only
FG24_dat 27 e:\mssqldata\fg24.ndf FG24 25600 KB Unlimited 1024 KB data only
FG25_dat 28 e:\mssqldata\fg25.ndf FG25 25600 KB Unlimited 1024 KB data only
FG26_dat 29 e:\mssqldata\fg26.ndf FG26 25600 KB Unlimited 1024 KB data only
FG27_dat 30 e:\mssqldata\fg27.ndf FG27 25600 KB Unlimited 1024 KB data only
FG28_dat 31 e:\mssqldata\fg28.ndf FG28 25600 KB Unlimited 1024 KB data only
FG29_dat 32 e:\mssqldata\fg29.ndf FG29 25600 KB Unlimited 1024 KB data only
FG30_dat 33 e:\mssqldata\fg30.ndf FG30 25600 KB Unlimited 1024 KB data only
FG31_dat 34 e:\mssqldata\fg31.ndf FG31 25600 KB Unlimited 1024 KB data only
FG32_dat 35 e:\mssqldata\fg32.ndf FG32 25600 KB Unlimited 1024 KB data only
Here is how the data is spread across the partitions (only 8 partitions are in use):
Partition No Min ID Max ID Rows in Partition
1 1 100000 99997
2 100001 200000 100000
3 200001 300000 100000
4 300001 400000 100000
5 400001 500000 100000
6 500001 600000 100000
7 600001 700000 100000
8 700001 793738 93737
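
(The partition function and scheme behind [partscheme] were not posted. A minimal T-SQL sketch consistent with the boundaries above would be the following; the function name, the RANGE direction, and the exact filegroup mapping are assumptions:)

CREATE PARTITION FUNCTION pfDocInfoID (decimal(18,0))
AS RANGE LEFT FOR VALUES (100000, 200000, 300000, 400000, 500000, 600000, 700000
    /* ...presumably more boundaries at 100,000 intervals, one fewer than the filegroup count */)
GO
CREATE PARTITION SCHEME [partscheme]
AS PARTITION pfDocInfoID
TO ([FG1], [FG2], [FG3], [FG4], [FG5], [FG6], [FG7], [FG8] /* ...through FG32 */)
GO

|||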

Ike,

Could you provide two more pieces of information:

1) The query that you use to retrieve the data

2) Is this 24 seconds the time it takes on the first try or on subsequent tries?

Thanks,

Humberto Acevedo

SQL Server team

|||

First idea: move those filegroups onto other drives (I see only e:\mssqldata\ in use) to see whether anything changes.

Second: why don't you create a partition function and scheme based on [DocSize] (after you add that column to tblDocs)? You could put the big files on a filegroup located on a different disk; see the sketch below.
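
A minimal sketch of what I mean (the names and the 10 MB boundary are just examples):

CREATE PARTITION FUNCTION pfDocSize (decimal(18,0))
AS RANGE LEFT FOR VALUES (10485760)   -- partition 1: docs <= 10 MB; partition 2: bigger docs
GO
CREATE PARTITION SCHEME psDocSize
AS PARTITION pfDocSize
TO ([FG1], [FG2])   -- put the big-document filegroup on its own disk
GO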

|||

Hi Humberto,

The 24 seconds was on subsequent tries; the time varied between 24 and 30 seconds.

Here is the code in the VB app:

Dim starttime As DateTime
Dim interimTime As DateTime
Dim endtime As DateTime

starttime = Now
interimTime = Now
endtime = Now

Dim msg As String = ""
Dim dt As DataTable

' Open the connection (the connection string comes from a textbox on the form).
Dim DBConn As New System.Data.SqlClient.SqlConnection
DBConn.ConnectionString = Me.TextBox3.Text
DBConn.Open()

endtime = Now
msg = msg & "Time to open connection: " & ((endtime - interimTime).TotalSeconds).ToString & " seconds" & vbCrLf
interimTime = Now

' Build the query. ID is numeric here, but a parameterized query would be safer.
Dim cmSql As New SqlClient.SqlCommand
cmSql.Connection = DBConn
cmSql.CommandType = CommandType.Text
cmSql.CommandText = "select doc from tbldocs where docinfoid = " & ID.ToString
cmSql.CommandTimeout = 90

' Pull the whole blob into a DataSet/DataTable in a single Fill call.
Dim ds As New DataSet
Dim MyAdapter As New SqlClient.SqlDataAdapter(cmSql)

Try
    MyAdapter.Fill(ds)
    dt = ds.Tables(0)
    endtime = Now
    msg = msg & "Time to load datatable: " & ((endtime - interimTime).TotalSeconds).ToString & " seconds" & vbCrLf
    interimTime = Now
Catch ex As Exception
    MsgBox(ex.Message)
End Try

' Guard against a failed Fill (dt would still be Nothing) as well as an empty result.
If dt Is Nothing OrElse dt.Rows.Count = 0 Then
    dt = Nothing
    'DAL = Nothing
    Exit Sub
End If

' Extract the byte array and write it to disk. Note that this timed span
' covers both the extraction and the file write.
Dim ba As Byte() = CType(dt.Rows(0).Item(0), Byte())
Dim DocName As String = "\doc" & ID & ".pdf"
My.Computer.FileSystem.WriteAllBytes(My.Application.Info.DirectoryPath & DocName, ba, False)

endtime = Now
msg = msg & "Time to Load Byte Array: " & ((endtime - interimTime).TotalSeconds).ToString & " seconds" & vbCrLf
interimTime = Now

' Launch the document in its associated viewer.
Diagnostics.Process.Start(My.Application.Info.DirectoryPath & DocName)

endtime = Now
msg = msg & "Time to open document: " & ((endtime - interimTime).TotalSeconds).ToString & " seconds" & vbCrLf
interimTime = Now

dt = Nothing
'DAL = Nothing
DBConn.Close()

msg = msg & "Total Time: " & ((endtime - starttime).TotalSeconds).ToString & " seconds" & vbCrLf
MsgBox(msg)

|||

Hi ggciubuc,

Thanks for the post.

I don't really have other drives we can use; the RAID 5 array plus other data takes up what we have, at least for this test. Do you know whether spreading out across drives would have a drastic impact? The file-system fetch was sub-second, leading me to believe that the disks are not the issue (at least for raw I/O).

To further explain our intended use: the inputs will be fairly low in number (100 to 500 per day at most), but the reads will be up to 10x that.

Finally, can you explain why repartitioning to group by size would cause retrieval to speed up?

Thanks again,

Ike

|||

Yes, the RAID setup leads me, too, to think that the disk system is not the issue.

I quote from Books Online (look for "image data type, data management")

"The database APIs follow a common pattern in the ways they handle long ntext, text, and image columns:

To read a long column, the application simply includes the ntext, text, or image column in a select list, and then binds the column to a program variable large enough to hold a reasonable block of the data. The application then executes the statement and uses an API function or method to retrieve the data into the bound variable one block at a time."
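
In client code, that pattern means reading the column with a SqlDataReader opened with CommandBehavior.SequentialAccess and pulling the bytes out in blocks with GetBytes, instead of filling a DataTable with the whole row. A minimal sketch (the connection string, ID, and output path are placeholders):

Imports System.Data
Imports System.Data.SqlClient
Imports System.IO

Module BlobFetch
    Sub FetchDoc(ByVal connStr As String, ByVal id As Integer, ByVal path As String)
        Using conn As New SqlConnection(connStr)
            conn.Open()
            Using cmd As New SqlCommand("select doc from tbldocs where docinfoid = @id", conn)
                cmd.Parameters.AddWithValue("@id", id)
                ' SequentialAccess streams the column; the 200 MB blob is
                ' never buffered whole in a DataTable row.
                Using rdr As SqlDataReader = cmd.ExecuteReader(CommandBehavior.SequentialAccess)
                    If rdr.Read() Then
                        Dim buffer(81919) As Byte   ' read in 80 KB blocks
                        Dim offset As Long = 0
                        Using fs As New FileStream(path, FileMode.Create)
                            Do
                                Dim read As Long = rdr.GetBytes(0, offset, buffer, 0, buffer.Length)
                                If read = 0 Then Exit Do
                                fs.Write(buffer, 0, CInt(read))
                                offset += read
                            Loop
                        End Using
                    End If
                End Using
            End Using
        End Using
    End Sub
End Module

|||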

Hi ggciubuc,

I am struggling to understand all of the ramifications of how I am accessing the data, so please bear with me...

I am assuming that

1. My VB.NET SqlCommand object works with the database APIs to retrieve the data block by block (see my VB code post)

2. The index points into the binary table, so in one read the APIs begin accessing the blob by blocks

Isn't an individual document limited to one partition and one file in a filegroup? And aren't the blob's "blocks" also contiguous within that file, so that the reads (by the API) bring in the blocks in the order they sit on disk?

If that is the case, it would seem that grouping the larger blobs together in a partition, versus grouping them by DocInfoID, would not matter if the seek to find the beginning of the blob is a single read.

Please let me know if there is a flaw in my reasoning. I hesitate to reload the data because of the amount of time it took, and I am not sure how else I could repartition without reloading.

And, thanks again for your time...

|||

I would set up a test scenario:

- create a test database with those two tables

- copy in only, let's say, 20-30 docs of 1 MB and 2-3 docs of 200 MB

- create a partition function and scheme based on DocSize

- test reading/writing

|||

That I can do. What about my earlier reply? Any comments?

|||

As Books Online says (look for "image data type"): "Data in an image data type is stored as a string of bits and is not interpreted by SQL Server. Any interpretation of the data in an image column must be made by the application."

and "The data row contains only a 16-byte text pointer that points to the root node of a tree built of internal pointers that map the pages in which the string fragments are stored".

You say something equivalent, but I still think that grouping files by size can lead to more speed.

In the end, you have to try it and measure!
