Ideas for creating a document management system

The client needs a document management system, and I am collecting information about it.

I know about SharePoint and Alfresco, but here I am evaluating what it takes to build one from scratch, so please refrain from suggesting either of them (we are evaluating them separately; this question is about developing a system, not deploying an existing solution).

These are the requirements:

  • There is a very specific requirement for the legal management of documents that relates to our local government, but beyond that:
  • Operation similar to Google Docs from an end-user perspective
  • Needs to store information from 200+ end users (UPDATE: it is really 700 end users).
  • Mostly Office documents, PDF, and text. I already have plain-text extraction from these binaries.
  • No wiki, no portal creation, only a workflow, and a very simple one: it is just file management.
  • Central repository, company-wide sharing integrated with Active Directory
  • Quick search
  • Transparent desktop integration
  • Web interface
  • Multiplatform, if possible

So this is what I have in mind:

  • Storage: I know that SharePoint stores everything in the database (Alfresco too?). This is a nightmare, IMHO. I prefer to put the metadata in the database and the files on disk.

I am thinking about using ZFS here and relying on its features for versioning, snapshots, and scaling. Or maybe using git as the storage layer (would git work well?).

So, where can I learn more about how to handle a large pool of documents, on ZFS or on any regular file system? For example, how to lay out the folder structure so that it simplifies management, fast lookups, simple backups, etc.

  • Metadata: I think a regular database belongs here, but I wonder whether there are more benefits in putting everything in Lucene (I have some experience with Lucene, but I worry because Lucene indexes cannot be joined, right?).

If I use the search engine as the metadata store, I save some work (no second pass for indexing is needed), but a regular database is the more standard mechanism.

  • Tech: I will probably build this with Django, PyLucene, and PostgreSQL, and do the shell integration for Windows (I have no problem with that).
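
One possible reading of the "metadata in the database, files on disk" idea from the Storage point, as a minimal sketch and not a recommendation: name each file by the SHA-256 of its contents and shard the directory tree by the first characters of the digest, so that no single folder grows too large. The function names (`blob_path`, `store_blob`, `load_blob`) and the two-level `ab/cd/` layout are assumptions made for illustration only.

```python
import hashlib
import tempfile
from pathlib import Path


def blob_path(root: Path, digest: str) -> Path:
    """Map a hex digest to a sharded path: root/ab/cd/abcd..."""
    return root / digest[:2] / digest[2:4] / digest


def store_blob(root: Path, data: bytes) -> str:
    """Write data under its SHA-256 digest and return the digest.

    Identical content maps to the same path, so it is stored only once.
    """
    digest = hashlib.sha256(data).hexdigest()
    target = blob_path(root, digest)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        # Write under a temporary name, then rename, so a reader
        # never sees a half-written file.
        tmp = target.with_name(target.name + ".tmp")
        tmp.write_bytes(data)
        tmp.replace(target)
    return digest


def load_blob(root: Path, digest: str) -> bytes:
    """Read back the bytes stored under the given digest."""
    return blob_path(root, digest).read_bytes()


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    key = store_blob(root, b"first draft of the contract")
    print(blob_path(root, key).relative_to(root))
    print(load_blob(root, key))
```

The database would then keep only the digest (plus file name, owner, and so on) per document version. Snapshots and backups stay a file-system concern, which is where ZFS could come in.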

I would appreciate any hints or information on how to implement this solution properly.

+4
3 answers

Personally, I think the requirements "similar to Google Docs" and "Transparent desktop integration" are a bit vague, IMHO. But judging by the question, you seem to be more concerned about the back end and the document repository, and about using a more open-source stack (with AD integration)?

In any case, I personally use KnowledgeTree as our document management system. In its implementation, all the files live in a directory on the file system, and the database tracks the path, the corresponding metadata, the access logs, and the version information. Basically, it keeps several copies of the same file, one per version, whenever a document is updated, which in my opinion is a fairly realistic approach, given that Microsoft Office documents are mostly binary (up to Office 2003).
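
The layout described above (files on disk, a database tracking paths and versions) could be sketched roughly as follows. This is not KnowledgeTree's actual schema; the table and column names are hypothetical, access logs are left out, and `sqlite3` is used only so that the example runs on its own (the question mentions PostgreSQL).

```python
import sqlite3

# Hypothetical schema: one row per document, one row per stored version.
SCHEMA = """
CREATE TABLE document (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    owner TEXT NOT NULL
);
CREATE TABLE version (
    document_id INTEGER NOT NULL REFERENCES document(id),
    number      INTEGER NOT NULL,
    path        TEXT NOT NULL,
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (document_id, number)
);
"""


def open_db(location=":memory:"):
    """Open a database and create the tables."""
    conn = sqlite3.connect(location)
    conn.executescript(SCHEMA)
    return conn


def add_document(conn, title, owner):
    """Register a document and return its id."""
    cur = conn.execute(
        "INSERT INTO document (title, owner) VALUES (?, ?)", (title, owner)
    )
    conn.commit()
    return cur.lastrowid


def add_version(conn, document_id, path):
    """Record a new file on disk as the next version; return its number."""
    row = conn.execute(
        "SELECT COALESCE(MAX(number), 0) FROM version WHERE document_id = ?",
        (document_id,),
    ).fetchone()
    number = row[0] + 1
    conn.execute(
        "INSERT INTO version (document_id, number, path) VALUES (?, ?, ?)",
        (document_id, number, path),
    )
    conn.commit()
    return number


def latest_path(conn, document_id):
    """Return the path of the newest version, or None if there is none."""
    row = conn.execute(
        "SELECT path FROM version WHERE document_id = ? "
        "ORDER BY number DESC LIMIT 1",
        (document_id,),
    ).fetchone()
    return row[0] if row else None
```

Old versions are never overwritten here: each update adds a row pointing at a new file, so rolling back is just a matter of reading an earlier row.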

You may want to find out how many documents they currently have and how many they are going to put into this system daily. (Or, from another angle, the kinds of documents they plan to store usually hint at what load your server has to handle.)

My guess is that you can most likely get away with local file systems plus a database that stores the metadata, unless you are sure that the system will have to handle a huge load of documents on a daily basis (imagine Flickr for documents ;)).

+1
  • SharePoint and Alfresco are platforms on which you can do customization, so even using them really means that you are developing something.

  • SharePoint stores BLOBs in the database by default, but has ways to keep them on the file system.

  • If you do this yourself, support the FrontPage extensions that Office applications use to talk to SharePoint and Alfresco, and serve documents with the correct headers that tell IE to launch the application. This way you get the same integration with the Office applications that SharePoint has (users really love this feature); it is just a simple protocol over HTTP.

  • If you build on SharePoint, my company has a free document viewer that can display PDF files and will soon support Office documents. We sell the underlying technology, but only for Windows.

  • I love Django and use it for all my personal projects, but I really think that .NET and Java will have better third-party support for what you need, and most of your code will be portable to SharePoint or Alfresco if you decide to move to one of them later.

EDIT: more information on point 3, as requested:

http://blogs.msdn.com/mikefitz/archive/2005/03/14/395112.aspx
http://blogs.msdn.com/stcheng/archive/2008/12/17/wss-use-rpc-protocol-to-access-wss-v3-site.aspx
White papers: http://msdn.microsoft.com/en-us/library/ms442469.aspx
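
For the "correct headers" half of point 3 only (not the RPC protocol itself), a minimal sketch could look like this. The helper name `document_headers` is hypothetical, and the choice of `inline` over `attachment` is an assumption; which exact headers a given browser or Office version needs in order to open the document in the application should be verified against the links above.

```python
import os
from urllib.parse import quote

# Registered MIME types for the common formats named in the question.
OFFICE_TYPES = {
    ".doc": "application/msword",
    ".xls": "application/vnd.ms-excel",
    ".ppt": "application/vnd.ms-powerpoint",
    ".docx": "application/vnd.openxmlformats-officedocument"
             ".wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument"
             ".spreadsheetml.sheet",
    ".pptx": "application/vnd.openxmlformats-officedocument"
             ".presentationml.presentation",
    ".pdf": "application/pdf",
    ".txt": "text/plain",
}


def document_headers(filename: str, size: int) -> dict:
    """Build response headers for serving a stored document.

    Hypothetical helper: unknown extensions fall back to a generic
    binary type, which browsers typically offer as a download.
    """
    ext = os.path.splitext(filename)[1].lower()
    content_type = OFFICE_TYPES.get(ext, "application/octet-stream")
    # RFC 6266 extended form, so non-ASCII file names survive.
    disposition = "inline; filename*=UTF-8''" + quote(filename)
    return {
        "Content-Type": content_type,
        "Content-Length": str(size),
        "Content-Disposition": disposition,
    }
```

In a Django view, these values would be set on the response object before the file bytes are streamed back.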

+1

Alfresco should be a great solution. It supports every requirement on your list, except the government-specific one.

But if you are building from scratch, perhaps at least take ideas from it?

Storage: The contents of each file are saved on the file system. Easy to manage, store, back up, and more. Files are not stored under their original names, though: the contents are stored in binary form, and the file is named with a hash (I assume a hash of the contents?).

Metadata: placed in the database. Quick to access, change, update, and more. Each node has properties: name, title, description, dates, audit information, everything you need. This is just data, and all of it is stored in a property table.

Search: Alfresco uses Solr for search; previously it was Lucene. I have run quite large installations, and with the Lucene index on an SSD it was blazing fast (Lucene is fast anyway). It indexes both the contents and the properties of each file, so you get to the node identifier quickly.
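
To make "indexes contents and properties so you get to the node identifier" concrete, here is a toy inverted index. It is not how Lucene or Solr are implemented (no ranking, stemming, or on-disk segments), and the class name `TinyIndex` is made up; it only shows the term-to-node-id mapping that makes lookups fast.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: term -> set of node ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _terms(text):
        # Lowercase and split on non-word characters.
        return re.findall(r"\w+", text.lower())

    def add(self, node_id, content, properties):
        """Index both the extracted text and the property values."""
        for term in self._terms(content):
            self._postings[term].add(node_id)
        for value in properties.values():
            for term in self._terms(str(value)):
                self._postings[term].add(node_id)

    def search(self, query):
        """Return the ids of nodes that contain every query term."""
        terms = self._terms(query)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result
```

A search returns node ids only; the properties and the file path for each hit would then be fetched from the metadata database.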

Alfresco implements CIFS, as well as WebDAV, FTP, and much more. The point is that you can simply mount it on users' desktops as folders or drives.

There is a web interface and a central managed repository; all the requirements are covered. And since it is open source, you can take parts of the source and use them in your project, although it would be much better to take the "Alfresco Community" edition and contribute a bit if you feel like it.

0

Source: https://habr.com/ru/post/1286436/

