Processing (and indexing) 40M documents as fast as possible

Good day. My problem is basically this: I need to process 37,800,000 files.

Each "file" is really more than a single file. What I have is:

  • 37,800,000 XML documents.
  • Over 120 million Tiff images.

Each of the XML documents refers to one or more Tiff images and provides a set of common keywords for the images that it represents.

What I need to create is a system that parses each of the XML files (which contain not only the keywords I need, but also a lot of garbage). For each file, it must store the index keywords in the database (as columns) along with the path to the images. I store only the path because I do not think it is a good idea to keep the images themselves in the database.

The ultimate goal is that users can search the database by index keyword, and the system loads the image or images associated with that index entry.

I am already writing a parser using XPath and defining a db schema (it's simple). But I am stuck on two things, which make my system very slow and occasionally cause it to throw SQLExceptions:

I assume that, to avoid filling the PC's memory while processing files, I need something like pagination in reverse: sending the parsed items to the db in batches (say, one packet every 1000 documents). How to implement this is the first of my problems.
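The "packet every 1000 documents" idea can be sketched as a small buffer that never holds more than one batch in memory and hands each full batch to a callback. This is only an illustration: the class name `DocumentBatcher` and the callback-based design are assumptions, not something taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: collects parsed documents and emits them in fixed-size batches. */
final class DocumentBatcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;   // e.g. a method that writes one batch to the db
    private final List<T> buffer;

    DocumentBatcher(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
        this.buffer = new ArrayList<>(batchSize);
    }

    /** Adds one parsed document; flushes automatically when the batch is full. */
    void add(T doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is buffered to the sink; call once more after the last document. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // pass a copy so the sink may keep the list
        buffer.clear();
    }
}
```

With a batch size of 1000, memory use stays bounded by one batch regardless of how many of the 37.8 million documents have been read so far.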

-, XML , , : (, db), , , ( , ).

- , ? Java JSP , MySQL.

.

.

"dwFileName" "FileInfo". - "DW5BasketFileName". , , , ( 001 .

- 4 .

<DWDocument DW5BasketFileName="DOCU0001.001">
  <FileInfos>
    <ImageInfos>
      <ImageInfo id="0,0,0" nPages="0">
        <FileInfo fileName="c:\bandejas\otra5\D0372001.DWTiff" dwFileName="D0001001.DWTiff" signedFileName="D0372001.DWTiff" type="normal" length="66732" />
      </ImageInfo>
    </ImageInfos>
  </FileInfos>
  <FileDatas />
  <Section number="0" startPage="0" dwguid="d3f269ed-e57b-4131-863f-51d147ae51a3">
    <Metadata version="0">
      <SystemProperties>
        <DocID>36919</DocID>
        <DiskNo>1</DiskNo>
        <PageCount>1</PageCount>
        <Flags>2</Flags>
        <StoreUser>DIGITAD1</StoreUser>
        <Offset>0</Offset>
        <ModificationUser>ESCANER1</ModificationUser>
        <StoreDateTime>2009-07-23T21:41:18</StoreDateTime>
        <ModificationDateTime>2009-07-24T14:36:03</ModificationDateTime>
      </SystemProperties>
      <FieldProperties>
        <TextVar length="20" field="NO__REGISTRO" id="0">10186028</TextVar>
        <TextVar length="20" field="IDENTIFICACION" id="1">85091039325</TextVar>
        <TextVar length="40" field="APELLIDOS" id="32">DYMINSKI MORALES</TextVar>
        <TextVar length="40" field="NOMBRES" id="33">JHONATAN OSCAR</TextVar>
        <Date field="FECHA_DEL_REGISTRO" id="64">1985-10-10T00:00:00</Date>
      </FieldProperties>
      <DatabaseProperties />
      <StoreProperties DocumentName="10/10/1985 12:00:00 a.m." />
    </Metadata>
    <Page number="0">
      <Rendition type="original">
        <Content id="0,0,0" pageNumberInFile="0" />
        <Annotation>
          <Layer id="1" z_order="0" dwguid="5c52b1f0-c520-4535-9957-b64aa7834264">
            <LayerLocation x="0" y="0" />
            <CreateUser>ESCANER1</CreateUser>
            <CreateTime>2009-07-24T14:37:28</CreateTime>
            <Entry dwguid="d36f8516-94ce-4454-b835-55c072b8b0c4">
              <DisplayFlags>16</DisplayFlags>
              <CreateUser>ESCANER1</CreateUser>
              <CreateTime>2009-07-24T14:37:29</CreateTime>
              <Rectangle x="6" y="0" width="1602" height="20" flags="0" size="10" color="#ffffff" bkgcolor="#000000" />
            </Entry>
            <Entry dwguid="b2381b9f-fae2-49e7-9bef-4d9cf4f15a3f">
              <DisplayFlags>16</DisplayFlags>
              <CreateUser>ESCANER1</CreateUser>
              <CreateTime>2009-07-24T14:37:31</CreateTime>
              <Rectangle x="1587" y="23" width="21" height="1823" flags="0" size="10" color="#ffffff" bkgcolor="#000000" />
            </Entry>
            <Entry dwguid="9917196d-4384-4052-8193-8379a61be387">
              <DisplayFlags>16</DisplayFlags>
              <CreateUser>ESCANER1</CreateUser>
              <CreateTime>2009-07-24T14:37:33</CreateTime>
              <Rectangle x="0" y="1836" width="1594" height="10" flags="0" size="10" color="#ffffff" bkgcolor="#000000" />
            </Entry>
            <Entry dwguid="3513e0c8-a6c9-42ec-ae9c-dc084376fcdb">
              <DisplayFlags>16</DisplayFlags>
              <CreateUser>ESCANER1</CreateUser>
              <CreateTime>2009-07-24T14:37:35</CreateTime>
              <Rectangle x="0" y="0" width="23" height="1839" flags="0" size="10" color="#ffffff" bkgcolor="#000000" />
            </Entry>
          </Layer>
          <DW4CheckSum dwCheckSum="1479972439" dwDate="131663617" dwTime="319564778" dwImageSize="66732" dwSource="0" source="" />
        </Annotation>
      </Rendition>
    </Page>
  </Section>
</DWDocument>
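For illustration, the sample above could be reduced to the values of interest with a few XPath expressions. This is a minimal sketch that assumes the useful parts are the `DW5BasketFileName` attribute, each `FileInfo/@dwFileName`, and the children of `FieldProperties`; the class and field names (`DwDocumentParser`, `Parsed`) are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical parser that keeps only the index fields and image file names. */
final class DwDocumentParser {

    /** Values extracted from one DWDocument. */
    static final class Parsed {
        final String basketFileName;
        final List<String> imageFiles = new ArrayList<>();
        final Map<String, String> fields = new LinkedHashMap<>();

        Parsed(String basketFileName) {
            this.basketFileName = basketFileName;
        }
    }

    static Parsed parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        Parsed out = new Parsed(doc.getDocumentElement().getAttribute("DW5BasketFileName"));

        // One entry per referenced image file.
        NodeList files = (NodeList) xpath.evaluate(
                "/DWDocument/FileInfos/ImageInfos/ImageInfo/FileInfo/@dwFileName",
                doc, XPathConstants.NODESET);
        for (int i = 0; i < files.getLength(); i++) {
            out.imageFiles.add(files.item(i).getNodeValue());
        }

        // Every child of FieldProperties that has a "field" attribute (TextVar, Date, ...).
        NodeList props = (NodeList) xpath.evaluate(
                "/DWDocument/Section/Metadata/FieldProperties/*[@field]",
                doc, XPathConstants.NODESET);
        for (int i = 0; i < props.getLength(); i++) {
            Element e = (Element) props.item(i);
            out.fields.put(e.getAttribute("field"), e.getTextContent().trim());
        }
        return out;
    }
}
```

Everything under `Page` (annotations, rectangles, checksums) is simply never touched, so none of it needs to be kept in application data structures.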
+3
5

, db. . , , - 120M x avg # /. :

start transaction
for each index file
    parse file
    for each keyword
        insert (keyword,imagename) into db
commit transaction

, . , , , , concurrency / .

, "n" , , 10 000:

inserts = 0
reset = 10000
start transaction
for each index file
    parse file
    for each keyword
        insert (keyword,imagename) into db
        inserts += 1
        if inserts % reset == 0
            commit transaction
            start transaction
commit transaction
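In JDBC terms, the pseudocode above might look roughly like the following. It is a sketch under assumptions: the `KEYWORDS` table with `keyword` and `imagename` columns, the `Row` type, and the class name are illustrative, and the statement batching via `addBatch`/`executeBatch` is an addition on top of the periodic commit.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Hypothetical helper: inserts (keyword, imagename) rows, committing every `reset` rows. */
final class KeywordBatchInserter {

    /** One (keyword, imagename) pair. */
    static final class Row {
        final String keyword;
        final String imageName;

        Row(String keyword, String imageName) {
            this.keyword = keyword;
            this.imageName = imageName;
        }
    }

    /** Returns the number of commits performed. */
    static int insertAll(Connection conn, Iterable<Row> rows, int reset) throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);            // "start transaction"
        int commits = 0;
        int pending = 0;
        // Table and column names are assumptions for this sketch.
        String sql = "INSERT INTO KEYWORDS (keyword, imagename) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (Row row : rows) {
                ps.setString(1, row.keyword);
                ps.setString(2, row.imageName);
                ps.addBatch();
                if (++pending == reset) {     // every `reset` inserts: send and commit
                    ps.executeBatch();
                    conn.commit();
                    commits++;
                    pending = 0;
                }
            }
            if (pending > 0) {                // final partial batch
                ps.executeBatch();
                conn.commit();
                commits++;
            }
        } catch (SQLException e) {
            conn.rollback();                  // undoes only the uncommitted batch
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
        return commits;
    }
}
```

A failure loses at most the rows of the current uncommitted batch; batches committed earlier stay in the database.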

- SELECT , , , .

inserts = 0
reset = 10000
start transaction
for each index file
    parse file
    if "SELECT count(*) from IMAGES where name='<insert imagename>'" == 0
        for each keyword
            insert (keyword,imagename) into db
            inserts += 1
            if inserts % reset == 0
                commit transaction
                start transaction
commit transaction
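The existence check in the pseudocode above could be written with a parameterized query rather than by concatenating the image name into the SQL string. A minimal sketch, assuming an `IMAGES` table with a `name` column as in the pseudocode; the class and method names are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Hypothetical helper for skipping index files whose image is already in the db. */
final class ImageIndexCheck {

    /** Returns true if at least one IMAGES row already has this name. */
    static boolean alreadyIndexed(Connection conn, String imageName) throws SQLException {
        String sql = "SELECT COUNT(*) FROM IMAGES WHERE name = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, imageName);       // bound parameter, no string concatenation
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() && rs.getInt(1) > 0;
            }
        }
    }
}
```

At this volume the lookup is only practical if `IMAGES.name` is indexed; otherwise each check scans the table.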

. , , , , , , imagename, . , , .

, KEYWORDS , . , , imagename. , , . ( SELECT , IMAGES , SELECT .)

+2

, , . xml 1k, 37 , . .

, , .

  • , 1000, , ( xml )
  • , .
  • xml , , .

, sql update , , . , , /, .

, xml , , , , , , , , .

+5

Solr, . Solr StreamingUpdateSolrServer .

Solr Java Lucene. Java api, .

StreamingUpdateSolrServer .

StreamingUpdateSolrServer Solr 1.4, ( 1,4 Solr build , ).

+3

, ( ), , , , , ( , xmls). , , . , , ... , .

0

Just an addendum: I would strongly recommend storing the images in the database as well. If you are already starting out with 120 million image files, you will soon reach the point where even modern file systems hit their limits.

You will probably also have to get rid of MySQL and move to a real database (DB2, Oracle or SQL Server).

0
Source: https://habr.com/ru/post/1720240/