Good day. My problem is basically this: I need to process 37,800,000 files.
Each "file" is really more than a single file. What I have is:
- 37,800,000 XML documents.
- Over 120 million TIFF images.
Each XML document refers to one or more TIFF images and provides a set of common keywords for the images it represents.
What I need to build is a system that parses each of the XML files (which contain not only the keywords I need, but also a lot of garbage). For each file, it must store the index keywords in the database (as columns) along with the path to the images. I store only the path because I do not think it is a good idea to store the images themselves in the database.
The ultimate goal is that users can search the database by index keyword, and the system retrieves the image or images associated with that index.
I am already writing a parser using XPath and defining a database schema (it's simple). But I am stuck on two things, which make my system run very slowly and occasionally throw SQLExceptions:
To avoid filling the PC's memory while processing files, I assume I need something like pagination code, but in reverse: sending the parsed items to the database in batches (say, one packet every 1000 documents). How to implement this is the first of my problems.
-, XML , , : (, db), , , ( , ).
- , ? Java JSP , MySQL.
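The batching idea described above (sending parsed items to the database in packets of, say, 1000 documents) could be sketched roughly as follows. This is only an illustration: `BatchBuffer` is a hypothetical helper, not part of any library, and the sink is left abstract so the buffering logic stays independent of the database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: collects items and hands them to a sink in fixed-size batches. */
final class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> pending;

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
        this.pending = new ArrayList<>(batchSize);
    }

    /** Adds one item; flushes automatically once batchSize items are pending. */
    void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is pending to the sink; call once more after the last document. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending)); // copy, so the sink may keep the list
        pending.clear();
    }
}
```

With JDBC, the sink would typically loop over the batch, call `PreparedStatement.addBatch()` for each row, then `executeBatch()` and `Connection.commit()` with auto-commit turned off, so memory use stays bounded by the batch size. With MySQL Connector/J, the `rewriteBatchedStatements=true` connection property may speed up such batches further, though that is worth measuring on the real data.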
"dwFileName" "FileInfo". - "DW5BasketFileName". , , , ( 001 .
- 4 .
<DWDocument DW5BasketFileName="DOCU0001.001">
<FileInfos>
<ImageInfos>
<ImageInfo id="0,0,0" nPages="0">
<FileInfo fileName="c:\bandejas\otra5\D0372001.DWTiff" dwFileName="D0001001.DWTiff" signedFileName="D0372001.DWTiff" type="normal" length="66732" />
</ImageInfo>
</ImageInfos>
</FileInfos>
<FileDatas />
<Section number="0" startPage="0" dwguid="d3f269ed-e57b-4131-863f-51d147ae51a3">
<Metadata version="0">
<SystemProperties>
<DocID>36919</DocID>
<DiskNo>1</DiskNo>
<PageCount>1</PageCount>
<Flags>2</Flags>
<StoreUser>DIGITAD1</StoreUser>
<Offset>0</Offset>
<ModificationUser>ESCANER1</ModificationUser>
<StoreDateTime>2009-07-23T21:41:18</StoreDateTime>
<ModificationDateTime>2009-07-24T14:36:03</ModificationDateTime>
</SystemProperties>
<FieldProperties>
<TextVar length="20" field="NO__REGISTRO" id="0">10186028</TextVar>
<TextVar length="20" field="IDENTIFICACION" id="1">85091039325</TextVar>
<TextVar length="40" field="APELLIDOS" id="32">DYMINSKI MORALES</TextVar>
<TextVar length="40" field="NOMBRES" id="33">JHONATAN OSCAR</TextVar>
<Date field="FECHA_DEL_REGISTRO" id="64">1985-10-10T00:00:00</Date>
</FieldProperties>
<DatabaseProperties />
<StoreProperties DocumentName="10/10/1985 12:00:00 a.m." />
</Metadata>
<Page number="0">
<Rendition type="original">
<Content id="0,0,0" pageNumberInFile="0" />
<Annotation>
<Layer id="1" z_order="0" dwguid="5c52b1f0-c520-4535-9957-b64aa7834264">
<LayerLocation x="0" y="0" />
<CreateUser>ESCANER1</CreateUser>
<CreateTime>2009-07-24T14:37:28</CreateTime>
<Entry dwguid="d36f8516-94ce-4454-b835-55c072b8b0c4">
<DisplayFlags>16</DisplayFlags>
<CreateUser>ESCANER1</CreateUser>
<CreateTime>2009-07-24T14:37:29</CreateTime>
<Rectangle x="6" y="0" width="1602" height="20" flags="0" size="10" color="#ffffff" bkgcolor="#000000" />
</Entry>
<Entry dwguid="b2381b9f-fae2-49e7-9bef-4d9cf4f15a3f">
<DisplayFlags>16</DisplayFlags>
<CreateUser>ESCANER1</CreateUser>
<CreateTime>2009-07-24T14:37:31</CreateTime>
<Rectangle x="1587" y="23" width="21" height="1823" flags="0" size="10" color="#ffffff" bkgcolor="#000000" />
</Entry>
<Entry dwguid="9917196d-4384-4052-8193-8379a61be387">
<DisplayFlags>16</DisplayFlags>
<CreateUser>ESCANER1</CreateUser>
<CreateTime>2009-07-24T14:37:33</CreateTime>
<Rectangle x="0" y="1836" width="1594" height="10" flags="0" size="10" color="#ffffff" bkgcolor="#000000" />
</Entry>
<Entry dwguid="3513e0c8-a6c9-42ec-ae9c-dc084376fcdb">
<DisplayFlags>16</DisplayFlags>
<CreateUser>ESCANER1</CreateUser>
<CreateTime>2009-07-24T14:37:35</CreateTime>
<Rectangle x="0" y="0" width="23" height="1839" flags="0" size="10" color="#ffffff" bkgcolor="#000000" />
</Entry>
</Layer>
<DW4CheckSum dwCheckSum="1479972439" dwDate="131663617" dwTime="319564778" dwImageSize="66732" dwSource="0" source="" />
</Annotation>
</Rendition>
</Page>
</Section>
</DWDocument>
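For a document shaped like the sample above, a streaming parser can pull out the few values of interest and ignore the rest. The sketch below swaps in StAX (`javax.xml.stream`) in place of the XPath approach mentioned in the question, on the assumption that reading each file once without building a tree is cheaper at this volume. `DwIndex` and `DwIndexExtractor` are hypothetical names, and the choice of which attributes and elements to keep (`DW5BasketFileName`, `dwFileName`, `TextVar`, `Date`) is inferred from the sample only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical holder for the values extracted from one document. */
final class DwIndex {
    String basketFileName;                                     // DWDocument/@DW5BasketFileName
    final List<String> imageNames = new ArrayList<>();         // every FileInfo/@dwFileName
    final Map<String, String> fields = new LinkedHashMap<>();  // field name -> value
}

final class DwIndexExtractor {
    /** Reads the document once, front to back, keeping only the index data. */
    static DwIndex extract(String xml) {
        DwIndex index = new DwIndex();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // no DTDs expected
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                        continue; // text, end tags, comments: nothing to collect
                    }
                    String name = reader.getLocalName();
                    if (name.equals("DWDocument")) {
                        index.basketFileName =
                                reader.getAttributeValue(null, "DW5BasketFileName");
                    } else if (name.equals("FileInfo")) {
                        index.imageNames.add(reader.getAttributeValue(null, "dwFileName"));
                    } else if (name.equals("TextVar") || name.equals("Date")) {
                        String field = reader.getAttributeValue(null, "field");
                        index.fields.put(field, reader.getElementText());
                    }
                    // All other elements (annotations, checksums, ...) are skipped.
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML document", e);
        }
        return index;
    }
}
```

For real files, `createXMLStreamReader` also accepts an `InputStream`, so each document can be read straight from disk, and the resulting `DwIndex` can be mapped to one database row plus one row per image path.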