Getting strings at extremely high speed

I have a very large table (hundreds of millions of rows, containing numbers and strings) in Oracle, and I need to read the entire contents of this table, format it, and write it to a file or some other sink. My solution usually looks like this:

package my.odp;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.sql.*;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;

public class Main {

    public static volatile boolean finished = false;

    public static void main(final String[] args) throws InterruptedException {
        final ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<String>(10000);

        final Thread writeWorker = new Thread("ODP Writer") {
            public void run() {
                try {
                    BufferedWriter writer = new BufferedWriter(new FileWriter(new File(args[0])));
                    try {
                        // Keep draining until the reader is done AND the queue is empty,
                        // so rows still queued when 'finished' flips are not lost.
                        while (!finished || !queue.isEmpty()) {
                            String str = queue.poll(200, TimeUnit.MILLISECONDS);
                            if (str == null) {
                                continue;
                            }
                            writer.write(str);
                            writer.write('\n');
                        }
                    } finally {
                        writer.close(); // flushes the buffered output
                    }
                } catch (Throwable e) {
                    e.printStackTrace();
                }
            }
        };

        final Thread readerThread = new Thread("ODP Reader") {
            public void run() {
                try {
                    Class.forName("oracle.jdbc.OracleDriver");
                    Connection conn = DriverManager.getConnection(
                            "jdbc:oracle:thin:@//xxx.xxx.xxx.xxx:1521/orcl", "user", "pass");
                    Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                            ResultSet.CONCUR_READ_ONLY);
                    stmt.setFetchSize(500000);
                    ResultSet rs = stmt.executeQuery("select * from src_schema.big_table_view");
                    System.out.println("Fetching result");
                    while (rs.next()) {
                        StringBuilder sb = new StringBuilder();
                        sb.append(rs.getString(1)).append('\t');  // OWNER
                        sb.append(rs.getString(2)).append('\t');  // OBJECT_NAME
                        sb.append(rs.getString(3)).append('\t');  // SUBOBJECT_NAME
                        sb.append(rs.getLong(4)).append('\t');    // OBJECT_ID
                        sb.append(rs.getLong(5)).append('\t');    // DATA_OBJECT_ID
                        sb.append(rs.getString(6)).append('\t');  // OBJECT_TYPE
                        sb.append(rs.getString(7)).append('\t');  // CREATED
                        sb.append(rs.getString(8)).append('\t');  // LAST_DDL_TIME
                        sb.append(rs.getString(9)).append('\t');  // TIMESTAMP
                        sb.append(rs.getString(10)).append('\t'); // STATUS
                        sb.append(rs.getString(11)).append('\t'); // TEMPORARY
                        sb.append(rs.getString(12)).append('\t'); // GENERATED
                        sb.append(rs.getString(13)).append('\t'); // SECONDARY
                        sb.append(rs.getString(14)).append('\t'); // NAMESPACE
                        sb.append(rs.getString(15));              // EDITION_NAME
                        queue.put(sb.toString());
                    }
                    rs.close();
                    stmt.close();
                    conn.close();
                } catch (Throwable e) {
                    e.printStackTrace();
                } finally {
                    finished = true; // let the writer drain and exit, even on failure
                }
            }
        };

        long startTime = System.currentTimeMillis();
        writeWorker.start();
        readerThread.start();
        System.out.println("Waiting for join..");
        writeWorker.join();
        System.out.println("Exit: " + (System.currentTimeMillis() - startTime));
    }
}

There are two threads: one that fetches rows from the result set and one that writes the string values. The measured extraction speed was about 10 Mb/s, and in my case I need it to be 10 times faster. The profiler shows that the most time-consuming methods are

oracle.jdbc.driver.OracleResultSetImpl.getString()

and

oracle.net.ns.Packet.receive()

Do you have any idea how to make JDBC load the data faster? Any ideas on optimizing the query, optimizing row fetching, tuning the JDBC driver or using a different one, using Oracle's native JDBC implementations, or tuning Oracle itself are appreciated.

UPDATE: I have compiled and summarized the results of the discussion below:

  • I have no access to the DBMS server other than a connection to the Oracle db, and the server cannot connect to any external resource. Dump and extraction utilities that use the server or a remote file system cannot be applied, and it is also impossible to install or use any external Java or PL/SQL procedures on the server. A connection to execute queries is all there is.

  • I used a profiler and dug into the Oracle JDBC driver. I found that the most expensive operation is reading the data, i.e. Socket.read(). All the string fields are represented as a single char array and have virtually no effect on performance. To be sure, I profiled the whole application, and Socket.read() is by far the most expensive operation. Extracting fields, building strings, and writing data cost almost nothing. The problem is reading the data, and nothing else.

  • Optimizations in how the data is represented on the server side have no real effect. String concatenation and timestamp conversion make no difference to performance.

  • The application was rewritten to have several reader threads that put ready rows into the writer queue. Each thread has its own connection; no pools are used because they slow down the extraction (I tried the UCP pool, which Oracle recommends, and it consumed about 10% of the execution time, so I dropped it). The result set's fetchSize was also increased, because switching from the default (10) to 50,000 gives up to a 50% performance increase.

  • I tested how the multi-threaded version works with 4 reader threads and found that increasing the number of readers only slows the extraction down. I tried launching 2 instances, each with two readers; together they extracted double the data in the same time as one instance. I don't know why this happens, but it seems the Oracle driver has some performance limitation: one application with 4 independent connections is slower than 2 instances of the application with two connections each. (The profiler was used to verify that the driver's Socket.read() is still the main problem; all other parts work fine in multi-threaded mode.)

  • I tried extracting all the data with SAS, and it performs the same extraction 2 times faster than JDBC, with both using the same connection to Oracle and neither able to use any dump operations. Oracle claims that the thin JDBC driver is as fast as the native one.
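The multi-reader/single-writer pipeline described in the update above can be sketched in miniature. This is a minimal, self-contained version where in-memory fake rows stand in for the per-connection result sets (the row contents and counts are assumptions for illustration); the sentinel-based shutdown avoids the lost-rows race that a bare `finished` flag can cause:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class PipelineSketch {
    private static final String POISON = "\u0000EOF"; // sentinel marking end of stream

    public static void main(String[] args) throws Exception {
        final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);
        final int readers = 2;
        final AtomicInteger liveReaders = new AtomicInteger(readers);
        ExecutorService pool = Executors.newFixedThreadPool(readers + 1);

        // Writer: drains the queue until the sentinel arrives.
        Future<Long> written = pool.submit(() -> {
            long count = 0;
            while (true) {
                String row = queue.take();
                if (row.equals(POISON)) break;
                count++; // real code would write to a BufferedWriter here
            }
            return count;
        });

        // Readers: each would own its own JDBC connection; here they fake rows.
        for (int r = 0; r < readers; r++) {
            final int id = r;
            pool.submit(() -> {
                for (int i = 0; i < 1000; i++) {
                    queue.put("reader" + id + "\trow" + i);
                }
                if (liveReaders.decrementAndGet() == 0) {
                    queue.put(POISON); // last reader to finish signals completion
                }
                return null;
            });
        }

        System.out.println("rows written: " + written.get());
        pool.shutdown();
    }
}
```

With one sentinel enqueued only after the last reader finishes, the writer cannot exit while rows are still queued.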

Perhaps Oracle has some other way to perform a fast bulk extraction to a remote host, via ODBC or something else?

3 answers

This assumes you have already checked the basics: network elements such as interfaces, firewalls, and proxies, as well as the hardware of the database server.

Option 1:

Instead of:

 Class.forName("oracle.jdbc.OracleDriver");
 Connection conn = DriverManager.getConnection(
         "jdbc:oracle:thin:@//xxx.xxx.xxx.xxx:1521/orcl", "user", "pass");

try using:

 OracleDataSource ods = new OracleDataSource();
 java.util.Properties prop = new java.util.Properties();
 prop.setProperty("MinLimit", "2");
 prop.setProperty("MaxLimit", "10");
 String url = "jdbc:oracle:oci8:@//xxx.xxx.xxx.xxx:1521/orcl";
 ods.setURL(url);
 ods.setUser("USER");
 ods.setPassword("PWD");
 ods.setConnectionCachingEnabled(true);
 ods.setConnectionCacheProperties(prop);
 ods.setConnectionCacheName("ImplicitCache01");

More here

Option 2: Fetch size

As Stephen pointedly noted, the fetch size seems far too large.

And with a fetch size of 500,000, what are your -Xms and -Xmx? Also, what is the maximum heap usage shown in the profiler?

Option 3: DB

  • Check indexes and query plan for src_schema.big_table_view

  • Is this a one-off tool or an application system? If it is just a tool, you could add parallel degrees, index hints, partitioning, etc., depending on the database setup
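For the one-off-tool case, a parallel hint is the usual first experiment; this is only a sketch, and the degree of parallelism (4 here) is an assumption to be tuned with your DBA:

```sql
SELECT /*+ PARALLEL(t, 4) */ *
FROM src_schema.big_table_view t;
```

Whether the hint helps depends on the server's CPU and I/O headroom and on whether parallel execution is enabled for the session.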

Option 4: Threads

Say n <= the number of cores on the application server.

You could run n writer threads, each configured to process a specific bucket (e.g. thread 1 processes rows 0 to 10,000), writing to n different files; once everything is done, merge the files together with a low-level OS command.

However, none of this should be hard-coded, as it is now. 'n' and the buckets should be computed at runtime, and creating more threads than your system can support only makes things worse.
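A minimal sketch of this bucket-per-thread idea, with the bucket rows faked in memory (an assumption for illustration) and the final merge done in Java instead of an OS command so the example is self-contained:

```java
import java.io.OutputStream;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

public class BucketMerge {
    public static void main(String[] args) throws Exception {
        int n = 4; // real code would derive this from Runtime.getRuntime().availableProcessors()
        ExecutorService pool = Executors.newFixedThreadPool(n);
        List<Future<Path>> parts = new ArrayList<>();

        // Each task writes its own bucket to its own temp file.
        for (int b = 0; b < n; b++) {
            final int bucket = b;
            parts.add(pool.submit(() -> {
                Path part = Files.createTempFile("bucket" + bucket + "-", ".tsv");
                List<String> rows = new ArrayList<>();
                for (int i = 0; i < 3; i++) {       // stand-in for one bucket's rows
                    rows.add("bucket" + bucket + "\trow" + i);
                }
                Files.write(part, rows);
                return part;
            }));
        }

        // Merge the parts in bucket order (an OS-level `cat` would do the same).
        Path merged = Files.createTempFile("merged-", ".tsv");
        try (OutputStream out = Files.newOutputStream(merged)) {
            for (Future<Path> f : parts) {
                Files.copy(f.get(), out);
                Files.delete(f.get());
            }
        }
        pool.shutdown();
        System.out.println("merged lines: " + Files.readAllLines(merged).size());
    }
}
```

Because each bucket goes to its own file, the writers never contend for a single output stream, and the merge cost is a sequential copy at the end.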

Option 5:

Instead

 select * from src_schema.big_table_view 

you can use

 SELECT column1||CHR(9)||column2||CHR(9).....||columnN FROM src_schema.big_table_view 

This avoids creating 500,000 StringBuilders and Strings (assuming there is no other complex formatting). CHR(9) is the tab character.

Option 6:

In the meantime, you could also check with your DBA for any problems on the database side and raise an SR with Oracle Support.


You are profiling the wrong thing

The methods you listed are most likely already highly optimized. I have profiled systems where the method called most often, with the most time spent inside it, was StringBuffer.append() inside the Oracle JDBC code, because the whole system used PreparedStatement, which calls that method a lot! Needless to say, in our case it was a red herring.

Profile your network traffic:

If your connection is saturated, that is your bottleneck, not the code you listed.

This needs to be done on the server side if Oracle has to be the data source. You are never going to pull hundreds of millions of records over a network connection at 10X the speed you get now, unless you have 10X the network cards at both endpoints and they are all bonded. Even then, I'm skeptical about 10X.

If you are really limited to Java and Oracle, the only way to get more throughput than you currently have is to run Java as a stored procedure on the server(s), generate the files you need there, and then pull them from the remote system.

I have built systems that processed millions of transactions per minute; that kind of throughput does not happen over a single network connection, it happens across a fleet of machines with multiple network interfaces on dedicated send/receive switches, on a subnet isolated from the rest of the data-center traffic.

Also

Also, your code is naive at best. You should never create and manage threads manually; ExecutorService has been around for 10 years, use it! And ExecutorCompletionService is what you want in a case like this.

ListenableFuture from Guava is an even better choice, if you can use Guava.
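To make the ExecutorCompletionService suggestion concrete, here is a minimal, self-contained sketch; the "chunk" tasks and their row counts are stand-ins for real fetch work, and results are consumed in completion order rather than submission order:

```java
import java.util.concurrent.*;

public class CompletionSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        CompletionService<Integer> done = new ExecutorCompletionService<>(pool);

        // Submit three "chunk" tasks; each would fetch one slice of the table.
        for (int chunk = 0; chunk < 3; chunk++) {
            final int c = chunk;
            done.submit(() -> c * 100); // stand-in for rows fetched from one chunk
        }

        // Consume results as tasks finish, whichever finishes first.
        int total = 0;
        for (int i = 0; i < 3; i++) {
            total += done.take().get(); // blocks until the next task completes
        }
        pool.shutdown();
        System.out.println("total rows: " + total);
    }
}
```

The payoff over joining threads by hand is that a slow chunk never blocks the consumption of faster ones, and failures surface as ExecutionException instead of being swallowed inside run().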


It looks like you have already found and changed the row prefetch setting. However, according to the Oracle documentation:

"There is no maximum prefetch setting, but empirical evidence suggests that 10 is effective. Oracle has never observed a performance benefit from setting row prefetch above 50. If you do not set the default row prefetch for a connection, 10 is the default."

You set it to 500,000. Try winding it back down to about 50, as Oracle recommends. (Why? An excessively large prefetch size can cause the server or the client to use excessive amounts of memory to buffer the prefetched data. That can have a knock-on effect on other things, resulting in reduced throughput.)
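A back-of-the-envelope estimate shows the scale of that buffering; the 2,000 bytes/row figure is an assumption (the real size depends on the 15 columns and driver overhead), but the arithmetic makes the point:

```java
public class PrefetchMemory {
    public static void main(String[] args) {
        long bytesPerRow = 2_000; // assumed average buffered row size, including driver overhead
        long[] fetchSizes = {10, 50, 500_000};
        for (long f : fetchSizes) {
            long bytes = f * bytesPerRow;
            // Rough client-side buffer needed to hold one prefetched batch.
            System.out.println("fetchSize " + f + " ~ " + bytes
                    + " bytes (" + bytes / 1_048_576 + " MiB) buffered per statement");
        }
    }
}
```

Under that assumption, a fetch size of 500,000 means close to a gigabyte of buffered rows per statement, which squares with the -Xms/-Xmx question in the first answer.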

Link (from Oracle 10g documentation):


You may be able to get more throughput by running concurrent queries in several Java threads (e.g. over separate "sections" of the table), writing each result set to a separate stream/file. But then you have the problem of stitching the output streams/files together. (And whether you get an overall improvement depends on the number of client- and server-side cores, network and NIC capacity, and disk I/O capacity.)

Also, I can't think of a way to do this faster in Java. But you might try PL/SQL or something lower-level. (I'm not an Oracle expert; talk to your DBAs.)

A speedup factor of 10 in Java is... ambitious.


Source: https://habr.com/ru/post/973887/

