Using Hadoop through a SOCKS proxy?

So, our Hadoop cluster runs on a set of nodes and can only be accessed from those nodes. You SSH into one of them and do your work.

Since that is rather annoying, but (understandably) nobody will even attempt to configure access control so that the cluster is usable from outside, I am trying the next best thing, i.e. using SSH to run a SOCKS proxy into the cluster:

$ ssh -D localhost:10000 the.gateway cat 
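(The trailing cat is only there to keep the session open; with standard OpenSSH flags you can skip the remote command entirely via -N, which does the same thing:)

$ ssh -N -D localhost:10000 the.gateway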

There are whispers of SOCKS support (naturally, I have found no documentation for it), and apparently the relevant settings go into core-site.xml:

 <property> <name>fs.default.name</name> <value>hdfs://reachable.from.behind.proxy:1234/</value></property> <property> <name>mapred.job.tracker</name> <value>reachable.from.behind.proxy:5678</value></property> <property> <name>hadoop.rpc.socket.factory.class.default</name> <value>org.apache.hadoop.net.SocksSocketFactory</value></property> <property> <name>hadoop.socks.server</name> <value>localhost:10000</value></property> 

Except hadoop fs -ls / still fails, with no mention of SOCKS.

Any tips?


I am only trying to run jobs, not administer the cluster. I only need to access HDFS and submit jobs, over SOCKS. (There seems to be a completely separate topic about using SSL/proxies between cluster nodes; I don't want that. My machine should not be part of the cluster, just a client.)

Is there any useful documentation? To illustrate my failure to find anything useful: I discovered the configuration values above by running the Hadoop client under strace -f and checking which configuration files it read.
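For concreteness, a sketch of that kind of invocation (on Linux, where the relevant syscalls are open/openat; the interesting hits are the *-site.xml files):

$ strace -f -e trace=open,openat hadoop fs -ls / 2>&1 | grep '\.xml'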

Is there a description somewhere of which configuration values it even reacts to? (I have literally found zero reference documentation, just tutorials of varying outdatedness; I hope I have simply missed something?)

Is there a way to dump the configuration values it is actually using?
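(One candidate I have not verified: org.apache.hadoop.conf.Configuration reportedly has a main() that writes the merged configuration out as XML, and the hadoop launcher can run an arbitrary class, so something like this might work:)

$ hadoop org.apache.hadoop.conf.Configuration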

1 answer

The source code implementing this was added in HADOOP-1822: https://issues.apache.org/jira/browse/HADOOP-1822

But this blog post also notes that you need to change the default socket factory class to the SOCKS one:

http://rainerpeter.wordpress.com/2014/02/12/connect-to-hdfs-using-a-proxy/

with

<property> <name>hadoop.rpc.socket.factory.class.default</name> <value>org.apache.hadoop.net.SocksSocketFactory</value> </property>

Edit: note that the properties go in different files:

  • fs.default.name, hadoop.socks.server and hadoop.rpc.socket.factory.class.default need to go into core-site.xml
  • mapred.job.tracker and mapred.job.tracker.http.address need to go into mapred-site.xml (for the MapReduce configuration; see the sketch below)
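Putting that together, a sketch of the resulting layout, reusing the host names and placeholder ports from the question:

core-site.xml:

 <property>
   <name>fs.default.name</name>
   <value>hdfs://reachable.from.behind.proxy:1234/</value>
 </property>
 <property>
   <name>hadoop.rpc.socket.factory.class.default</name>
   <value>org.apache.hadoop.net.SocksSocketFactory</value>
 </property>
 <property>
   <name>hadoop.socks.server</name>
   <value>localhost:10000</value>
 </property>

mapred-site.xml:

 <property>
   <name>mapred.job.tracker</name>
   <value>reachable.from.behind.proxy:5678</value>
 </property>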
