So, our Hadoop cluster runs on a set of nodes and can only be accessed from those nodes: you SSH into one of them and do your work there.
Since that is rather annoying, and since (understandably) nobody is going to set up the access control needed to expose the cluster externally, the next best thing seems to be starting a SOCKS proxy into the cluster over SSH:
$ ssh -D localhost:10000 the.gateway cat
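(The trailing cat is only there to keep the session open; as far as I can tell the same thing works without running any remote command at all:

$ ssh -N -D localhost:10000 the.gateway

)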
There are whispers of SOCKS support (naturally, I could not find any documentation for it), and it apparently goes in core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://reachable.from.behind.proxy:1234/</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>reachable.from.behind.proxy:5678</value>
</property>
<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>
<property>
  <name>hadoop.socks.server</name>
  <value>localhost:10000</value>
</property>
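(To rule out core-site.xml simply not being picked up, I believe the same values can be passed as generic options on the command line, since hadoop fs goes through the generic options parser:

$ hadoop fs \
    -D fs.default.name=hdfs://reachable.from.behind.proxy:1234/ \
    -D hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.SocksSocketFactory \
    -D hadoop.socks.server=localhost:10000 \
    -ls /

)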
Except hadoop fs -ls / still fails, with no mention of SOCKS anywhere.
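The tunnel itself can be sanity-checked independently of Hadoop, e.g. with curl's SOCKS support (assuming the namenode web UI is on its default port 50070; --socks5-hostname makes DNS resolution happen through the tunnel too):

$ curl --socks5-hostname localhost:10000 http://reachable.from.behind.proxy:50070/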
Any tips?
I am only trying to run jobs, not administer the cluster. All I need is to access HDFS and submit jobs via SOCKS. (There seems to be a completely separate topic about using SSL/proxies between the cluster nodes themselves; I don't want that. My machine should not be part of the cluster, just a client.)
Is there any useful documentation? To illustrate my failure to find anything useful: I discovered the configuration values above by running the hadoop client under strace -f and checking which configuration files it read.
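(The incantation was roughly this, tracing which files the JVM opens and filtering for the XML config files:

$ strace -f -e trace=open,openat hadoop fs -ls / 2>&1 | grep '\.xml'

)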
Is there a description anywhere of which configuration values it even reacts to? (I have found literally zero reference documentation, just tutorials of varying degrees of outdatedness; I hope I have missed something?)
Is there a way to dump the configuration values it actually uses?
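(The closest thing I know of: in the versions I have looked at, org.apache.hadoop.conf.Configuration has a main() that writes the merged configuration out as XML, and the hadoop script can run an arbitrary class, so something like this should show what the client resolves from its classpath:

$ hadoop org.apache.hadoop.conf.Configuration

)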