The following code can do the trick:
mapFileNames[source_, filenames_, target_] :=
  Module[{depth = FileNameDepth[source]},
    FileNameJoin[{target, FileNameDrop[#, depth]}] & /@ filenames
  ]

htmlTreeToPlainText[source_, target_] :=
  Module[{htmlFiles, textFiles, targetDirs},
    htmlFiles = FileNames["*.html", source, Infinity];
    textFiles = StringReplace[
      mapFileNames[source, htmlFiles, target],
      f__ ~~ ".html" ~~ EndOfString :> f ~~ ".txt"
    ];
    targetDirs = DeleteDuplicates[FileNameDrop[#, -1] & /@ textFiles];
    If[FileExistsQ[target], DeleteDirectory[target, DeleteContents -> True]];
    Scan[CreateDirectory[#, CreateIntermediateDirectories -> True] &, targetDirs];
    Scan[
      Export[#[[2]], Import[#[[1]], "Plaintext"], "Text"] &,
      Transpose[{htmlFiles, textFiles}]
    ]
  ]

Usage example (warning: the target directory will be deleted first!):
htmlTreeToPlainText["/users/me/web", "/users/me/desktop/bucket"]
How it works
Mathematica's various FileName... functions are useful in this context. We start by defining a helper function, mapFileNames, that accepts a source directory, a list of file paths under that source directory, and a target directory. It returns the list of corresponding file paths under the target directory.

mapFileNames[source_, filenames_, target_] :=
  Module[{depth = FileNameDepth[source]},
    FileNameJoin[{target, FileNameDrop[#, depth]}] & /@ filenames
  ]

The function uses FileNameDrop to remove the leading elements (those belonging to the source path) from each file name, and FileNameJoin to prepend the target path to each result. The number of leading elements to drop is determined by applying FileNameDepth to the source path.
For instance:
In[83]:= mapFileNames["/a/b", {"/a/b/x.txt", "/a/b/c/y.txt"}, "/d"]
Out[83]= {"/d/x.txt", "/d/c/y.txt"}

Using this function, we can convert the list of paths of the HTML files in the source directory (source) into the corresponding list of paths of text files in the target directory (target):
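The three steps inside mapFileNames can also be evaluated one at a time. The following is a small sketch using made-up relative paths (a/b, d and the file name are placeholders, not paths from the question). It only manipulates strings and does not touch the file system.

```mathematica
(* Placeholder paths, built with FileNameJoin so the platform's separator is used *)
source = FileNameJoin[{"a", "b"}];
file = FileNameJoin[{"a", "b", "c", "y.txt"}];

(* Number of path elements in the source directory; 2 for a/b *)
depth = FileNameDepth[source]

(* Drop that many leading elements, leaving the path relative to source *)
relative = FileNameDrop[file, depth]

(* Re-root the relative path under the placeholder target directory "d" *)
FileNameJoin[{"d", relative}]
```

Keeping the relative part intact is what preserves the sub-directory structure of the source tree under the target.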
htmlFiles = FileNames["*.html", source, Infinity]
textFiles = StringReplace[
  mapFileNames[source, htmlFiles, target],
  f__ ~~ ".html" ~~ EndOfString :> f ~~ ".txt"
]

These statements collect the list of HTML files, map them to the target directory, and then change the file extension from .html to .txt. Now we can extract the necessary directory names from the resulting text file paths:

targetDirs = DeleteDuplicates[FileNameDrop[#, -1] & /@ textFiles]

Again FileNameDrop is used, this time to drop the file name part from each text file path.
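The extension change and the directory extraction can both be tried on literal strings before running anything against real files. This is only a sketch: the paths below are invented for illustration, and the second result is shown as it would appear on a Unix-like system.

```mathematica
(* Invented example paths; nothing is read from or written to disk *)
paths = {"/d/x.html", "/d/c/y.html", "/d/c/z.html"};

(* Replace only a trailing ".html"; EndOfString anchors the match at the end *)
textPaths = StringReplace[paths, f__ ~~ ".html" ~~ EndOfString :> f ~~ ".txt"]
(* {"/d/x.txt", "/d/c/y.txt", "/d/c/z.txt"} *)

(* Drop the last path element (the file name) and remove repeated directories *)
dirs = DeleteDuplicates[FileNameDrop[#, -1] & /@ textPaths]
(* on a Unix-like system: {"/d", "/d/c"} *)
```

DeleteDuplicates matters here because several files usually share a directory, and each directory only needs to be created once.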
Then we need to delete the target directory (if it already exists) and create the new necessary directories:
If[FileExistsQ[target], DeleteDirectory[target, DeleteContents -> True]]
Scan[CreateDirectory[#, CreateIntermediateDirectories -> True] &, targetDirs]

Now we can safely perform the HTML-to-text conversion, knowing that the target directories already exist:

Scan[
  Export[#[[2]], Import[#[[1]], "Plaintext"], "Text"] &,
  Transpose[{htmlFiles, textFiles}]
]
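To check the Import/Export step on its own, one option is to run it on a single throwaway file first. The sketch below assumes a scratch directory created with CreateDirectory[]; the file names and the HTML snippet are made up for the test. The extracted text is not shown because its exact whitespace may vary.

```mathematica
(* Scratch directory in the temporary location; removed again at the end *)
dir = CreateDirectory[];
htmlFile = FileNameJoin[{dir, "sample.html"}];
textFile = FileNameJoin[{dir, "sample.txt"}];

(* Write a tiny HTML document as raw text *)
Export[htmlFile, "<html><body><p>Hello, world</p></body></html>", "Text"];

(* Same conversion step as in htmlTreeToPlainText, applied to one file *)
Export[textFile, Import[htmlFile, "Plaintext"], "Text"];

(* Read the result back to inspect it *)
result = Import[textFile, "Text"]

(* Clean up the scratch directory *)
DeleteDirectory[dir, DeleteContents -> True];
```

If the result looks right for one file, the same step can be trusted inside the Scan over the whole tree.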