Search and convert HTML files and move them En-Masse

Question

I use Mathematica to work with a large array of website files, which I have mirrored on my own system. They are distributed across several hundred directories, with many subdirectories. So, for example, I have:

 /users/me/test/directory1
 /users/me/test/directory1/subdirectory2
 [times a hundred]
 /users/me/test/directory2
 /users/me/test/directory2/subdirectory5
 [etc. etc.]

What I need to do is go into each directory, Import[] all of the HTML files as "Plaintext", and then put them in a directory elsewhere on my system, named after "directory1". So far, using Do[] loops, I have only managed a rough version: the best I can do right now is dump the ".txt" files into the source directories, which is not ideal, since they are then still scattered throughout my system.

To find my files, I use directoryfiles = FileNames["*.htm*", {"*"}, Infinity];

Some additional nasty problems:

(1) Duplicates: is there a way for Mathematica to deal with duplicates - that is, if we come across another index_en.html, can it be renamed as index_en_1.html?

(2) Directories: because there are so many directories, unless I keep calling SetDirectory and CreateDirectory over and over, Mathematica runs into problems all the time.

This probably all sounds a bit confusing. Basically: is there an efficient way for Mathematica to find a ton of HTML files distributed across hundreds of directories and subdirectories, import them as plain text, and export them somewhere else? [It is important for me to know that they came from directory1, but that is all.]

- edited for clarity below -

Here is the code I have:

 SetDirectory["/users/me/web/"];
 dirlist = FileNames[];
 directoryPrefix = "/users/me/web/";
 plainHTMLBucket = "";
 Do[
   directory = directoryPrefix <> dirname;
   exportPrefix = "/users/me/desktop/bucket/";
   SetDirectory[directory];
   allFiles = FileNames["*.htm*", {"*"}, Infinity];
   plainHTMLBucket = "";
   Do[
     plainHTML = Import[filename, "Plaintext"];
     plainHTMLBucket = AppendTo[plainHTMLBucket, plainHTML];
     , {filename, allFiles}];
   Export[exportPrefix <> dirname <> ".txt", plainHTMLBucket];
   Print["We Have Reached Here"];
   , {dirname, dirlist}];

What is wrong with it, from my point of view? Besides being messy, it does not quite do what I want: I would prefer the files to be kept separate rather than merged into one big file. That is, each file should be imported and exported as a separate file, in a directory called "directory1", although in another location. The problem comes when I try to mirror these directories (the target directories do not exist yet, and I am having trouble using CreateDirectory[] to create them dynamically).

My apologies for the confusion here; I know it shows in this question.

+6

wolfram-mathematica

programming_historian Nov 15 '11 at 15:12

2 answers

To set the current directory, do something like

 SetDirectory["~/Desktop/"]

Now suppose I want to get a list of all directories in the current directory. I can do

 dirs=Pick[ #, (FileType[#] == Directory) & /@ # ] &@FileNames[]

which returns a list of the names of all directories in the current directory that you set earlier (I use nested pure functions, which may be confusing ...). You can then apply fn to each of the directories with Scan[fn, dirs]. So you can wrap the Pick[] construct in a function and then use it to recurse down your tree.

It's simple, but I'm not sure it is what you want. Maybe you could be more explicit about what you are after, so that we don't sit here guessing and getting it wrong.
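As a rough sketch of the recursion mentioned above (not necessarily what the asker needs): the Pick[] construct can be wrapped in a function that takes the directory as an argument, and a second function can call it on every level of the tree. The names subDirs and walkTree are made up for this illustration.

```mathematica
(* subDirs is a hypothetical helper: the Pick[] construct, parameterised by directory.
   It returns the paths of the immediate subdirectories of dir. *)
subDirs[dir_String] := Pick[#, FileType /@ #, Directory] &@FileNames["*", {dir}]

(* walkTree is also hypothetical: apply fn to dir, then recurse into each subdirectory *)
walkTree[fn_, dir_String] := (fn[dir]; Scan[walkTree[fn, #] &, subDirs[dir]])
```

For example, walkTree[Print, Directory[]] would print the current directory and every directory below it. Unlike the version above, this does not depend on SetDirectory, because each path returned by FileNames already includes its parent directory.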

+4

acl Nov 15 '11 at 16:52

Wreach · Accepted Answer · 2011-11-15T19:21:30+0000

The following code can do the trick:

 mapFileNames[source_, filenames_, target_] :=
   Module[{depth = FileNameDepth[source]}
   , FileNameJoin[{target, FileNameDrop[#, depth]}]& /@ filenames
   ]

 htmlTreeToPlainText[source_, target_] :=
   Module[{htmlFiles, textFiles, targetDirs}
   , htmlFiles = FileNames["*.html", source, Infinity]
   ; textFiles = StringReplace[
       mapFileNames[source, htmlFiles, target]
     , f__~~".html"~~EndOfString :> f~~".txt"
     ]
   ; targetDirs = DeleteDuplicates[FileNameDrop[#, -1]& /@ textFiles]
   ; If[FileExistsQ[target], DeleteDirectory[target, DeleteContents -> True]]
   ; Scan[CreateDirectory[#, CreateIntermediateDirectories -> True]&, targetDirs]
   ; Scan[
       Export[#[[2]], Import[#[[1]], "Plaintext"], "Text"] &
     , Transpose[{htmlFiles, textFiles}]
     ]
   ]

Usage example (warning: the target directory will be deleted first!):

 htmlTreeToPlainText["/users/me/web", "/users/me/desktop/bucket"]

How it works

Mathematica's various FileName... functions are useful in this context. We start by defining a helper function, mapFileNames, that accepts a source directory, a list of file names that reside under the source directory, and a target directory. It returns a list of file paths that identify the corresponding locations under the target directory.

 mapFileNames[source_, filenames_, target_] :=
   Module[{depth = FileNameDepth[source]}
   , FileNameJoin[{target, FileNameDrop[#, depth]}]& /@ filenames
   ]

The function uses FileNameDrop to remove the leading elements of the source path from each file name, and FileNameJoin to prepend the target path to each result. The number of leading elements to drop is determined by applying FileNameDepth to the source path.

For instance:

 In[83]:= mapFileNames["/a/b", {"/a/b/x.txt", "/a/b/c/y.txt"}, "/d"]
 Out[83]= {"/d/x.txt", "/d/c/y.txt"}
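To see how the pieces fit together, the three steps inside mapFileNames can be evaluated one at a time. This is only an illustrative sketch: the variable names are made up here, and the paths are the ones from the example above and need not exist on disk.

```mathematica
(* Illustrative values taken from the example above; nothing is read from disk *)
source = "/a/b"; target = "/d"; file = "/a/b/c/y.txt";

depth = FileNameDepth[source];          (* number of path elements in the source path *)
relative = FileNameDrop[file, depth];   (* what remains of file below the source directory *)
mapped = FileNameJoin[{target, relative}]  (* the same relative path, re-rooted under target *)
```

On a Unix-style system, mapped should match the second element of the output shown above, "/d/c/y.txt".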

Using this function, we can translate the list of HTML file paths under the source directory (source) into the corresponding list of text file paths under the target directory (target):

 htmlFiles = FileNames["*.html", source, Infinity]
 textFiles = StringReplace[
     mapFileNames[source, htmlFiles, target]
   , f__~~".html"~~EndOfString :> f~~".txt"
   ]

These statements obtain the list of HTML files, map them to the target directory, and then change the file extension from .html to .txt. Now we can extract the necessary directory names from the resulting text file paths:

 targetDirs = DeleteDuplicates[FileNameDrop[#, -1]& /@ textFiles]

Again, FileNameDrop is used, this time to drop the file name component from each text file path.
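One caveat about the file-finding and renaming step: the code matches only files ending in .html, whereas the question searched for "*.htm*". If .htm files should be converted as well, a variation along the following lines might work. This is a sketch under that assumption; findHtmlFiles and toTextPath are hypothetical names, not part of the code above.

```mathematica
(* findHtmlFiles is a hypothetical helper: collect both extensions below source *)
findHtmlFiles[source_String] := FileNames[{"*.htm", "*.html"}, {source}, Infinity]

(* toTextPath is a hypothetical helper: swap a trailing .htm or .html for .txt,
   ignoring case; paths with any other ending are returned unchanged *)
toTextPath[path_String] :=
  StringReplace[path, "." ~~ ("html" | "htm") ~~ EndOfString -> ".txt", IgnoreCase -> True]
```

With these, htmlFiles would come from findHtmlFiles[source], and textFiles from mapping toTextPath over mapFileNames[source, htmlFiles, target]. Note that x.htm and x.html in the same directory would then both map to x.txt.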

Then we need to delete the target directory (if it already exists) and create the new necessary directories:

 If[FileExistsQ[target], DeleteDirectory[target, DeleteContents -> True]]
 Scan[CreateDirectory[#, CreateIntermediateDirectories -> True]&, targetDirs]
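If deleting the whole target directory is too drastic, one possible alternative is to create only the directories that are missing and leave everything else in place. The following is a sketch of that idea, not part of the code above; ensureDirectories is a hypothetical name.

```mathematica
(* ensureDirectories is a hypothetical helper: create only the directories that
   do not exist yet, leaving anything already under the target untouched *)
ensureDirectories[dirs_List] :=
  Scan[
    If[! DirectoryQ[#], CreateDirectory[#, CreateIntermediateDirectories -> True]] &,
    dirs
  ]
```

It would be called as ensureDirectories[targetDirs] in place of the two statements above. The trade-off is that text files left over from an earlier run stay in the target tree unless a later export overwrites them.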

Now we can perform the HTML-to-text conversion, safe in the knowledge that the target directories already exist:

 Scan[
     Export[#[[2]], Import[#[[1]], "Plaintext"], "Text"] &
   , Transpose[{htmlFiles, textFiles}]
   ]
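Because the directory tree is mirrored, two files with the same name in different directories do not collide, so the duplicates problem from the question does not arise here. If the files were instead dumped into a single flat directory, repeated names would need a suffix such as index_en_1.html. A minimal sketch of one way to do that follows; uniqueNames is a hypothetical helper and is not used by the code above.

```mathematica
(* uniqueNames is a hypothetical helper: the first occurrence of a name is kept,
   later occurrences get _1, _2, ... inserted before the extension *)
uniqueNames[names_List] := Module[{count, rename},
  count[_] = 0;  (* every name starts with zero earlier occurrences *)
  rename[name_, 0] := name;
  rename[name_, n_] := FileBaseName[name] <> "_" <> ToString[n] <>
    If[FileExtension[name] === "", "", "." <> FileExtension[name]];
  rename[#, count[#]++] & /@ names
]

uniqueNames[{"index_en.html", "a.html", "index_en.html", "index_en.html"}]
(* {"index_en.html", "a.html", "index_en_1.html", "index_en_2.html"} *)
```

This sketch does not check whether a generated name such as index_en_1.html already occurs in the input list, so it is only safe when the original names do not themselves end in such a suffix.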
