Saturday, January 11, 2020

Selecting a pattern from files and copying it to the appropriate place in another file

I would really appreciate help with the following problem. Essentially, I want to copy a specific pattern (the title of HTML pages, in this case marked by <h2>TITLE</h2>) to an index. This index contains links to the scanned files, whose names are numbered. Specifically, I want the index to show not just links to the files titled with their number (e.g. 1.html) but also the title, e.g. "1 - Theory of Everything.html". The title is whatever is set in the <h2> element of each file, and this pattern does not change. Not every file has a title, which is why each file needs to be searched for <h2> tags, in a loop or something similar.

Let me give you some examples:

Samples of the scanned content files:

1.html content:

text
 <h2 id="theory-of-everything">Theory of Everything</h2>
text

2.html content:

text
 <h2 id="other-theory">Other Theory</h2>
text

To select the titles in the above example, I already have a somewhat clumsy (but working) for loop set up:

for i in *.html; do cat "$i" | grep "<h2" | grep -oP '(?<=\"\>).*(?=\<)' ; done

The output of that is, correctly:

Theory of Everything
Other Theory

However, I don't know how to proceed from here. I need to get the titles of all HTML files into index.html. So far, index.html refers to 1.html and 2.html as follows (extract from index.html):

<p><a href="#1">1</a></p>
<p><a href="#2">2</a></p>

(So 1.html becomes #1, as it is later integrated into another container with an iframe element containing the whole link to 1.html.) The above is what the index links look like now. Instead of showing just "1" or "2" as the title of the link/file, I want to add the title selected above, as in:

1 - Theory of Everything
2 - Other Theory

So the HTML part would have to become:

<p><a href="#1">1 - Theory of Everything</a></p>
<p><a href="#2">2 - Other Theory</a></p>

Unfortunately, I have no idea how to paste the titles selected in the for loop into the correct line and position in index.html, or whether a for loop is even the right approach for what I want to do. Any help would be much appreciated.
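
One possible way to do this, as a minimal sketch rather than a definitive solution: loop over the numbered files, pull out the first <h2> title with sed, and rewrite the matching link in index.html in place. The sketch assumes GNU sed (for `sed -i`), that every index link has exactly the form shown above, and that each title sits on a single line. The scratch directory, the sample 3.html without a title, and the variable names are made up for illustration.

```shell
#!/usr/bin/env bash
set -eu

# Scratch directory with sample files mirroring the question (illustrative only).
workdir=$(mktemp -d)
cd "$workdir"
printf 'text\n <h2 id="theory-of-everything">Theory of Everything</h2>\ntext\n' > 1.html
printf 'text\n <h2 id="other-theory">Other Theory</h2>\ntext\n' > 2.html
printf 'text without a title\n' > 3.html
printf '<p><a href="#1">1</a></p>\n<p><a href="#2">2</a></p>\n<p><a href="#3">3</a></p>\n' > index.html

# Only numbered files match this glob, so index.html itself is skipped.
for f in [0-9]*.html; do
    n=${f%.html}   # "1.html" -> "1"

    # Text of the first <h2>...</h2> in the file; empty if there is none.
    title=$(sed -n 's/.*<h2[^>]*>\([^<]*\)<\/h2>.*/\1/p' "$f" | head -n 1)

    # Files without a title keep their plain numbered link.
    [ -n "$title" ] || continue

    # Escape characters that are special in a sed replacement (\, &, and the | delimiter).
    esc=$(printf '%s' "$title" | sed 's/[\\&|]/\\&/g')

    # Assumption: GNU sed. On BSD/macOS sed, -i needs an explicit suffix argument.
    sed -i "s|<a href=\"#$n\">$n</a>|<a href=\"#$n\">$n - $esc</a>|" index.html
done

cat index.html
```

Files without an <h2> are skipped, so their links keep the bare number, and running the loop a second time changes nothing because the plain numbered link text no longer matches. For HTML that is less regular than these samples, a real HTML parser would likely be more robust than sed.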
