So I decided to convert the pmwiki pages into a pbwiki toc construct. It would put all the content onto one page, and use the
- Scrape the pmwiki content index for all the meaningful links.
- Scrape out the title and urls of each link.
- Grab the content from each link.
- Reformat it all to work in the pbwiki format.
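The first two steps above (pull the links out of the index page, then keep each link's title and URL) can be sketched roughly like this. This is not the importer from this post: it uses the standard library's `html.parser` rather than htmllib or Beautiful Soup, and the names `LinkCollector`, `scrape_links`, and the example base URL are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (title, url) pairs for every <a href=...> in a page.

    Hypothetical helper for illustration; not the original importer.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []     # finished (title, url) pairs
        self._href = None   # absolute URL of the <a> we are currently inside
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors that have no href (e.g. <a name="...">)
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.links.append((title, self._href))
            self._href = None


def scrape_links(html, base_url):
    """Return a list of (title, url) tuples found in the given HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up index fragment and base URL, just to show the call.
sample = '<ul><li><a href="Main/HomePage">Home Page</a></li></ul>'
links = scrape_links(sample, "http://example.com/wiki/")
```

Grabbing the content behind each URL and reformatting it for pbwiki would be separate steps on top of the `links` list.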
Immediately I'm unhappy with htmllib. The docs suck. And it just seems awkward to use, even once I figure it out. It doesn't feel Pythonic, although I'm sure I'm wrong in that respect somehow. It's just that, for me, my Python pseudocode usually ends up close to the finished code, and that was not the case here.
Then a work buddy told me about Beautiful Soup. It's an HTML/XML parser that is really easy to use and can work with badly formed HTML, like the sort that pmwiki sometimes generates. It's optimized for usability rather than speed. That's fine with me, because this is a one-time operation on maybe 150-200 entries.
The final effort worked really nicely. Not super fast, but really easy to code. Thanks to Beautiful Soup, what I thought would be a quick and simple task stayed that way.
2 comments:
Hi! This seems pretty cool; we'd love to make it easy for pmwiki users to make their home with PBwiki. Would you mind sharing your importer with us? Feel free to email me at david@pbwiki.com.
Emailed it to you from my HQ NASA account. It's a first draft and I'm planning to make changes to it to make it more reusable. If you do a PHP version, let me know!