Hi,
I'm writing a web scraper toolkit in C++ which I'll be using to write several web scrapers. I decided to save myself some time by using MSHTML to parse pages, but I've run into a number of problems with it. I'm not sure if this is exactly the right forum for this issue, but it's the closest I could find.
Getting a single page in and parsing it is no problem. I can walk the DOM nicely and pick off whatever info I want. Pretty easy. (At least, after writing some wrappers for the otherwise laborious COM interfaces.) The problem is getting to the next page.
If I do anything that would normally cause navigation, such as inducing a click on a link or submitting a form, it launches IE to browse the page interactively. According to MSDN, this is what MSHTML will do if the application doesn't implement IHlinkFrame. So, I implemented IHlinkFrame.
To my dismay, the same thing happened! Upon further investigation, I found that MSHTML does not even query my application object for the IHlinkFrame interface. After much searching, I could not find any explanation of this, or of what I am supposed to do to handle navigation programmatically.
I finally gave up and decided to not induce MSHTML to navigate pages. For simple links, this is just a matter of loading the next page directly using its URL. For forms, this means actually walking the DOM to build up the submission data for the form. I begrudgingly wrote code to do this, even though I shouldn't have had to, since it's obviously something IE does routinely.
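For reference, the encoding half of that form-submission code looks roughly like the sketch below. It assumes the DOM walk has already produced a list of name/value pairs and that the form uses the default application/x-www-form-urlencoded encoding. The names urlEncode, buildPostData and FormFields are hypothetical, made up for this illustration, and are not part of MSHTML.

```cpp
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: percent-encode one field name or value for
// application/x-www-form-urlencoded. Letters, digits and "*-._" pass
// through, a space becomes '+', every other byte becomes %XX.
std::string urlEncode(const std::string& in) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : in) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '*') {
            out += static_cast<char>(c);
        } else if (c == ' ') {
            out += '+';
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Assumed input shape: (name, value) pairs collected from the form's
// controls while walking the DOM, in document order.
using FormFields = std::vector<std::pair<std::string, std::string>>;

// Join the encoded pairs as name=value&name=value...
std::string buildPostData(const FormFields& fields) {
    std::string body;
    for (const auto& field : fields) {
        if (!body.empty()) body += '&';
        body += urlEncode(field.first);
        body += '=';
        body += urlEncode(field.second);
    }
    return body;
}
```

The DOM walk itself still has to decide which controls contribute a pair (for example, skipping disabled controls and unchecked checkboxes), which is the tedious part this sketch leaves out.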
I was then assaulted by even more MSHTML drama. Once I have the POST data, for the life of me, I can't figure out how to get MSHTML to use it.
Per the MSDN documentation, you would ordinarily provide POST data in the BINDINFO structure passed to IBindStatusCallback::GetBindInfo. Apparently, though, IBindStatusCallback does not work with MSHTML, and you're supposed to use IPropertyNotifySink instead. I can't find any MSHTML property that would carry POST data.
After much frustration, the only workaround I can think of is to not only build the POST data myself, but also fetch the page myself directly using WININET.DLL, save the response to a local disk file, and then give that file to MSHTML.
I find it ridiculous that I should have to employ such a roundabout approach. The whole point behind using MSHTML was to avoid duplication of effort.
Any info will be greatly appreciated.
Regards,
Kevin