MSHTML based scraper


Hi,

I'm writing a web scraper toolkit in C++ which I'll be using to write several web scrapers. I decided to save myself some time by using MSHTML to parse pages, but I've run into a number of problems with it. I'm not sure if this is exactly the right forum for this issue, but it's the closest I could find.

Getting a single page in and parsing it is no problem. I can walk the DOM nicely and pick off whatever info I want. Pretty easy. (At least, after writing some wrappers for the otherwise laborious COM interfaces.) The problem is getting to the next page.

If I do anything that would normally cause navigation, such as inducing a click on a link or submitting a form, it launches IE to browse the page interactively. According to MSDN, this is what MSHTML will do if the application doesn't implement IHlinkFrame. So, I implemented IHlinkFrame.

To my dismay, the same thing happened! Upon further investigation, I found that MSHTML does not even query my application object for the IHlinkFrame interface. After much investigation, I could not find any explanation of this or what I was supposed to do to be able to handle navigation programmatically.

I finally gave up and decided to not induce MSHTML to navigate pages. For simple links, this is just a matter of loading the next page directly using its URL. For forms, this means actually walking the DOM to build up the submission data for the form. I begrudgingly wrote code to do this, even though I shouldn't have had to, since it's obviously something IE does routinely.
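For illustration, the encoding half of that form-walking step might look something like the sketch below. It assumes the name/value pairs have already been collected from the form's elements in the DOM, and that the form uses the default application/x-www-form-urlencoded encoding. The names FormUrlEncode, FormFields and BuildFormBody are made up for this example and are not part of MSHTML or any other library.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: encode one field name or value for an
// application/x-www-form-urlencoded body. Letters, digits and "-_.*"
// pass through, a space becomes '+', every other byte becomes %HH.
std::string FormUrlEncode(const std::string& in) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : in) {
        const bool safe =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '_' || c == '.' || c == '*';
        if (safe) {
            out += static_cast<char>(c);
        } else if (c == ' ') {
            out += '+';
        } else {
            out += '%';
            out += kHex[c >> 4];
            out += kHex[c & 0x0F];
        }
    }
    return out;
}

// Hypothetical container for the fields gathered while walking the form.
using FormFields = std::vector<std::pair<std::string, std::string>>;

// Join the encoded pairs as name=value, separated by '&', in document order.
std::string BuildFormBody(const FormFields& fields) {
    std::string body;
    for (const auto& field : fields) {
        if (!body.empty()) body += '&';
        body += FormUrlEncode(field.first);
        body += '=';
        body += FormUrlEncode(field.second);
    }
    return body;
}
```

Calling BuildFormBody with the pairs ("user", "kevin") and ("query", "a b&c") would produce the body text user=kevin&query=a+b%26c. Deciding which form elements actually contribute a pair (checked boxes, selected options, the clicked submit button, and so on) is a separate step that this sketch leaves out.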

I was then assaulted by even more MSHTML drama. Once I have the POST data, for the life of me, I can't figure out how to get MSHTML to use it.

Per the MSDN documentation, you would ordinarily provide POST data in the BINDINFO structure passed to IBindStatusCallback::GetBindInfo. Apparently, though, IBindStatusCallback does not work with MSHTML, and you're supposed to use IPropertyNotifySink instead. But I can't find any property of MSHTML that would let me provide POST data.

After much frustration, the only way I can think of dealing with this is to not only build the POST data myself, but also obtain the page myself directly using WININET.DLL, then save the resulting page to a local disk file, which can then be given to MSHTML.

I find it ridiculous that I should have to employ such a roundabout approach. The whole point behind using MSHTML was to avoid duplication of effort.

Any info will be greatly appreciated.

Regards,

Kevin

BrandyLove

Hi Kevin,

One choice here would be to host the WebBrowser control instead of MSHTML directly. Mshtml.dll is the core rendering engine, but if you host the WebOC (aka shdocvw.dll) you will get navigation support as well. See http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/webbrowser/browser_control_node_entry.asp?frame=true

Let us know if there is a reason that the web browser control is not suitable for your purposes.

Thanks
-Dave




Dave Massy, Senior Program Manager, Internet Explorer
Dave Massy

Hi Dave,

Thank you for the reply.

Actually, I had tried hosting the web browser control, since MSDN mentions that it handles navigation. Unfortunately, I didn't even get as far with it as I did with MSHTML. I could at least get MSHTML to parse a single page, while I couldn't get the web browser control to load at all.

I suspect that it has something to do with the web browser control expecting that it is being hosted from a full-fledged graphical application. Meanwhile, my app is not intended to be interactive, so it doesn't even create a single window. (It's actually a console process.) It seemed as though I would have to implement all the GUI functionality of an ActiveX container in order to use the control, and this is rather excessive given the nature of the application.

Let me know if either the web browser control should be able to work without a GUI, or if there's anything I'm missing in terms of how to programmatically respond to navigation occurring in MSHTML.

Many thanks,

Kevin

BrandyLove

There is a really old MSDN sample called WalkAll that shows you how to embed a non-UI MSHTML parser combined with a navigational event sink through the HTMLWindowEvents2 connection point. It's also pretty easy to add a sink for HTMLDocumentEvents2 or any of the other event interfaces available. It was written a long time ago, but it still ranks as one of the best examples of how to accomplish what you're trying to do.

WalkAll Source:
http://msdn.microsoft.com/archive/default.asp?url=/archive/en-us/samples/internet/browser/walkall/default.asp

Dispatch Interfaces Reference:
http://msdn.microsoft.com/workshop/browser/mshtml/reference/events/events.asp

WalkAll is a command shell app. It's based on 'old school' COM, where you implement all the interfaces yourself in C++ instead of using something like ATL, so it's self-contained and has no dependencies. This might be a good place for you to start. I used it myself when I first started working with MSHTML years ago, and it's quite good.

Regards,
Jim

JimMathies67

Hi Jim,

Thank you for the reply.

Actually, I did already check out the WalkAll sample, which is why I got as far as I did pretty easily. Unfortunately, the sample doesn't actually follow any links, so I wasn't able to glean the mechanisms involved.

I could set up an event sink to get scripting events like onbeforeunload, onunload, or onload, but there doesn't seem to be any obvious way to get from having a scripting event to having a moniker to pass to IPersistMoniker::Load on MSHTML in order to get the next page to actually load under programmatic control.

The only thing I can think of is that maybe I could wait for an unload event and then call IPersistMoniker::GetCurMoniker, but I have a funny feeling that I'm just going to get a moniker to the same resource that was initially loaded. (This is because whenever I do anything that would load another resource, it loads in a new process (IEXPLORE.EXE), so I assume the MSHTML I created is technically still on the first page.)

MSDN reference is very terse on this subject, and what it does mention (IHlinkFrame) doesn't even get queried for. Beyond this, I have no clue, apart from very roundabout workarounds.

Regards,

Kevin

BrandyLove