Rantings of a Selenium Contributor: 2012

Monday, December 24, 2012

Seeing "Info" Messages in a Log Does Not Automatically Imply "Error"

Here's what seems to be a common question among Java language bindings users of the IE driver (tl;dr):

Whenever I use the IE driver, I see an error in my log. The log output is below. How do I stop it?

Started InternetExplorerDriver server (32-bit)
2.28.0.0
Listening on port 48630
Sep 23, 2012 9:12:21 AM org.apache.http.impl.client.DefaultRequestDirector tryExecute
INFO: I/O exception (org.apache.http.NoHttpResponseException) caught when processing request: The target server failed to respond
Sep 23, 2012 9:12:21 AM org.apache.http.impl.client.DefaultRequestDirector tryExecute
INFO: Retrying request

Follow-up questions like, "What happens after you receive this message?" or, "Does the rest of your code run?" usually get a response like, "No, the rest of my code works fine with IE, I just see this error message, and I don't think I should be seeing any errors."

There are two answers to this question, a short one and a long one. The short one is, "Read the log message. It's clearly tagged as 'INFO', as in an informational message, and not indicative of any problem with the code?" I find that this question often comes from users of Eclipse, and that the Eclipse console has colored the message red, and people are so conditioned to see "red == bad" that they react to the format of the message rather than the content. The content of the message is flagged at a level that means, "Hey, nothing is wrong, we're just telling you about it."

If all you care about is how to get around the message, you can stop reading here. The longer answer to the question involves an explanation of why you might receive this message. The explanation requires a little knowledge of the architecture of the WebDriver language bindings in general, and the IE driver in particular. Remember that in the language bindings for most browsers, communication with the browser happens via a "server component", and the language bindings use a JSON-over-HTTP protocol to communicate with that server component. The server component could be an instance of a remote WebDriver server (whether written in Java, or another language); or it could be the Firefox browser extension that the FirefoxDriver installs into an anonymous profile and uses; or it could be an instance of chromedriver.exe; or, and this is the important bit for our purposes here, it could be an instance of IEDriverServer.exe. Each of these server components starts up an HTTP server with which the language bindings can communicate.

Let's think about what happens when you ask the Java language bindings to create a new instance of the InternetExplorerDriver class.

Locate the IEDriverServer.exe, either from the system PATH or from a Java system property
Launch IEDriverServer.exe, including its internal HTTP server
Use the org.apache.http.client.HttpClient class to send HTTP requests to the IEDriverServer's internal HTTP server to launch IE and establish a WebDriver session

Now looking at those steps, can you see the potential flaw in the execution? If you said "race condition," you earn a gold star. The Java method for launching a process will return before the IEDriverServer's HTTP server has had a chance to fully initialize itself and be ready to receive HTTP requests. However, since IEDriverServer.exe is running in a separate process, there's no way for the Java language bindings to know when the HTTP server has fully initialized. The only way to handle the situation is to poll for the HTTP server, repeatedly sending requests until you get a valid response, and that's the message you're seeing in the log. All the bindings are telling you is that the HTTP server hasn't yet been fully initialized, so it's trying the request again, nothing to see here, move along.

The bottom line is that the message is informational, not indicative of an error, or else it would have a different logging level. Remember, the format of a log message is incidental; the important thing is its content.

Wednesday, December 5, 2012

Not Providing an HTML Page? Think of the Kittens!

Consider the following request for help with Selenium WebDriver:

I'm having trouble getting something working in WebDriver. I'm getting a NoSuchElementException when I try to execute the following code:
driver.get("http://xxxxxx/");
driver.findElement(By.id("userName")).sendKeys("user1234");
What could be causing this exception, and how do I fix it?

Obviously, this code could have several potential problems. The ID attribute of the element may be wrong. The element the user is looking for could be inside a frame or iframe. The element may be added to the page DOM via some JavaScript library, and the browser hasn't had time to process the JavaScript yet when the findElement call is made. Or it could be something else. What's missing from this request for help, whether the request is made in a post to one of the user-facing mailing lists, or in an issue report, is context. Anyone reading the request won't know the structure of the page the user is automating, so any attempt at help is going to be pure speculation.

Most of the time, people wanting to help the requester will respond with something along the lines of, "Hey, we really can't give you any guidance unless we can see a page that reproduces the problem. Can you share the page, or a URL to a page that will demonstrate the issue?" Surprisingly often, this simple request for more context is met with a very large amount of resistance, along with reasons why it isn't possible. Almost equally as often, these reasons are bogus. Let's look at a few of these reasons, and see why they're bogus.

Evans Bogus Reason 1: "It's hosted on an intranet."

Just because a webpage happens to be hosted on an intranet doesn't mean it can't be shared. Yes, it will require some effort on your part to make it public. Such applications usually rely on some sort of database connection, or connection to internal resources, or LDAP integration, or some other such thing. Here's a little secret: Because WebDriver works only on the UI, your WebDriver problem most likely has nothing whatsoever to do with those underlying technologies. Yes, that means you'll probably have to stub out anything like that your application relies on before you can make it public, but it can be done.

Evans Bogus Reason 2: "We're so afraid someone's gonna steal our super-secret site concept/source code"

This reason is often imposed from on high by an upper management drone. It's penny-wise and pound-foolish. You're paying someone good money to develop automated code against your web application. The people you're paying are encountering difficulty to the degree that they need outside help to figure out why. Yet you're so scared that someone is going to steal your precious code or idea that you refuse to let them get help, and causing that money to be wasted because your automators aren't being productive.

I direct this next bit to those short-sighted nincompoops who refuse to get out of their people's way. If you're going to insist on making it hard on your WebDriver coders who are caught between a complex web application and your tunnel-vision, I'll make an offer here and now to sign whatever NDA is required to get them some help; however, I'll charge a fee for it. A big one. Much, much more than if you had just let your coders get the help they need in the first place.

Evans Bogus Reason 3: "Do you know how complicated my site's JS framework/CSS/HTML code is?"

The snarky jerk in me always wants to respond to this reason with something like, "Well, no, I don't know how complicated it is because I can't see your code from here, but it's highly likely that complexity is exactly your problem." Simplify the page down to plain, vanilla HTML if you can. It's not like WebDriver is entirely untested. There's a CI server that runs a bunch of tests (600+ at the time of this writing) on every single commit, against 5 version of IE, 4 versions of Firefox, Chrome, and Opera. If you're doing something reasonably common in your WebDriver code, it's probably a safe bet that it's not going to be globally broken for everybody, everywhere, against every web page.

Also, there's no requirement that someone needs to use the exact same site that's giving them the problem. One that is similar will do the trick, which is why we who try to provide help often say ask for "an HTML page, or a public URL to a page" that demonstrates the problem. Here's a tip, most of the JavaScript UI frameworks provide public demo pages. See if you can replicate the behavior from your WebDriver code against the demo page.

Evans Bogus Reason 4: "But I'm just a lowly tester, and I'm not allowed to touch the application code"

I can't believe this is actually put forth as a reason, since the solutions are many and obvious. Work on a private fork of the application code. Enlist a developer to help you. Try to reproduce your problem against a different, public site.

Remember, We're All Volunteers Here

The people who develop and provide support for WebDriver are unpaid volunteers. If you're having an issue with running WebDriver against your web application, and you need to ask for help why it's failing for you, it'll go a long way toward establishing goodwill (and getting you a productive answer) if you provide everything someone needs to reproduce your problem easily.

Monday, November 19, 2012

Are you kidding me, IE Driver? Another freaking thing to download?

Every now and again, I come across someone who remembers that the IE driver used to be completely self-contained for all of the supported language bindings, and realizes they need to download IEDriverServer.exe for the driver to work in more recent versions. Often, the question comes up as to why there's now a separate executable. The question is sometimes phrased like this:

You mean I have to download yet another component to use the IE driver now? I hated it when the Chrome driver forced me to do the same thing, and now you want me to go through even more hoops? Man, you Selenium developer guys really suck!

Admittedly, the need to download another component is a slight inconvenience. However, there were a few good reasons why this path was chosen.

The original implementation of the native (C++) code of the IE driver was in a .dll. In most of the client language bindings, this .dll was extracted at runtime from whatever packaging solution was appropriate for the language. For Java, this meant extracting it from the .jar; for .NET, this meant extracting it from a resource packed into the WebDriver.dll assembly. Ruby and Python packaging mostly relied on files laid out on disk, so no extraction was needed, just a reference to the path of the .dll. The language binding would then use its native API (JNI for Java, P/Invoke for .NET, ctypes for Python, FFI for Ruby) to load the .dll and call the exposed API for starting the "server" portion of the IE driver.

This worked alright for a time, especially in simple scenarios, but the development team eventually started noticing subtle differences in behavior between language bindings. For example, because of the way the .NET bindings loaded and managed the native .dll, it was able to support simultaneous multiple instances of IE, while the Java bindings did not. The Ruby bindings should have supported multiple instances, because it modeled its native code management after the .NET mechanism, but the native code interface the bindings used, FFI, didn't allow a loaded native library to be unloaded (i.e., it didn't support a call to the FreeLibrary Win32 API). All of this came down to the fact that each language's native code interaction method had slightly different semantics from the others.

Something had to be done to unify the user experience across languages. As it happens, using a separate process is a much more consistent story, because each language binding can use its process management API. Why does that result in more consistency? Because process management is defined by the operating system, the various versions of Windows in the case of the IE driver.

A happy side effect of moving to a separate executable is in the realm of 32-bit vs. 64-bit versions of the browser. It's a limitation of Windows that a 32-bit process cannot load a 64-bit .dll, and vice versa. So, this meant that if your language runtime was 64-bit, you could only run your WebDriver code against the 64-bit version of IE, and if it was important to you to run your WebDriver code against the 32-bit version of IE, you were out of luck. If this sounds like a far-fetched scenario, I'll point out that this was the default situation for .NET running on a 64-bit version of Windows. However, there is no restriction on a 64-bit process launching a 32-bit process. With the introduction of the standalone executable, it would be possible to run WebDriver code against a 32-bit or a 64-bit version of IE, no matter what the "bitness" of your language runtime.

Also, by moving the executable outside the normal delivery mechanism of the language bindings, delivery of the IE driver core is decoupled from full Selenium releases. That means it's possible to ship fixes in the IE driver without having to wait for a full release of Selenium. Since the introduction of the standalone IEDriverServer.exe executable, we've used this ability to deliver on bug fixes and functionality updates between releases of the language bindings.

Finally, the standalone executable is vastly easier to debug than a .dll loaded by the language bindings. Attaching to the standalone process is dead-simple from a debugger, and eliminates the guesswork of, "which java.exe (or ruby.exe or python.exe) process do I need to attach to?"

Of course, there is the matter of an extra thing that needs to be downloaded before you can run WebDriver code against IE. One might be tempted to try bundling the executable inside the language binding package like the .dll used to be, but extracting an executable at runtime and attempting to start it is what antivirus scanners just live to scream about. Protip: When you're designing your framework around WebDriver, use a command-line download utility like wget or curl to be able to download the IEDriverServer.exe from the web as you're setting up your environments.

Thursday, November 1, 2012

.NET Bindings: Whaddaymean "No response for URL"?

The Selenium project just pushed out version 2.26.0 for all of its languages, after a months-long hiatus between releases. The delay wasn't intentional, but it happened, so it's been awhile since the bindings were updated. As usual, the release was accompanied by a post to the user-facing mailing lists for the project. Also as usual, the first reply was asking a question about what didn't get fixed.

The issue, number 3719 in the issue tracker, involves the .NET bindings returning an intermittent failure for some operations. The message text of this error reads, "No response from server for url". The post to the mailing list basically asked why this issue wasn't fixed in the recent release. I had to struggle quite a bit not to summon the snark and respond,

"Sigh. How would you like this particular issue to be fixed?"

The .NET bindings use the .NET Framework's System.Net.HttpWebRequest class for communicating with a remote server that speaks the WebDriver JSON Wire Protocol. I must carefully note here that the term "remote server" can refer to many things. It can refer to an instance of the Java remote WebDriver server. It can also refer to an instance of IEDriverServer.exe or chromedriver.exe, the main components of the Internet Explorer and Chrome drivers, respectively. It can also refer to an instance of the Firefox extension that the FirefoxDriver uses internally to control Firefox. Note that in any of these cases, the "remote server" may, in fact, be running on the same machine as the client bindings. At present almost all of the driver implementations use this architecture of a server component running an HTTP server talking to the client bindings.

So the .NET bindings use the HttpWebRequest class to initiate command with the "server" component. To get the response back from the HTTP server, we call the GetResponse() method. Now, in the normal case, everything is just fine. The bindings get a valid response back from the server, interpret that response, and everything moves right along. Sometimes, the method throws a System.Net.WebException, like if the server is unreachable or the like. The bindings know about that possibility and catch the exception. The exception even has a .Response property on it to allow the bindings to continue to use a valid System.Net.HttpWebResponse to interpret what the remote WebDriver HTTP server is trying to say.

Sometimes, however, the HTTP server doesn't return any response, and it doesn't throw an exception. It just goes off into the aether, never to return. In that case, our response object is null, and here's the real question: What do you expect the .NET bindings to do in that case? The bindings have no idea of the status of the immediately preceding request. They don't know if it succeeded or failed. They don't know if the server is even still breathing or not. That means blindly attempting a retry would be futile at best, and destructive at worst. All the bindings can do is say, "Hey, we sent off a request, like you asked us to, but we didn't get a response back. Don't know what else to tell you, we tried. Sorry it didn't work out."

The worst part about this is that it looks like the bindings are at fault. The bindings are only reporting what happened, and I'm not sure what other approach any sane client could possibly do. Of course, you reading this may disagree. If so, and you have concrete ideas how to solve the problem that don't involve a blind retry or a complete rewrite of the .NET Framework's System.Net.HttpWebRequest class, I'd love to see the implementation. Show me the code; I love receiving patches.

Monday, August 20, 2012

You're Doing It Wrong: IE Protected Mode and WebDriver

There's a common problem most people run into with the Internet Explorer driver when they first start using it with IE 7 and above. Most people start by writing code that looks something like this, expecting it to work on a clean installation of Windows, or at least one with the default settings for Internet Explorer:

WebDriver driver = new InternetExplorerDriver();
driver.get("http://seleniumhq.org");
driver.quit();

Imagine their surprise when they get an exception that looks something like this:

org.openqa.selenium.WebDriverException: Unexpected error launching Internet Explorer. Protected Mode must be set to the same value (enabled or disabled) for all zones. (WARNING: The server did not provide any stacktrace information)

A careful reading of the exception's message tells one exactly what the problem is. People who don't bother to read the exception message then turn to their favorite search engine, and after a quick search, they often blindly modify their code to do something like the following:

DesiredCapabilities caps = DesiredCapabilities.internetExplorer();
caps.setCapability(
    InternetExplorerDriver.INTRODUCE_FLAKINESS_BY_IGNORING_SECURITY_DOMAINS,
    true);
WebDriver driver = new InternetExplorerDriver(caps);
driver.get("http://seleniumhq.org");
driver.quit();

While this will certainly get them past the initial exception, and will allow the test to run in most cases without incident, it's patently the Wrong Thing to do. The operative questions then are, "Why is it wrong, and what is the right way?" If you don't care about why it's wrong, and just want to know how to fix it correctly, you can skip the historical background by clicking on this handy tl;dr link.

Why does the IE driver require Protected Mode settings changes anyway?

Way back through the mists of time, before 2006, life was easy for automating Internet Explorer. A browser session was represented by a single instance of the iexplore.exe executable. A framework for driving IE could instantiate the browser as a COM object using CoCreateInstance(), or could easily get the COM interfaces to a running instance by using the presence of ActiveAccessibility and sending a WM_HTML_GETOBJECT message to the appropriate IE window handle. Once the framework had a pointer to the COM interfaces, you could be sure that they'd be valid for the lifetime of the browser. It also meant you could easily attach to the events fired by the browser through the DWebBrowserEvents2 COM interface.

Then along came the combination of IE 7 and Windows Vista. In and effort to reduce the attack surface presented by malicious web sites, IE 7 introduced something called Protected Mode, which leveraged Mandatory Integrity Control in Windows Vista to prevent actions initiated IE, usually initiated by JavaScript, from being able to access the operating system the way it could in prior releases. While this was generally a welcome development for most users of IE, it created all manner of problems for automating IE.

When you cross into or out of Protected Mode by, say, navigating from an internal intranet website to one on the internet, IE has to create a new process, because it can't change the Mandatory Integrity Control level of the existing process. Moreover, in IE versions after 7, it's not always obvious that a Protected Mode boundary has been crossed, since IE tries to present a better user experience by seamlessly merging the browser window of the new process with the already opened browser window. This under-the-covers process switching also means that any references pointing to IE's COM objects before the Protected Mode boundary crossing are left pointing to objects that are no longer used by IE after the boundary crossing.

How do I fix it?

One way to solve the problem is to either turn User Access Control (UAC) off, or elevate your privileges to an administrator when running WebDriver code, because all process started by elevated users have a High Mandatory Integrity Control level. Elevating to administrative privileges was not a great option, because the process of elevating couldn't and shouldn't be automated. Similarly, turning off UAC is unacceptable as it leaves the machine in a vulnerable state. Nevertheless, in pre-2.0 versions of the IE driver, that's exactly what one had to do to get code working, and it was one of the primary motivations for rewriting the IE driver in 2010.

Since the tricky bit to solve is when Protected Mode boundaries are crossed, a design decision was made to eliminate the boundary crossings. The simplest way to do that is to change the Protected Mode settings in the browser to be the same, either enabled or disabled, it doesn't matter which, for all zones. That way, it doesn't matter what navigation occurs, it won't cross the Protected Mode boundary, and won't trigger orphaning of the COM objects the IE driver relies on. Moreover, setting the Protected Mode boundaries for the security zones are per-user settings in Windows, and don't generally require elevated privileges to set them.

So what's with the Capabilities hack?

When the rewritten IE driver was first introduced, it was decided that it would enforce its required Protected Mode settings, and throw an exception if they were not properly set. Protected Mode settings, like almost all other settings of IE, are stored in the Windows registry, and are checked when the browser is instantiated. However, some misguided IT departments make it impossible for developers and testers to set even the most basic settings on their machines.

The driver needed a workaround for people who couldn't set those IE settings because their machine was overly locked down. That's what the capability setting is intended to be used for. It simply bypasses the registry check. Using the capability doesn't solve the underlying problem though. If a Protected Mode boundary is crossed, very unexpected behavior including hangs, element location not working, and clicks not being propagated, could result. To help warn people of this potential problem, the capability was given big scary-sounding names like INTRODUCE_FLAKINESS_BY_IGNORING_SECURITY_DOMAINS in Java and IntroduceInstabilityByIgnoringProtectedModeSettings in .NET. We really thought that telling the user that using this setting would introduce potential badness in their code would discourage its use, but it turned out not to be so.

Let me state this now in very clear and unambiguous terms. If you are able to set the Protected Mode settings of IE, and you are still using the capability you are risking the stability of your code. Don't do it. Set the settings. It's not that hard.

How to set Protected Mode settings

In IE, from the Tools menu (or the gear icon in the toolbar in later versions), select "Internet options." Go to the Security tab. At the bottom of the dialog for each zone, you should see a check box labeled "Enable Protected Mode." Set the value of the check box to the same value, either checked or unchecked, for each zone. Here's the dialog for reference:

Note that you don't have to change the slider for security level, and you don't have to disable Protected Mode. I routinely run with Protected Mode turned on for all zones, as I think it provides a more secure browsing experience.

Wednesday, July 25, 2012

WebDriver: Y U NO HAVE HTTP Status Codes?!

There's a long-standing issue in the Selenium issue tracker dealing with the fact that the WebDriver API does not expose HTTP status codes to the user. For those of you who like to keep score at home, it's issue #141 in the tracker. The issue was opened in February, 2009, and was closed as "Won't Fix" in December of that year. Despite the finality of many project contributors and the main project architect saying that this feature will not be made available, it has continued to garner comments, many vehement, for reconsideration. I'm going to try to spend a little bit making what I hope is a reasoned, rational argument why this decision was made, and why the feature isn't needed in the project.

How Did We Get Here?

What follows is an oversimplified brief recap of some history. In the beginning was Selenium RC, which was an API that grew organically during its existence, with no rhyme or reason for how methods were tacked onto its single object. Over time, it became a brilliant example of the God object anti-pattern, violating all kinds of object-oriented programming principles. One of the more obscure methods engendered by the organic growth of the project was called captureNetworkTraffic(). This method purported to capture all of the network traffic between the browser and the site being automated, and was made possible because in some browser configurations, Selenium acted as a proxy between the browser and the site being automated, thus all browser traffic passed through Selenium, and could be captured or manipulated.

And so it came to pass that Selenium RC was found wanting, and lo, there was much weeping and wailing and gnashing of teeth in the browser automation community. And thus it was that the WebDriver project was born, and eventually merged into the Selenium project to become Selenium WebDriver. WebDriver was a completely different approach to browser automation, preferring to act more like a user, which solved the fundamental problems inherent in the Selenium RC approach. However, since much of the actual driving of the browser would now be done external to the browser itself, and with no proxy in between the browser and the site being automated, it made creating a method similar to captureNetworkTraffic() difficult at best, and impossible at worst.

Where Are We Now?

This brings us to the current state of affairs, with HTTP status codes being unavailable in the WebDriver API. An architectural decision was made during the creation and development of the WebDriver API that this feature would not be implemented, and that it would be declared "out of scope" for Selenium WebDriver. This has caused some consternation among users, especially since the issue has been closed, and it's been said that the feature won't be implemented, and been said so in extremely plain, even blunt, terms. There are some valid technical reasons why this decision was made. Here are just a few of them:

What about redirects? Do you just return the last code after the redirect, or the code that indicates a redirect? What do you say to those who disagree with your answer to that question?
Some browsers make it impossible to get the status code. Do you really want an API that works on some browsers, but not others?
WebDriver is concerned with driving the browser, not necessarily just web application testing. Does returning HTTP status codes really fit this mission?
WebDriver is concerned with driving the browser as a user would. Does a browser show HTTP status codes to the user, or just rendered pages?

Nevertheless, that hasn't stopped people from arguing that it should be included. Let's take a few minutes to examine some of those arguments, and then we'll take a look at what options there are for solving this issue.

"HTTP status codes are an important part of website testing."

Yes, I can see the argument that HTTP status codes could be an important part of testing your website. However, while web application testing is an important use case of the Selenium project, it's far from the only one. A frequent response to this is, "But your own web pages and literature say it's for web application testing!" That's a fair critique about documentation and public perception of the project, but that's really a separate discussion. HTTP status codes are not a required part of automating the browser.

"I don't care that it doesn't work on all browsers, I need those status codes."

One of the major advantages of the WebDriver API over what has come before is its elegance and purity. Implementing a solution that works in only some browsers is vastly inferior to one that works for all browsers, especially when the solution isn't in the core competency of the library. There are solutions that do work for all browsers without polluting the API, albeit they require integration with other tools that <gasp> are not Selenium. Tacking on this feature for some browsers in a suboptimal way that might miss important edge cases because it's not a core competency is akin to driving a wood screw into a board with a hammer because that's the only tool you have. Yes, it'll work, but it's not going to be pretty, and is likely to fail you somewhere down the line.

"Other parts of the WebDriver API let you do things a user can't do, why not expose status codes?"

Proponents of this argument usually cite manipulation of cookies, or finding of elements by anything other than visual inspection, or viewing a page's HTML source, or any number of other items. All of those items are germane to interacting with the page itself, with what is displayed. The HTTP status code is not directly concerned with what is displayed on the page.

"But so many people want the feature, you should really add it."

Ah, yes, the old, "But I want a pony," argument. Or it's corollary, "20 million New Kids On The Block fans can't be wrong!" This argument is occasionally followed by a sometimes hostile, "Well, your lack of this feature makes your library completely useless to me," or even, "You'd better add it or else I'll stop using it!" This latter response is the equivalent of, "I'm going to hold my breath until you give me exactly what I want!" Just because something's popular doesn't make it a good idea.

Where Do We Go From Here?

Just as proponents of wanting to see HTTP status codes in the WebDriver API are convinced that the arguments against including it don't hold water, the members of the project team are equally convinced that adding it is a bad idea. So if you feel like you absolutely need to have them, what can you do? Well, you have a couple of options.

Remember how I said earlier that Selenium RC was able to give you this information because it acted as a proxy? You can recreate that exact same environment with WebDriver! Of course, you have to use a dedicated proxy to do it, but guess what? There are lots of software proxies around that make this really easy. Additionally, you're using the proxy to capture the traffic, not something half-baked that's been shoehorned into the WebDriver API, which is a little thing I like to call using the right tool for the job. As an example, the BrowserMob proxy is one that's being used successfully by lots of people. It's open-source, and it's written in Java, so if you're using Java, you can even control it directly from your existing code. If you're not using Java, fear not, as there are wrapper libraries written for many different languages, including Python, .NET, Ruby, and PHP, to name a few. The WebDriver library even allows you to set the browser you're automating to use the proxy.

"But I don't want to use a proxy! I only want to have to manage Selenium as a dependency," I hear you say. I've got a little secret for you: WebDriver is Open Source Software. It's even very liberally licensed. If you don't like a decision made by the project team, fork it, create a patch, and share with the world. I personally love innovation in the Open Source world. Don't tell me how you want to see it done, show me, with working code. Unless you can demonstrate your solution working for all browsers though, I'll probably use a proxy if I need this functionality, and that's my choice.

Thursday, June 28, 2012

Whither NuGet and the WebDriver .NET bindings

It happens every so often. Someone is trying to use the WebDriver .NET language bindings, and it happens that their project also uses the excellent Json.NET library created by James Newton-King. They decide they want to use NuGet for managing their dependencies. Everything moves along swimmingly until there's an update to the Json.NET library. NuGet dutifully updates the dependency, but when they go to run their WebDriver tests, and everything blows up. After a cursory investigation, they see that the NuGet package for the Selenium WebDriver .NET bindings have a dependency on the Json.NET package, but the package declares the dependency on a specific version of Json.NET. When people see this, they inevitably point out that there must be an error in the authoring of the Selenium.WebDriver package. Hopefully, I'll be able to explain exactly why this isn't an error in the package authoring, and why naive proposed solutions won't solve this.

First, a little background. The build process of the .NET bindings is integrated with the build process for the other portions of the Selenium project. This means the released binaries aren't built through Visual Studio, and aren't built via MSBuild. The Selenium build script uses Rake as it's build engine, via the Albacore project. We have some unique requirements that don't allow us to just build the .csproj file and be done with it. As part of the build process, we need to embed the outputs of other targets (which are not built via Visual Studio or MSBuild) as resources into the .NET assembly. To accomplish this, we end up calling csc.exe with the appropriate command-line switches. This is a nicely stable build process, follows the patterns of the build for the other portions of the Selenium project, and doesn't require hand-development and maintenance of a custom MSBuild project file.

As part of the build process, we also sign the assembly to give it a strong name. This was a feature request of users who wished to reference the WebDriver .NET bindings from a project that also had a strong name. You see, when you create a strong-named assembly, any assemblies that you reference must also be strong-named. Part of the strong name of such an assembly is its version number. When your strong-named assembly tries to resolve the reference to another strong-named assembly, but the only copy you have of the referenced assembly is of a different version, the .NET Framework cheerfully tells you, "I don't have a copy of the assembly you're looking for." Which you don't, because the name of the assembly (which contains the version information because it's a strong name) doesn't match.

Complicating this behavior is that if your assembly is unsigned, it is free to reference assemblies with and without strong names. Furthermore, the unsigned assembly will load referenced assemblies matched by file name only, happily ignoring the version information in the referenced assembly. Since many people don't bother giving their assemblies strong names, they never have to think about the versions of the assemblies they reference.

I'm sure those of you who've encountered this before are already far ahead of me. Both the WebDriver .NET bindings assembly and the Json.NET assembly have strong names. That means that the WebDriver assembly can use only the version of the Json.NET assembly that it was compiled against. When building the NuGet package for the .NET bindings, we must restrict the packages we reference to the exact versions we compile against, otherwise the .NET bindings won't be able to load the referenced assembly if NuGet downloads the latest version of Json.NET, and it's not the version we've compiled against. Of course, that's a problem if your project is unsigned and also requires a reference to Json.NET, and your reference isn't similarly scoped with respect to version, because now you have a version conflict. This is why simply changing the version specification for the Json.NET reference in the WebDriver .nuspec file to be a "greater than or equal to version x" vs. "be exactly x" won't solve the problem.

Some people suggest that migrating the project to use NuGet to manage its dependencies is a viable solution to this problem. Given all that we already require to successfully build the project, I remain unconvinced that adding one more technology that prospective contributors have to know about is not the friendliest approach. This is particularly so if the prospective contributor in question has no need or desire to use NuGet in his or her own projects. Getting people to engage and contribute is hard enough without throwing other roadblocks in their way (but that's another rant for another day).

It should be noted that this is a well-known issue with creating NuGet packages with strong-named assemblies. As far as I'm aware, no one has come up with a good solution. The most promising solution would be to change the release of the WebDriver assemblies to be unsigned. However, this would amount to a change in functionality for anyone who is referencing them from a signed assembly, and would likely frustrate them just as much as those who might be frustrated by the current state of affairs in the NuGet package.

Update 6 July 2012: Apparently, James Newton-King, the author of the Json.NET library, has been experimenting with a solution to this problem. I'll be modifying the .NET bindings NuGet package to take advantage of this soon. There's still a fundamental design flaw in the mismatch between packaging and strong-named assemblies, but for a widely-used, frequently-updated library like Json.NET, this is likely the best solution.

Friday, June 22, 2012

What's Wrong With the Internet Explorer Driver?

The Internet Explorer driver in the Selenium WebDriver project has consumed far too much of my life over the last two years, which is when I first undertook investigating rewriting the driver to repair some of its shortcomings. During that time, I've learned far more about COM programming in the C/C++ world than I'd ever known before, probably far more than I ever cared to know. The driver has been in widespread use as part of the regular Selenium releases for nearly 18 months now, and I think it's time to take stock once and for all about the currently known issues with so-called "native events" used to interact with elements in the IE driver.

My purpose in bringing this up is twofold. First, it's a good place to acknowledge where we still have to come in order to make the IE driver as good as possible for the users of Selenium. Secondly, it gives me the chance to reiterate the open invitation that I've always had for anyone who'd like to review the code of the IE driver, make improvements, and submit patches. Before launching into these challenges, I need to cover a few basic assumptions about why the driver is architected the way it is.

There are a few main principles of the WebDriver project that directly impact the decisions of how the IE driver is built. They are, in roughly priority order:

The driver should be installable with an xcopy deployment mechanism, and be capable of execution without elevating to admin permissions. Sadly, a significant percentage of the users of Selenium do not have admin access to their Windows machines, which precludes the use of a plugin (Browser Helper Object or BHO) for IE, as these require admin access to the registry at the very least.
The driver should use "native events" to interact with elements. This means using OS-level mechanisms to simulate user keyboard and mouse inputs. These would be contrasted by "synthetic events" which are JavaScript simulations of those inputs, which have challenges with accuracy of simulation and fidelity of operation.
The driver should not require the browser instance being automated to be the focused window in the OS. It is expressly a goal of WebDriver that developers running WebDriver code will be free to use their machines for other purposes even while that code is running in the background.

Given these three requirements then, there are two major problems with using native events in the IE driver. The first of these is that mouse clicks to the IE window get swallowed up when the browser window does not have the system focus. In this case, the element in question has a focus rectangle around it, but no click appears to have happened.

Screenshot of an element click into a background IE window

Let me say a brief word about how native events are done in the IE driver. The driver currently works by sending Windows messages to the IE window being driven. We know this is not necessarily the best approach, and that the "correct" way to simulate input for the IE window would be using the SendInput() API. However, this would require the browser window to have focus, which opens up all manner of issues, particularly if you have two InternetExplorerDriver instances attempting to manipulate pages at the same time.

The "flashing hover" problem

The second issue is with mouse hovers using native events. The symptoms of this problem are that when you execute a mouse hover using the Actions class of the IE driver, and your physical mouse pointer is within the bounds of the browser window, the menu will flash and immediately disappear. As near as I've been able to determine, IE is doing some sort of hit-testing (probably calling the WindowFromPoint() API) when it receives a WM_MOUSEMOVE message which redraws the canvas if the location of the physical cursor is detected inside the window boundaries.

Solving the above two problems would go a long way toward the long-term stability of the IE driver. Alternatively, if the IE team at Microsoft took over the maintenance of the driver, as the teams responsible for Chrome, Opera, and soon Firefox have done, then the true experts in the architecture of the Internet Explorer browser could bring all of their experience and expertise to bear, and help make Internet Explorer a first-class citizen in the world of automated web testing.

In fact, that would be my challenge to the IE team at Microsoft: step up, and contribute code to the IE driver. Take the existing code, and do something great and awe-inspiring with it. Become a leader in this space, instead of the lagging follower you currently are.