Hacker News

Wireshark is the proverbial hammer that makes all networking problems look like nails. Even if there's a more specific tool available there's a good chance I can swing Wireshark at the problem and figure it out.

It continues to blow my mind how many people there are in the world who consider themselves networking professionals but have never used or do not understand Wireshark. It is possibly the most important tool for actually understanding what's really happening in your network; without it you're effectively blind to so many things.

Just yesterday I used it to troubleshoot a weird behavior in a recently upgraded Asterisk/FreePBX system which would have probably taken me days to guess my way through without packet captures, but with them I was able to see clearly what was happening on the network and then track that back from there.

Congrats to everyone on the Wireshark team on 25 years of making network troubleshooting infinitely easier! I would 100% not be where I am today without it.



It's like a debugger for networking, and surprisingly many programmers don't know how to use debuggers either.


It's more like strace but yeah.


And a lot of programmers don’t know about strace either


I can confirm, I didn't know about strace until this very moment. Looking at it, it basically only intercepts system calls? How often is that useful? What do people use it for?


It answers, or at least gives the definitive first clue behind, a huge number of slowdowns or apparent hangs, given how many of those are actually blocking resource waits or retry loops gone mad.

It's probably most useful to sysadmins working with binaries, but even if you do have the source, it's usually a shorter path to the solution for any app/os interaction problem.

It's useful for certain classes of optimisation and tuning, because it will give timings and aggregate timings.

I'll use it for things as simple as "where is this program reading its config files" - often useful when doco is poor and/or there are multiple config locations selected by conditional logic.

There's an "ltrace" as well, for shared-library tracing, although I've personally found that less useful - the bugs it shows are more likely to be code/logic problems rather than os/infrastructure interaction - which is to say, usually outside my job scope.

On commercial unix, the equivalent to strace is truss, and it's been around forever.

Like many, Wireshark and strace/truss are my go-to tools for a huge amount of troubleshooting.
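For the timing/tuning use specifically, a quick sketch of the flags involved (./myprog is a placeholder for whatever you're investigating):

```shell
# per-call timings: -T appends the time spent inside each syscall,
# -r prefixes a relative timestamp between consecutive calls
strace -T -r ./myprog

# aggregate view: -c prints a summary table (time, calls, errors
# per syscall) instead of the call-by-call log
strace -c ./myprog
```

The -c summary is the one I reach for first when something is "mysteriously slow" - one glance tells you whether the time is going into reads, polls, or something else entirely.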


Program shits itself randomly during execution, crashes, and doesn't tell you why. strace it. Oh, it turns out it's trying to execve() a binary that doesn't exist on the system, which wasn't documented as a dependency, so I didn't install it. Fixed.

Lots of little things like that. Why is this program acting slow at startup when it should be fast? Oh, because it's opening and timing out on a socket connection with an unusually long timeout. Et cetera...
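A minimal sketch of that kind of hunt (the program name is a placeholder and the commented output lines are illustrative, not real output):

```shell
# follow forks (-f), trace only process creation and network
# connects, and time each call (-T); failed calls show their
# errno right in the output
strace -f -T -e trace=execve,connect ./flaky-prog
# the culprit tends to jump out, e.g. something like:
#   execve("/usr/bin/helper", ...) = -1 ENOENT (No such file or directory)
#   connect(3, ...) = -1 ETIMEDOUT (Connection timed out) <30.000123>
```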


The most fun I've had with strace was debugging a 3-process deadlock. An snmp daemon was blocked waiting for a cli child process to finish, the cli was waiting for a response to a message on a socket it had open with a routing protocol daemon, which was waiting for a response from the snmp daemon.
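For already-stuck processes like those three, attaching with -p is the usual move (1234 is a placeholder PID):

```shell
# attach to a running process and see what it's blocked on;
# -f also follows any children it has forked
strace -f -p 1234
# a deadlocked process typically shows one pending call, e.g.:
#   wait4(5678, ...     blocked waiting on a child
#   recvfrom(7, ...     blocked waiting on a socket peer
```

Doing that to each process in turn is how you piece together a cycle like the one above.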

It is also a great way to figure out why programs without useful debug output die. Ie. after a program opens and reads a config file it doesn't like, it starts cleaning up and exits.


I recently fired it up to quickly check which headers a crosscompiler used on a specific compilation unit. strace, grep, sort, done. I also use it as first check if something seems to hang. Sometimes you can see lock files trying to be acquired or access to wrong paths.
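Roughly what that pipeline looks like (the cross-compiler name and source file are placeholders; strace logs to stderr, hence the redirect):

```shell
# which headers does this compilation unit actually pull in?
strace -f -e trace=openat arm-linux-gnueabi-gcc -c foo.c 2>&1 \
  | grep -o '"[^"]*\.h"' \
  | sort -u
```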


Just recently I found something performing way better for that task: https://brendangregg.com/blog/2014-07-25/opensnoop-for-linux...


`strace -f -eopen,openat` to see which files a program opens. Very often useful, even if just to check which config file(s) a program reads.


I've used it on occasion to try and find why some app was erroring; typically the app would catch an exception (or ALL of them), and just die with something like "no."

For example if the open call fails with ENOENT, the file it's looking for doesn't exist, and strace will (also) tell you what file it's trying to open.


I use it constantly. I can't even imagine debugging some failures without strace. It's great for servers that don't log things but fail to load some config file (which can be debugged by inspecting the return codes of open() calls).
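If your strace is new enough (5.2+), -Z filters the output to failing calls only, which is a shortcut for exactly this case (the server name is a placeholder):

```shell
# show only syscalls that returned an error - failed config
# lookups jump straight out of the noise
strace -Z -e trace=open,openat ./quiet-server
# e.g. a line like:
#   openat(AT_FDCWD, "/etc/quiet-server.conf", O_RDONLY) = -1 ENOENT
```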

There is also ltrace, for library calls, although I find it less useful.


It lets you see what a program is doing, where doing means any “effects” of the program that touch the system. Want to see what's happening to files? You can strace for certain operations to certain files. Weird shitting the bed involving IO? strace will illuminate the problem.


As well as all the other uses, it can be great for a quick way to see what files a program is trying to access. E.g. some undocumented binary where it's not clear where its config should be, strace will quickly show what it tries to access.


it's useful when everything else you've tried failed and you have no clue what is going on. it's extremely helpful in figuring out why a program hangs or crashes. it's a good tool to have in the toolbox


Do you have any favorite sources for really understanding Wireshark? I’m not a networking professional per se, but I’m network-adjacent and I’ve dabbled in Wireshark from time to time. I can see the power, but it’s also one of those tools that’s totally overwhelming when I first approach it unless I have a very small, very specific problem. Or is it one of those tools that you learn as you need it?


There’s a million little features and tricks you can do, but you’ll never stumble into them unless you’re actively googling “how do I …”.

You might look for some pcap based CTFs with walkthroughs to get exposure to some of the more unique things you can do.

Just letting it run for a few min on your router and then powering a device up can also yield some interesting captures…
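A minimal tshark version of that experiment, if you'd rather capture headless and open the result in Wireshark afterwards (the interface name and output path are placeholders):

```shell
# capture everything a freshly powered-on device says for 5 minutes,
# then stop automatically and write a file Wireshark can open
tshark -i eth0 -a duration:300 -w /tmp/boot.pcapng
```

Watching what a "dumb" appliance broadcasts on boot is a surprisingly good first Wireshark exercise.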


Unfortunately I can't really help there, I'm a "learn by doing" type of person who just jumps in the deep end and hopes he figures out how to swim.

Most of my learning was just "capture the problem happening, capture what happens when it works right if possible, open up the relevant RFCs, then try to understand what's different and why."

I work in the VoIP industry so I'm dealing with a lot of NAT problems (insert rant here about lazy ISPs that still haven't enabled IPv6 on their networks) and my main protocol (SIP) is heavily inspired by HTTP and as a result is more or less human readable plaintext, so it was a relatively easy learning curve to just have Wireshark open on one side of the screen and the relevant RFCs on the other side.
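For the curious, pulling the SIP signalling out of a saved capture looks roughly like this (the filenames and Call-ID are placeholders):

```shell
# read a saved capture and show just the SIP messages
tshark -r calls.pcap -Y sip

# narrow to a single call by its Call-ID header
tshark -r calls.pcap -Y 'sip.Call-ID == "abc123@host"'
```

Because SIP is plaintext, the decoded view in Wireshark is nearly identical to what's on the wire, which is what makes the side-by-side-with-the-RFC workflow so effective.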

All I can really say is have a problem you want to solve and start from there.


Sounds reasonable to me, thanks. That’s how I always end up learning, but sometimes I wonder if there’s a better way.


> Just yesterday I used it to troubleshoot a weird behavior in a recently upgraded Asterisk/FreePBX system which would have probably taken me days to guess my way through without packet captures

Do you mind sharing with us what was the problem and how you solved it with packet captures, if you have time? A blog post would be very interesting too.


For something like this Firefox bug [1], getting down to pcaps helps determine where the problem is. A client spinning on a request that the server doesn't know about could be a server problem, a client problem, or a network-in-the-middle problem.

In this case, the problem was the client wasn't actually sending the request, and with a sizable request that's visible even without decoding the https; although to be totally clear on what was happening, decoding was needed.

I've also debugged issues in remote networks where, iirc, connections were being reset by some equipment local to the user. Seq/ack sequencing showed the resets were in response to a specific client-sent packet, and the timestamps showed it was impossible for that to have come from anywhere but equipment near the user.

For this bug [2], it took a lot of luck and patience to get a good capture, but once I did, the immediate problem became obvious: the machine I controlled was getting an icmp needs frag but DF set at the same mtu it was already using, and responding by sending the whole sendqueue at once, packetized to the new MTU that was the same as the old one. There's actually three problems here: a) there's no reason for the other side to send this packet (I found this is an already fixed linux bug with forwarding and large receive offload, but no way to contact the administrator of that router), b) our side shouldn't resend the whole sendqueue when the mtu changes, c) if the mtu didn't change, then there's no need to take any action. We only fixed c, but that solved the major problem: these resends would trigger more resends and we'd have periods of unavailability as the network was really busy.
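If anyone wants to hunt for the same symptom, those needs-frag messages are easy to isolate in a capture (the filename is a placeholder; icmp.mtu is the advertised next-hop MTU field):

```shell
# ICMP type 3 code 4 = "fragmentation needed and DF set";
# print who sent it and the MTU it advertised
tshark -r cap.pcap -Y 'icmp.type == 3 && icmp.code == 4' \
  -T fields -e ip.src -e icmp.mtu
```

If the advertised MTU matches the path MTU you're already using, you're likely looking at the same broken-middlebox pattern described above.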

This is pretty common when looking at wireshark; unless you work somewhere with full control of all clients and servers and a very network aware developer team, you're going to find lots of non-optimal or semi-broken stuff, and you've got to ignore it and focus on the majorly broken bit.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1740856

[2] https://reviews.freebsd.org/rS288412


I had to troubleshoot an issue where several network routers restarted as a group, without any apparent cause, but only when connected to the big WAN. The problem was a network discovery tool which, when poorly configured, would send SSH connection attempts to the management interface, and a bug in the specific firmware would crash the router.

A network capture was the only good clue.


I'm not much of a blogger, but here's the short version. If anyone happened to be on #freepbx yesterday morning they might already have seen this.

I had just upgraded and migrated one of my clients from an on premise FreePBX system that was a few years out of date and running on a repurposed desktop computer with a failing fan to a brand new instance running on a VPS. Everything was working fine with basic phone functionality, but their main ring group was taking a few seconds to stop ringing when answered. Calls would ring in to all phones effectively simultaneously as expected, but when someone answered the call certain phones kept ringing for almost four full seconds after that point.

In the past I had seen similar behaviors on AT&T DSL caused by their mandatory modem/router device having an anti-flood filter enabled by default which saw a bunch of nearly identical UDP packets hitting at once and dropped them after the first few. This site has cable internet through a dumb modem so I knew it wasn't that, but they had recently had their IT side taken over by a new company who put in a new firewall so that was a plausible answer.

Their IT however had been taken over from us so I wasn't about to go accusing them of getting it wrong without strong evidence. I'm also just that kind of person, I hate when someone blames me or my gear for problems we're not causing so I do my best to never be that guy either. I'll waste an extra few hours of mine any day of the week to be sure I'm not accusing someone else of getting it wrong without a reason.

I fired up sngrep on the server, waited for a call to come in, and saved all the SIP sessions that resulted. Download that file, load it up in Wireshark, and I see that while the INVITE messages to start ringing all went out more or less simultaneously (27 phones in ~5ms) the CANCEL messages that stop them from ringing once one answered were sent out sequentially, with the PBX waiting for the first one to respond and confirm it had stopped ringing before sending the next. Clearly this wasn't right, and it obviously wasn't a problem with the firewall either.
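For anyone wanting to reproduce that kind of timing analysis from a saved capture, a rough tshark one-liner (the filename is a placeholder):

```shell
# relative timestamp, method, and target for every INVITE/CANCEL,
# to see whether they go out in a burst or one at a time
tshark -r calls.pcap \
  -Y 'sip.Method == "INVITE" || sip.Method == "CANCEL"' \
  -T fields -e frame.time_relative -e sip.Method -e sip.r-uri
```

Sorted by timestamp, a burst of INVITEs followed by slowly spaced CANCELs is exactly the sequential-vs-parallel pattern described above.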

At that point I started looking at the Asterisk logs and saw that an AGI script was being run for each line that was ringing which wasn't there previously. That script was associated with a new FreePBX module for missed call notifications which was installed but unconfigured on the new server. It didn't indicate it was doing anything in the UI, but it sure seemed to be doing something in the logs.

I uninstalled that module and the next call all the CANCEL messages went out in ~5ms just like the INVITEs. I then filed a bug with FreePBX documenting what happened because I'm pretty sure it's not expected or desired for simply having that module installed to cause massive delays in ring groups.

---

In this case the packet captures demonstrated conclusively that the problem was on the server itself and not in the network. If the capture at the server had looked reasonable my next step would have been to have the IT vendor capture traffic on their firewall at the same time as I was capturing at the server so we could compare and see if it's getting messed with along the way, but here it was not necessary.

Like toast0 mentioned, captures help you narrow down where the problem is.


Forget something as specific and hardcore as networking. I had a week to build a nodejs poc of a legacy spring/java app/service in Amazon that was doing a bunch of service-to-service auth with some Tibco messaging. I couldn't find any open implementations of Tibco clients (around 2013) and the frugality leadership principle meant getting an official spec would be almost impossible. I just needed a few details of the packet structure on a couple of requests. You can guess which tool saved the day for me here! The Principal Eng at the time was surprised such a tool existed!


Fiddler is awesome too.


Fiddler classic is what I always go for.



