//Cloud notes from my desk -Maheshk

"Fortunate are those who take the first steps.” ― Paulo Coelho

[ServiceFabric] How to change/reset RDP password for Service Fabric VMSS/VMSS instances using Powershell

Today one of my colleagues asked me this question on behalf of his customer. I had never tried it before, but I knew that resetting the password was not straightforward from the Azure portal. After searching my emails, I found a PS script that had been recommended in the past. I was curious to test and share it, so I quickly deployed a cluster and verified it. It worked.

# Sign in to Azure (AzureRM module)
Login-AzureRmAccount
$vmssName = "mltnnode"
$vmssResourceGroup = "jailbird-SF-RG"
# The user name goes into the public settings, the new password into the protected settings
$publicConfig = @{"UserName" = "mikkyuname"}
$privateConfig = @{"Password" = "newpass@1234"}
$extName = "VMAccessAgent"
$publisher = "Microsoft.Compute"
# Add the VMAccessAgent extension to the scale set model, then push the updated model
$vmss = Get-AzureRmVmss -ResourceGroupName $vmssResourceGroup -VMScaleSetName $vmssName
$vmss = Add-AzureRmVmssExtension -VirtualMachineScaleSet $vmss -Name $extName -Publisher $publisher -Setting $publicConfig -ProtectedSetting $privateConfig -Type $extName -TypeHandlerVersion "2.0" -AutoUpgradeMinorVersion $true
Update-AzureRmVmss -ResourceGroupName $vmssResourceGroup -Name $vmssName -VirtualMachineScaleSet $vmss


For Linux:- https://azure.microsoft.com/en-us/blog/using-vmaccess-extension-to-reset-login-credentials-for-linux-vm/

PS:- Allow a few minutes for this VMSS instance update to go through. You can navigate to VMSS > Instances to check that the update is over and the instances are in the “running” state; after that you can start RDP with your new password.


2017-08-21 Posted by | Powershell, ServiceFabric | | Leave a comment

[Azure Service Fabric] Five steps to achieve Event aggregation and collection using EventFlow in Service Fabric

Monitoring and diagnostics are a critical part of application development, for diagnosing issues at production or development time. They help one easily identify application issues and hardware issues, and provide performance data that guides where to improve. The workflow has 3 parts: 1) Event Generation 2) Event Aggregation 3) Analysis.

1) Event Generation -> creation and generation of events & logs. Logs could be infra level events (anything from the cluster) or application level events (from the apps and services).

2) Event Aggregation -> generated events need to be collated and aggregated before they can be displayed

3) Analysis -> the aggregated data is visualized in some format

Once we decide on the log provider, the next phase is aggregation. In Service Fabric, event aggregation can be achieved by using (a) Azure Diagnostics logs (agent installed on the VMs) or (b) EventFlow (in-process log collection).

Agent based log collection is a good option if our event sources and destinations do not change and have a one to one mapping. Any change would require a cluster level update, which is sometimes tedious and time consuming. In this approach, the logs are first stored in a storage account and then go to the display phase.

With EventFlow, in-process logs are sent directly to a remote service or visualizer. Changing the data destination doesn’t require any cluster level changes, unlike the agent based approach. We can change the data destination at any time from the file eventFlowConfig.json. Depending on the criticality, we can use both if required. However, Azure Diagnostics logs are recommended mostly for infra level log collection, whereas EventFlow is suggested for application level logs. The last step is Event Analysis, where we analyze and visualize the incoming data. Azure Service Fabric has good integration support for OMS and Application Insights.

In this article, let us see how one can easily use EventFlow in a Service Fabric Stateful application in 5 steps.

Step1:- Create a new Service Fabric project by selecting the “Stateful Service” application. Please change the .NET version of the project to 4.6 or above.

Step2:- Right click the project to add NuGet packages. Search for “Diagnostics.EventFlow” and then add the following packages.

    Microsoft.Diagnostics.EventFlow.ServiceFabric
    Microsoft.Diagnostics.EventFlow.Outputs.ApplicationInsights
    Microsoft.Diagnostics.EventFlow.Input.EventSource


Step3:- Update the eventFlowConfig.json file. The EventFlow pipeline reads this JSON file to send the data, so it needs to be modified to capture the data you want and to point to the desired destination.
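
To give an idea of what such a file might contain, here is a minimal sketch, written in Python only to generate the JSON. The layout (inputs, filters, outputs, schemaVersion) is assumed to follow the sample in the EventFlow README; the provider name and the instrumentation key are placeholders (assumptions), not values from the original project.

```python
import json


def build_eventflow_config(provider_name: str, instrumentation_key: str) -> dict:
    """Return a minimal EventFlow configuration as a Python dict.

    provider_name: assumed to match the Name of the service's EventSource.
    instrumentation_key: placeholder for the Application Insights key.
    """
    return {
        # Where events come from: the service's own EventSource.
        "inputs": [
            {"type": "EventSource", "sources": [{"providerName": provider_name}]}
        ],
        # Optional filter: drop the most verbose events before sending.
        "filters": [
            {"type": "drop", "include": "Level == Verbose"}
        ],
        # Where events go: Application Insights.
        "outputs": [
            {"type": "ApplicationInsights", "instrumentationKey": instrumentation_key}
        ],
        "schemaVersion": "2016-08-11",
    }


# Hypothetical values, for illustration only.
config = build_eventflow_config("MyCompany-Application1-Stateful1",
                                "<your-instrumentation-key>")
config_text = json.dumps(config, indent=2)
print(config_text)
```

Changing the destination later should only mean editing the "outputs" section of this file; the cluster itself is not touched.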


Step4:- Update the “ServiceEventSource.cs” class. We need the name of the service’s ServiceEventSource, which is the value of the Name property of the EventSource attribute set on this class.


Step5:- Instantiate the EventFlow pipeline in our service startup code and start writing service messages.


Deploy the application and confirm that everything is green, with no deployment or dependency issues.


To verify the trace logs, log into portal.azure.com > your Application Insights resource > Search and refresh (allow a few minutes for the data to start flowing here).


Reference articles:-

https://github.com/Azure/diagnostics-eventflow

https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-diagnostics-event-aggregation-eventflow

2017-07-11 Posted by | .NET, Azure Dev, C#, LogCollection, ServiceFabric, VS2017 | | Leave a comment

[Azure Service Fabric] How to launch Local cluster manager without Visual Studio

Recently I had this problem where my local cluster manager disappeared from the system tray without a clue. I tried loading Visual Studio and running an SF project to bring it back, which is time consuming for a simple demo.

Problem:- Is there a way to launch only the cluster manager without Visual Studio?

Answer:- Yes, we can do that by simply running “ServiceFabricLocalClusterManager.exe” from the Run prompt.


To get this exe, we need to install the SDK from http://aka.ms/ServiceFabricSDK and navigate to the path “C:\Program Files\Microsoft SDKs\Service Fabric\Tools\ServiceFabricLocalClusterManager.exe”. Btw, ServiceFabricLocalClusterManager.exe is required to manage our local dev cluster.

2017-06-14 Posted by | .NET, Azure, ServiceFabric, VS2017 | | Leave a comment

[Azure Service Fabric] Use of EnableDefaultServicesUpgrade property

Recently I had this issue where a Service Fabric application upgrade failed to deploy as expected after changing the instance count in Cloud.xml. Here is what I tried and the error I received.

Problem:-

  1. Create a stateless project with the latest Azure Service Fabric SDK 5.5
  2. Deploy first with Stateless1_InstanceCount set to -1 (default)
  3. Now set Stateless1_InstanceCount to, say, 2 in Cloud.xml and redeploy with the upgrade option checked

While publishing this upgrade from Visual Studio, the error said that a property value was expected to be “true”, but I had no clue about it at first glance.

Visual Studio error:-

1>------ Build started: Project: Application3, Configuration: Debug x64 ------
2>------ Publish started: Project: Application3, Configuration: Debug x64 ------
2>Started executing script 'GetApplicationExistence'.
2>Finished executing script 'GetApplicationExistence'.
2>Time elapsed: 00:00:01.5800095
-------- Package started: Project: Application3, Configuration: Debug x64 ------
Application3 -> D:\Cases_Code\Application3\Application3\pkg\Debug
-------- Package: Project: Application3 succeeded, Time elapsed: 00:00:00.7978341 --------
2>Started executing script 'Deploy-FabricApplication.ps1'.
2>. 'D:\Cases_Code\Application3\Application3\Scripts\Deploy-FabricApplication.ps1' -ApplicationPackagePath 'D:\Cases_Code\Application3\Application3\pkg\Debug' -PublishProfileFile 'D:\Cases_Code\Application3\Application3\PublishProfiles\Cloud.xml' -DeployOnly:$false -ApplicationParameter:@{} -UnregisterUnusedApplicationVersionsAfterUpgrade $false -OverrideUpgradeBehavior 'None' -OverwriteBehavior 'SameAppTypeAndVersion' -SkipPackageValidation:$false -ErrorAction Stop
2>Copying application package to image store...
2>Copy application package succeeded
2>Registering application type...
2>Register application type succeeded
2>Start upgrading application...
2>Unregister application type '@{FabricNamespace=fabric:; ApplicationTypeName=Application3Type; ApplicationTypeVersion=1.1.0}.ApplicationTypeName' and version '@{FabricNamespace=fabric:; ApplicationTypeName=Application3Type; ApplicationTypeVersion=1.1.0}.ApplicationTypeVersion' ...
2>Unregister application type started (query application types for status).
2>Start-ServiceFabricApplicationUpgrade : Default service descriptions can not be modified as part of upgrade.
2>Modified default service: fabric:/Application3/Stateless1. To allow it, set EnableDefaultServicesUpgrade to true.
2>At C:\Program Files\Microsoft SDKs\Service
2>Fabric\Tools\PSModule\ServiceFabricSDK\Publish-UpgradedServiceFabricApplication.ps1:248 char:13
2>+             Start-ServiceFabricApplicationUpgrade @UpgradeParameters
2>+             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2>    + CategoryInfo          : InvalidOperation: (Microsoft.Servi...usterConnection:ClusterConnection) [Start-Servi
2>   ceFabricApplicationUpgrade], FabricException
2>    + FullyQualifiedErrorId : UpgradeApplicationErrorId,Microsoft.ServiceFabric.Powershell.StartApplicationUpgrade
2>
2>Finished executing script 'Deploy-FabricApplication.ps1'.
2>Time elapsed: 00:00:22.5520036
2>The PowerShell script failed to execute.
========== Build: 1 succeeded, 0 failed, 1 up-to-date, 0 skipped ==========
========== Publish: 0 succeeded, 1 failed, 0 skipped ==========

Upon searching our internal discussion forum, I learned that this property needs to be updated from resources.azure.com or through PS.

By default, the instance count is set to “-1” in Cloud.xml or the application manifest XML. The value “-1” is the default and it deploys the service to all available nodes. In some situations we may need to reduce the instance count; if that is the case, follow either of the options below.

Option # 1 ( Update through resources.azure.com portal )

1) From the error message it is clear that the SF cluster expects the property “EnableDefaultServicesUpgrade” to be set to true to proceed with this upgrade.

2) This link talks about adding SF cluster settings from the resources.azure.com portal: https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-fabric-settings (refer to the steps at the top of the page).

3) Update your cluster settings to set this property and wait for at least 30-40 mins, depending on the number of nodes etc.

4) After this PUT command, you would see a small banner message saying the cluster is upgrading on the Portal.azure.com > SF cluster overview blade.

5) Wait till the upgrade banner goes away, then run the GET command from resources.azure.com to confirm that this value is reflected.
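
The edit made in step 3 can be sketched as follows. This is a hedged illustration in Python that works on the cluster resource JSON copied from resources.azure.com; it assumes the setting belongs to the “ClusterManager” section of “fabricSettings”, as listed in the cluster settings documentation linked above. The function name is hypothetical, and the sketch only prepares the JSON; it does not call Azure.

```python
import copy


def enable_default_services_upgrade(cluster_resource: dict) -> dict:
    """Return a copy of the cluster resource with the setting switched on.

    Assumption: EnableDefaultServicesUpgrade lives in the "ClusterManager"
    section of properties.fabricSettings.
    """
    updated = copy.deepcopy(cluster_resource)
    settings = updated.setdefault("properties", {}).setdefault("fabricSettings", [])

    # Find the ClusterManager section, or create it if it is missing.
    section = next((s for s in settings if s.get("name") == "ClusterManager"), None)
    if section is None:
        section = {"name": "ClusterManager", "parameters": []}
        settings.append(section)

    # Overwrite the parameter if it exists, otherwise add it.
    params = section.setdefault("parameters", [])
    for param in params:
        if param.get("name") == "EnableDefaultServicesUpgrade":
            param["value"] = "true"
            break
    else:
        params.append({"name": "EnableDefaultServicesUpgrade", "value": "true"})
    return updated


# Example: a resource that has no fabricSettings yet.
print(enable_default_services_upgrade({"properties": {}}))
```

The returned dict is what would be pasted back into the editor before issuing the PUT.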

Option # 2 ( Update through PS )

You can use the below PS to update the instance count directly on the running service.

# Replace with your client connection endpoint, e.g. abc.westus.cloudapp.azure.com:19000
$ClusterName = "<your client connection endpoint>"
$Certthumprint = "xxxxxx5a813118ef9cf523a4df13d"

# A backtick must be the last character on a line for the command to continue
Connect-ServiceFabricCluster -ConnectionEndpoint $ClusterName -KeepAliveIntervalInSec 10 `
    -X509Credential `
    -ServerCertThumbprint $Certthumprint `
    -FindType FindByThumbprint `
    -FindValue $Certthumprint `
    -StoreLocation CurrentUser `
    -StoreName My

Update-ServiceFabricService -Stateless fabric:/KeyPair.WebService/KeyPairAPI -InstanceCount 2

https://docs.microsoft.com/en-us/powershell/module/servicefabric/update-servicefabricservice?view=azureservicefabricps 

Final step:-

After the settings update, go back to Visual Studio (2017) and try publishing the app upgrade again. At this point, we should see the application getting deployed without any error.

You can confirm this by checking the number of nodes where this app is deployed. From the Service Fabric Explorer (SFX) portal, you can see our application deployed on just 2 nodes instead of all the available nodes.

I had a 3 node cluster where I set the instance count to 2 to see the reduction.

Note:- The only caveat here is that the manifest shown in the SFX portal won’t reflect this latest instance count value. It would still show “-1”, which you can ignore.

2017-05-24 Posted by | ARM, Azure, PaaS, Powershell, ServiceFabric | | 1 Comment

[Service Fabric] How to Secure a standalone cluster (On Prem)

This blog post is based on this article: https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-creation-for-windows-server; I ran into issues, so I would like to break it into step by step instructions for easier reference, along with precautions.

Steps 1 and 2 have the same objective. All we need here is to make sure NETWORK SERVICE is added and has the permission set on the certificate’s private key.

1) Install the certificate on the server node :- Console Root > Local Computer > Personal > Certificates > install at this level and hit refresh to confirm

After the installation, right click the cert > All Tasks > Manage Private Keys > add NETWORK SERVICE, keep the default permissions as they are and save. We should see “Allow” for Full control & Read permission.

2) Alternatively, you can achieve the same using PS. Open the PS ISE window and run the below PS in Admin mode to make this update. This step is optional if you have already performed step #1 manually.

param
(
    [Parameter(Position=1, Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$pfxThumbPrint,

    [Parameter(Position=2, Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$serviceAccount
)

$cert = Get-ChildItem -Path cert:\LocalMachine\My | Where-Object -FilterScript { $PSItem.ThumbPrint -eq $pfxThumbPrint; }

# Specify the user, the permissions and the permission type
$permission = "$($serviceAccount)","FullControl","Allow"
$accessRule = New-Object -TypeName System.Security.AccessControl.FileSystemAccessRule -ArgumentList $permission

# Location of the machine related keys
$keyPath = Join-Path -Path $env:ProgramData -ChildPath "Microsoft\Crypto\RSA\MachineKeys"
$keyName = $cert.PrivateKey.CspKeyContainerInfo.UniqueKeyContainerName
$keyFullPath = Join-Path -Path $keyPath -ChildPath $keyName

# Get the current acl of the private key
$acl = (Get-Item $keyFullPath).GetAccessControl('Access')

# Add the new ace to the acl of the private key
$acl.SetAccessRule($accessRule)

# Write back the new acl
Set-Acl -Path $keyFullPath -AclObject $acl -ErrorAction Stop

# Observe the access rights currently assigned to this certificate.
Get-Acl $keyFullPath | fl

----------------------

Parameter:-

On execution, enter your cert thumbprint and service account details as below.

pfxThumbPrint: AA4E00A783B246D53A88433xxxx55F493AC6D7
serviceAccount: NETWORK SERVICE

Output:-

Path   : Microsoft.PowerShell.Core\FileSystem::C:\ProgramData\Microsoft\Crypto\RSA\MachineKeys
Owner  : NT AUTHORITY\SYSTEM
Group  : NT AUTHORITY\SYSTEM
Access : Everyone Allow  Write, Read, Synchronize
         NT AUTHORITY\NETWORK SERVICE Allow  FullControl
         BUILTIN\Administrators Allow  FullControl
Audit  :
Sddl   : O:SYG:SYD:PAI(A;;0x12019f;;;WD)(A;;FA;;;NS)(A;;FA;;;BA)

3) Step (1 or 2) is the only change required on the server side for the certificate.

Now download the package from the “Download the Service Fabric standalone package” link at https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-creation-for-windows-server and extract it to, say, C:\WindowsServiceFabricCluster

4) Pick the “ClusterConfig.X509.DevCluster” JSON template file, update it with your thumbprint and save it.

PS:- I have removed the secondary certificate and proxy certificate sections for simplicity

   "security": {
            "metadata": "The Credential type X509 indicates this is cluster is secured using X509 Certificates. The thumbprint format is - d5 ec 42 56 b9 d5 31 24 25 42 64.",
            "ClusterCredentialType": "X509",
            "ServerCredentialType": "X509",
            "CertificateInformation": {
                "ClusterCertificate": {
                    "Thumbprint": "AA4E00A783B246D53Axxxxx3203855F493AC6D7",
                    "X509StoreName": "My"
                },
                "ServerCertificate": {
                    "Thumbprint": "AA4E00A783B246D53A8xxxxxx3855F493AC6D7",
                    "X509StoreName": "My"
                },
                "ClientCertificateThumbprints": [
                    {
                        "CertificateThumbprint": "AA4E00A783B24xxxxx203855F493AC6D7",
                        "IsAdmin": false
                    },
                    {
                        "CertificateThumbprint": "AA4E00A783B246D5xxxxxx203855F493AC6D7",
                        "IsAdmin": true
                    }
                ]
            }
        },
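
Before creating the cluster, it can save time to check that every thumbprint in this section looks sane. The following is a rough sketch, assuming the “security” section has been loaded into a Python dict and that a thumbprint is a SHA-1 hash, i.e. 40 hex digits once any spaces are removed. The function name is hypothetical; the redacted sample values shown above would be reported as bad, which is expected.

```python
import re

# A SHA-1 thumbprint is 40 hexadecimal digits.
SHA1_HEX = re.compile(r"^[0-9A-Fa-f]{40}$")


def find_bad_thumbprints(security: dict) -> list:
    """Return the names of entries whose thumbprint is not 40 hex digits."""
    info = security.get("CertificateInformation", {})
    found = []
    for key in ("ClusterCertificate", "ServerCertificate"):
        if key in info:
            found.append((key, info[key].get("Thumbprint", "")))
    for index, client in enumerate(info.get("ClientCertificateThumbprints", [])):
        name = f"ClientCertificateThumbprints[{index}]"
        found.append((name, client.get("CertificateThumbprint", "")))
    # Spaces are tolerated; anything else (including hidden characters) fails.
    return [name for name, value in found
            if not SHA1_HEX.match(value.replace(" ", ""))]


# Illustration with made-up values: one valid, one too short.
sample = {
    "CertificateInformation": {
        "ClusterCertificate": {"Thumbprint": "A" * 40},
        "ServerCertificate": {"Thumbprint": "AA4E00"},
    }
}
print(find_bad_thumbprints(sample))
```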

5) Now run the PS script from this directory to create the cluster

PS C:\WindowsServiceFabricCluster> .\CreateServiceFabricCluster.ps1 -ClusterConfigFilePath .\ClusterConfig.X509.DevCluster.json -AcceptEULA

Creating Service Fabric Cluster...
If it's taking too long, please check in Task Manager details and see if Fabric.exe for each node is running. If not, please look at: 1. traces in DeploymentTraces directory and 2. traces in FabricLogRoot configured in ClusterConfig.json.
Trace folder already exists. Traces will be written to existing trace folder: C:\temp\Microsoft.Azure.ServiceFabric.WindowsServer\DeploymentTraces
Running Best Practices Analyzer...
Best Practices Analyzer completed successfully.
Creating Service Fabric Cluster...
Processing and validating cluster config.
Configuring nodes.
Default installation directory chosen based on system drive of machine 'localhost'.
Copying installer to all machines.
Configuring machine 'localhost'.
Machine localhost configured.
Running Fabric service installation.
Successfully started FabricInstallerSvc on machine localhost
Successfully started FabricHostSvc on machine localhost
Your cluster is successfully created! You can connect and manage your cluster using Microsoft Azure Service Fabric Explorer or Powershell. To connect through Powershell, run 'Connect-ServiceFabricCluster [ClusterConnectionEndpoint]'.

6) At this stage, we should see the cluster creation success message shown above. With that, we are done creating and securing the cluster.

7) Now on the client side/end user machine, when we try to browse the secured cluster over IE, we should see a dialog prompt asking for a certificate. It means it is working as expected; so far so good.

8) Now install the client certificate on your client machine. For simplicity’s sake, I am using the same machine as client and server. But the certificate has to be installed under Current User when accessing the cluster over IE.

Certmgr > Current User > Personal > Certificates.

9) Now we are ready to access the cluster. Browse the cluster URL, say https://localhost:19080/, and we should see a cert selection dialog displayed.

 

How to create self signed certificate (PFX):- (Optional)

-----------------------------------------------------------------

1) Open a PS window > run this script as .\CertSetup.ps1 -Install.

The CertSetup.ps1 script is present inside the Service Fabric SDK folder, in the directory C:\Program Files\Microsoft SDKs\Service Fabric\ClusterSetup\Secure. You can edit this file if you do not want certain things in that PS.

2) Export .cer to PFX

$pswd = ConvertTo-SecureString -String "1234" -Force -AsPlainText

Get-ChildItem -Path cert:\localMachine\my\<Thumbprint> | Export-PfxCertificate -FilePath C:\mypfx.pfx -Password $pswd

Precaution:-

  1. How to: Retrieve the Thumbprint of a Certificate
    https://msdn.microsoft.com/en-us/library/ms734695(v=vs.110).aspx
  2. Remove the invisible chars in the thumbprint (use Notepad++ > Encoding > Encode in ANSI to reveal the invisible chars). Don’t use Notepad. http://stackoverflow.com/questions/11115511/how-to-find-certificate-by-its-thumbprint-in-c-sharp
  3. These two PS scripts help to remove the cluster or clean up a previous installation: .\RemoveServiceFabricCluster.ps1 and .\CleanFabric.ps1
  4. Make sure to use the PFX and not the cert. Just in case you run into some environment problem during the dev stage, it is better to reimage and retry.
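
For precaution 2, the invisible characters can also be spotted and stripped with a few lines of script instead of an editor. This is only a sketch in Python with hypothetical helper names; the pasted value is a made-up, shortened example that carries a left-to-right mark (U+200E), a character commonly picked up when copying from the certificate dialog.

```python
import string


def hidden_chars(raw: str) -> list:
    """List characters that are neither hex digits nor spaces, as U+XXXX codes."""
    return [f"U+{ord(ch):04X}" for ch in raw
            if ch not in string.hexdigits and ch != " "]


def clean_thumbprint(raw: str) -> str:
    """Keep only the hex digits and upper-case them."""
    return "".join(ch for ch in raw if ch in string.hexdigits).upper()


# Made-up, shortened value with a hidden left-to-right mark in front.
pasted = "\u200eaa 4e 00 a7"
print(hidden_chars(pasted))      # ['U+200E']
print(clean_thumbprint(pasted))  # AA4E00A7
```

Note that the cleaning step silently drops anything that is not a hex digit, so it is worth checking the list returned by hidden_chars first to see what was removed.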

Hope this helps. Let me know if you see/need any change in this.

 

2017-01-27 Posted by | Azure, Microservices, ServiceFabric | | Leave a comment

[Service Fabric] SF node fails to read DNS conf and fix

Recently an SF developer reported a problem where Azure Service Fabric > ImageStoreService (ISS) displayed a warning message because one of his secondary nodes was down. This node went “down” all of a sudden without any change to his cluster/application. In the SF Explorer portal, we noticed a brief warning message saying that the node was down due to some unhealthy event.

SF Explorer – warning


Error message

Unhealthy event: SourceId=’System.PLB’, Property=’ServiceReplicaUnplacedHealth_Secondary_00000000-0000-0000-0000-000000003000′, HealthState=’Warning’, ConsiderWarningAsError=false.
The Load Balancer was unable to find a placement for one or more of the Service’s Replicas:
ImageStoreService Secondary Partition 00000000-0000-0000-0000-000000003000 could not be placed, possibly, due to the following constraints and properties: 
TargetReplicaSetSize: 5
Placement Constraint: NodeTypeName==sf**type
Depended Service: ClusterManagerServiceName

Constraint Elimination Sequence:
ReplicaExclusionStatic eliminated 3 possible node(s) for placement — 2/5 node(s) remain.
PlacementConstraint + ServiceTypeDisabled/NodesBlockListed eliminated 0 + 1 = 1 possible node(s) for placement — 1/5 node(s) remain.
ReplicaExclusionDynamic eliminated 1 possible node(s) for placement — 0/5 node(s) remain.

Nodes Eliminated By Constraints:

ReplicaExclusionStatic — No Colocations with Partition’s Existing Secondaries/Instances:
FaultDomain:fd:/1 NodeName:_sf**type_1 NodeType:sf**type NodeTypeName:sf**type UpgradeDomain:1 UpgradeDomain: ud:/1 Deactivation Intent/Status: None/None
FaultDomain:fd:/4 NodeName:_sf**type_4 NodeType:sf**type NodeTypeName:sf**type UpgradeDomain:4 UpgradeDomain: ud:/4 Deactivation Intent/Status: None/None
FaultDomain:fd:/3 NodeName:_sf**type_3 NodeType:sf**type NodeTypeName:sf**type UpgradeDomain:3 UpgradeDomain: ud:/3 Deactivation Intent/Status: None/None

PlacementConstraint + ServiceTypeDisabled/NodesBlockListed — PlacementProperties must Satisfy Service’s PlacementConstraint, and Nodes must not have had the ServiceType Disabled or be BlockListed due to Node’s Pause/Deactivate Status:
FaultDomain:fd:/2 NodeName:_sf**type_2 NodeType:sf**type NodeTypeName:sf**type UpgradeDomain:2 UpgradeDomain: ud:/2 Deactivation Intent/Status: None/None

ReplicaExclusionDynamic — No Colocations with Partition’s Existing Primary or Potential Secondaries:

FaultDomain:fd:/0 NodeName:_sf**type_0 NodeType:sf**type NodeTypeName:sf**type UpgradeDomain:0 UpgradeDomain: ud:/0 Deactivation Intent/Status: None/None

Noticed below health warning event

There was no data from ‘sf**type_2’ after the failing date, so we confirmed that both ISS and FabricDCA.exe were crashing
                          N/S RD sf**type_3 Up 131212…946953
                          N/S RD sf**type_1 Up 131212…593463858
                          N/S RD sf**type_4 Up 13121…..3859
                          N/I SB sf**type_2 Down 1312…..63860
                          N/P RD sf**type_0 Up 131212…..61

Event Log

Log Name:      Application
Source:        Microsoft-Windows-PerfNet
Date:          12/9/2016 7:42:40 AM
Event ID:      2005
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      sf***ype00xxx02
Description:
Unable to read performance data for the Server service. The first four bytes (DWORD) of the Data section contains the status code, the second four bytes contains the IOSB.Status and the next four bytes contains the IOSB.Information.

C0000466 00000000 634A41F0

Log Name:      Microsoft-ServiceFabric/Admin
Source:        Microsoft-ServiceFabric
Date:          12/12/2016 11:29:13 AM
Event ID:      59904
Task Category: FabricDCA
Level:         Error
Keywords:      Default
User:          NETWORK SERVICE
Computer:      sf***ype00xxx02

Description:
Failed to copy file D:\SvcFab\Log\PerformanceCounters_ServiceFabricPerfCounter\fabric_counters_6361xxxxx47065191_000940.blg to Azure blob account sf***ype00xxx02, container fabriccounters-2b73743xxxxxxa46d111c4d5.

Microsoft.WindowsAzure.Storage.StorageException: The remote name could not be resolved: ‘sf***ype00xxx02.blob.core.windows.net’ —> System.Net.WebException: The remote name could not be resolved: ‘sf***ype00xxx02.blob.core.windows.net’
   at System.Net.HttpWebRequest.GetRequestStream(TransportContext& context)
   at System.Net.HttpWebRequest.GetRequestStream()
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteSync[T](RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext)
   — End of inner exception stack trace —
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteSync[T](RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext)
   at Microsoft.WindowsAzure.Storage.Blob.CloudBlockBlob.UploadFromStreamHelper(Stream source, Nullable`1 length, AccessCondition accessCondition, BlobRequestOptions options, OperationContext operationContext)
   at FabricDCA.AzureFileUploader.CreateStreamAndUploadToBlob(String sourceFile, CloudBlockBlob destinationBlob)
   at FabricDCA.AzureFileUploader.CopyFileToDestinationBlobWorker(FileCopyInfo fileCopyInfo, CloudBlockBlob destinationBlob)
   at FabricDCA.AzureFileUploader.CopyFileToDestinationBlob(Object context)
   at System.Fabric.Dca.Utility.PerformWithRetries[T](Action`1 worker, T context, RetriableOperationExceptionHandler exceptionHandler, Int32 initialRetryIntervalMs, Int32 maxRetryCount, Int32 maxRetryIntervalMs)
   at FabricDCA.AzureFileUploader.CopyFileToDestination(String source, String sourceRelative, Int32 retryCount, Boolean& fileSkipped)

Check list tried:

  • We checked the free space on drive “D:\” of the failing node; it had enough space.
  • We were able to RDP into the machine (failing node) but were not able to browse any web sites or nslookup any URL.
        >nslookup sf***ype00xxx02.blob.core.windows.net
        server:unknown
        *** unknown can’t find sf***ype00xxx02.blob.core.windows.net: No response from server
  • The logs confirmed that this particular failing node VM was going through some network related issue, which is why it was not able to connect to any of the services or the storage account.
  • There were no fabric logs after the issue start date, which also confirmed that this VM had lost its connectivity.
  • Checked for crash dumps under D:\SvcFab\Log\CrashDumps: no dumps.
  • Checked the traces from D:\SvcFab\Log\Traces: did not get any hint.

Fix/resolution:

  • With all the above findings, we confirmed that the failing node _sf***ype_2 was not resolving DNS for some reason. This issue occurs very rarely, due to corruption at the OS level.
  • From the registry we could see that it had received the proper DNS settings from the Azure DHCP server.
  • “HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\DhcpNameServer” was set to “16x.6x.12x.1x”, but the affected machine was not able to read and use this DNS configuration, due to which name resolution was broken at the operating system level.
  • To overcome this issue, we ran “netsh int ip reset c:\reset.log” and “netsh winsock reset catalog” to reset the IP stack and the Windows socket catalog, and rebooted the virtual machine, which eventually resolved the issue.
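
The name resolution check done by hand with nslookup can also be scripted, for example to confirm the fix after the reboot. This is a small sketch in Python using only the standard library; the function name is hypothetical and the host name in the usage comment is a placeholder, not the real storage account. The resolver is passed in as a parameter so the function can be exercised without network access.

```python
import socket


def can_resolve(hostname: str, resolver=socket.getaddrinfo) -> bool:
    """Return True if the host name resolves, False if the lookup fails."""
    try:
        resolver(hostname, 443)
        return True
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised when DNS fails.
        return False


# Usage on the affected node (placeholder host name):
#   can_resolve("<storage-account>.blob.core.windows.net")
```

A result of False for a name that resolves fine from a healthy node points to the same kind of local DNS problem described above.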

Reference article :https://support.microsoft.com/en-us/kb/299357

Let me know if this helps in some way.

2016-12-18 Posted by | Azure, ServiceFabric, VMSS | , , | Leave a comment
