Bigdata Engineer Resume
Philadelphia, PA
SUMMARY
- 7+ years of experience in application development/architecture and data analytics, specializing in web, client-server, and Big Data applications, with expertise in Java, Scala, Python, Spark, Hadoop MapReduce, Pig, Hive, Oozie, Sqoop, and NoSQL databases.
- Strong experience in frameworks and tools such as Spring MVC, Struts, Hibernate, Spark, HBase, Hive, Hadoop, Sqoop, and HDFS. Expertise in developing Spark code using Scala and Spark-SQL/Streaming for faster testing and data processing.
- Experience using Sqoop to import data into HDFS from RDBMS. Experienced in improving the performance and optimization of existing Hadoop algorithms with Spark using Spark Context, Spark-SQL, DataFrames, pair RDDs, and Spark on YARN.
- Excellent knowledge of Hadoop architecture and ecosystem components such as HDFS, Job Tracker, Task Tracker, NameNode, DataNode, YARN, and the MapReduce programming paradigm. Good understanding of cloud configuration in Amazon Web Services (AWS).
- Experienced working with Hadoop Big Data technologies (HDFS and MapReduce programs), Hadoop ecosystem tools (HBase, Hive), and the NoSQL database MongoDB.
- Experienced in using the column-oriented NoSQL database HBase along with Hive. Extensive experience working with semi-structured and unstructured data by implementing complex MapReduce programs using design patterns.
- Knowledge of BI/DW solutions (ETL, OLAP, data marts), Informatica, and BI reporting tools such as Tableau and QlikView. Expertise in loading data from different sources (Teradata and DB2) into HDFS using Sqoop and loading it into partitioned Hive tables.
- Experienced in Hadoop cluster maintenance, including data and metadata backups, file system checks, commissioning and decommissioning nodes, and upgrades.
- Wrote Python scripts to manage AWS resources through API calls using the Boto SDK and worked with the AWS CLI (a minimal sketch appears after this summary).
- In-depth knowledge of AWS cloud services for compute, networking, storage, and identity & access management.
- Hands-on experience configuring network architecture on AWS with VPCs, subnets, Internet gateways, NAT, and route tables.
- Significant expertise in analyzing massive data sets and producing Pig scripts and Hive queries; substantial experience working with structured data using HiveQL, join operations, custom UDFs, and Hive query optimization.
- Expertise in transforming business requirements into analytical models, designing algorithms, building models using data warehouse techniques, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
- Experienced in importing and exporting data between HDFS and relational databases using Sqoop, with expertise in job workflow scheduling and monitoring tools such as Oozie and TWS (Tivoli Workload Scheduler).
- Involved in integrating applications with tools such as TeamCity, GitLab, Bitbucket, and JIRA for issue and story tracking.
- Extensive experience with SQL, stored procedures, functions, and triggers on databases such as Oracle, IBM DB2, and MS SQL Server.
- Familiar with Angular 2 and TypeScript for building application components. Hands-on experience in system and network administration and knowledge of cloud technology and distributed computing.
- Experience in developing Spark applications using Spark-SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover customer usage patterns.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.
- Experienced in automating, configuring, and deploying instances on AWS, Azure environments, and data centers; familiar with EC2, CloudWatch, CloudFormation, and managing security groups on AWS.
- Private cloud environment: leveraged AWS and Puppet to rapidly provision internal computer systems for various clients.
- Developed Puppet modules and roles/profiles for installing and configuring software required for various applications/blueprints.
- Excellent communication, presentation, and interpersonal skills. Ability to work independently or in a group with minimal supervision to meet deadlines.
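The Boto scripting mentioned above can be illustrated with a brief sketch. This is a hedged, hypothetical example, not the original scripts, assuming the boto3 SDK; the region, the Owner tag, the bucket name, and the file paths are placeholders introduced for illustration.

```python
# Minimal sketch, assuming the boto3 SDK: stop untagged EC2 instances and
# upload a report to S3. Region, tag key, bucket, and paths are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")


def stop_untagged_instances():
    """Stop running EC2 instances that are missing an 'Owner' tag."""
    reservations = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    for reservation in reservations:
        for instance in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            if "Owner" not in tags:
                ec2.stop_instances(InstanceIds=[instance["InstanceId"]])


def upload_report(local_path, bucket, key):
    """Upload a local report file to an S3 bucket."""
    s3.upload_file(local_path, bucket, key)


if __name__ == "__main__":
    stop_untagged_instances()
    upload_report("daily_report.csv", "example-bucket", "reports/daily_report.csv")
```

The same operations can also be issued from the AWS CLI (for example, aws ec2 describe-instances); the SDK form is shown because the summary references Boto.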
TECHNICAL SKILLS
Big Data: Apache Spark, Hadoop, HDFS, Map Reduce, Pig, Hive, Sqoop, Oozie, Flume, HBase, YARN, Cassandra, Phoenix, Airflow
Frameworks: Hibernate, Spring, Cloudera CDH, Hortonworks HDP, MapR
Programming & Scripting Languages: Java, Python, R, C, C++, HTML, JavaScript, XML, Git
Database: Oracle 10g/11g, PostgreSQL, DB2, SQL Server, MySQL, Redshift
NoSQL Database: HBase, Cassandra, MongoDB
IDE: Eclipse, NetBeans, Maven, STS (Spring Tool Suite), Jupyter Notebook
ETL Tools: Pentaho, Informatica, Talend
Reporting Tools: Tableau, Power BI
Operating Systems: Windows, UNIX, Linux, Sun Solaris
Testing Tools: Junit, MRUnit
AWS: EMR, Glue, Athena, DynamoDB, Redshift, RDS, Data Pipeline, Lake Formation, S3, IAM, CloudFormation, EC2, ELB/CLB
Azure: Data Lake, Data Factory, SQL Data Warehouse, Data Lake Analytics, Databricks, and other Azure services
PROFESSIONAL EXPERIENCE
Confidential, Philadelphia, PA
Bigdata Engineer
Responsibilities:
- Prepared design blueprints and application flow documentation, gathering requirements from the business.
- Expertise in Microsoft Azure Cloud Services (PaaS & IaaS), Application Insights, Document DB, Internet of Things (IoT), Azure Monitoring, Key Vault, Visual Studio Online (VSO), and SQL Azure.
- Hands-on experience in Azure development; worked on Azure web applications, App Services, Azure Storage, Azure SQL Database, Virtual Machines, Fabric Controller, Azure AD, Azure Search, and Notification Hubs.
- Designed, configured, and deployed Microsoft Azure for many applications utilizing the Azure stack (including Compute, Web & Mobile, Blobs, Resource Groups, Azure SQL, Cloud Services, and ARM), focusing on high availability, fault tolerance, and auto-scaling.
- Worked on data pre-processing and cleaning to perform feature engineering and performed data imputation techniques for the missing values in the dataset using Python.
- Utilized Apache Spark with Python to develop and execute Big Data analytics and machine learning applications; directed machine learning use cases with Spark ML and MLlib.
- Maintained the data in the Data Lake (ETL) coming from the Teradata database, writing an average of 80 GB daily. Overall, the data warehouse held 5 PB of data and used a 135-node cluster to process it.
- Responsible for creating Hive tables to load data from MySQL using Sqoop and writing Java snippets to perform cleaning, pre-processing, and data validation.
- Experienced in creating Hive schemas and external tables and managing views. Worked on performing join operations in Spark using Hive, writing HQL statements per user requirements.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and working with the Spark shell. Developed Spark code using Java and Spark-SQL for faster testing and data processing.
- Imported millions of structured records from relational databases using Sqoop, processed them with Spark, and stored the results in HDFS in Parquet format (a minimal PySpark sketch of this flow follows this list).
- Worked on data cleaning and reshaping and generated segmented subsets using NumPy and Pandas in Python.
- Used Spark SQL to process the massive volume of structured data and implemented Spark DataFrame transformations along with the steps to migrate MapReduce algorithms.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark-SQL, DataFrames, pair RDDs, and Spark on YARN.
- Used DataFrame API solutions to pre-process massive volumes of structured data in various file formats, including text files, CSV, Sequence files, XML, JSON, and Parquet, and then turned the data into named columns.
- Ensured necessary system security using best-in-class AWS cloud security solutions. Experienced in deploying Java projects using Maven/Ant and Jenkins.
- DevOps and CI/CD pipeline knowledge, mainly TeamCity and Selenium; implemented continuous integration/delivery (CI/CD) pipelines in AWS when necessary.
- Experienced in batch processing of data sources using Apache Spark and in developing predictive analytics using the Apache Spark Java APIs. Expert in implementing advanced procedures such as text analytics using in-memory computing capabilities with Apache Spark in Java.
- Worked extensively on the Spark core and Spark SQL modules and used broadcast variables and accumulators for better performance.
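As referenced in the Sqoop/Parquet bullet above, the following is a minimal PySpark sketch of the load-and-aggregate flow described in this role. It is an illustrative assumption, not the project code; the HDFS paths, column names, and view name are hypothetical.

```python
# Minimal PySpark sketch: read Sqoop-landed Parquet from HDFS, enrich with a
# broadcast join, aggregate via Spark SQL, and write Parquet back to HDFS.
# All paths, columns, and names below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("customer-usage").getOrCreate()

# Parquet files landed in HDFS by the Sqoop import
transactions = spark.read.parquet("hdfs:///data/landing/transactions")
# Small dimension table, broadcast to every executor to avoid a shuffle join
accounts = spark.read.parquet("hdfs:///data/landing/accounts")

enriched = transactions.join(broadcast(accounts), on="account_id", how="left")

# Hive/SQL-style aggregation expressed through Spark SQL
enriched.createOrReplaceTempView("enriched_txn")
daily_usage = spark.sql("""
    SELECT account_id,
           to_date(txn_ts) AS txn_date,
           COUNT(*)        AS txn_count,
           SUM(amount)     AS total_amount
    FROM enriched_txn
    GROUP BY account_id, to_date(txn_ts)
""")

# Write back to HDFS as Parquet, partitioned for downstream Hive tables
daily_usage.write.mode("overwrite").partitionBy("txn_date") \
    .parquet("hdfs:///data/curated/daily_usage")
```

Broadcasting the small dimension table avoids a shuffle on the join, which mirrors the broadcast-based performance tuning noted in the bullets above.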
Environment: Hadoop, HDFS, Hive, Java 1.7, Spark 1.6, SQL, HBase, UNIX Shell Scripting, MapReduce, Putty, WinSCP, IntelliJ, Teradata, Linux.
Confidential, Orlando, FL
Big Data Engineer
Responsibilities:
- Used Apache Spark for data analytics such as filtering, join enrichment, Spark SQL, and Spark Streaming.
- Installed & configured multi-node Hadoop cluster for data store & processing.
- Used SSMS to access, configure, and administer all components of SQL Server, Azure SQL Database, and Azure Synapse Analytics. Developed Hive UDFs and Pig UDFs using Python in a Microsoft HDInsight environment.
- Importing and exporting data into HDFS, HBase, and Hive using Sqoop. Involved in various projects related to Data Modeling, System/Data Analysis, Design, and Development for both OLTP and Data warehousing environments.
- Design, develop, test, deploy, maintain, and improve data integration pipeline objects developed using Apache Spark / Pyspark / Python or Scala.
- Managed the development and performance of SQL databases for web applications, businesses, and organizations using SQL server management studio (SSMS).
- Engineered, designed, developed, and advanced Query Processing and Self-Tuning functionality using Synapse SQL. Demonstrated strength in data modeling, ETL development, and data warehousing
- Loaded and transformed large sets of structured and semi-structured data. Implemented solutions using Hadoop. Assisted management with tool and metrics development, data interpretation and analysis, and process improvement in PHP.
- Solid familiarity with Azure's analytics stack: Data Lake, Data Explorer/Kusto, Storage, Data Factory, Synapse, Databricks, and HDInsight.
- Worked with engineers, developers, and QA to develop current and future applications related to the content management line of business.
- Used Oozie and ZooKeeper operational services for coordinating the cluster and scheduling workflows. Implemented ML programs to analyze large datasets in the warehouse for BI purposes.
- Implemented a log producer that watches application logs, transforms incremental records, and sends them to Kafka (coordinated through ZooKeeper); developed a Python wrapper to run this application alongside other applications (a minimal producer sketch follows this list).
- Worked with the most commonly used Talend components, such as tMap, tFilterRow, tAggregateRow, tConvertType, tFlowMeter, tLogCatcher, tRowGenerator, tSetGlobalVar, tHashInput, tHashOutput, tFileExist, tFileCopy, tFileList, tDie, and many more.
- Experienced in build automation using Jenkins, Maven, and Ant. Created a de-normalized BigQuery schema for analytical and reporting requirements.
- Worked on installing and configuring Hortonworks HDP 2.x and Cloudera (CDH 5.5.1) clusters in the development and production environments.
- Worked on capacity planning for the production cluster.
- Installed the Hue browser.
- Involved in loading data from UNIX file system to HDFS using Sqoop.
- Involved in creating Hive tables, loading the data, and writing Hive queries that run internally as MapReduce jobs.
- Worked on installing Hortonworks HDP 2.1 on Azure Linux servers.
- Worked on Configuring Oozie Jobs.
- Worked on configuring High Availability for the NameNode in HDP 2.1.
- Worked on Configuring Kerberos Authentication in the cluster.
- Worked on upgrading the Hadoop cluster from HDP 2.1 to HDP 2.3.
- Worked on Configuring queues in capacity scheduler.
- Worked on installing and configuring Solr 5.2.1 in Hadoop cluster.
- Worked on taking Snapshot backups for HBase tables.
- Worked on SnowSQL and Snowpipe
- Converted Talend joblets to support Snowflake functionality.
- Created Snowpipe for continuous data load.
- Created Talend Mappings to populate the data into dimensions and fact tables.
- Wrote ETL jobs to read from web APIs using REST and HTTP calls and load the data into HDFS using Java and Talend.
- Used Talend Big Data components such as the Hadoop and S3 bucket components and AWS services for Redshift.
- Wrote, maintained, reviewed, and documented modules, manifests, and Git repositories for Puppet Enterprise on RHEL and Windows platforms.
- Managed Ansible playbooks with Ansible roles, group variables, and inventory files, copying and removing files on remote systems using the file module.
- Administered and supported the GitHub Enterprise version control tool. Set up databases in GCP using RDS, storage using S3 buckets, and configured instance backups to S3. Prototyped a CI/CD system with GitLab on GKE, utilizing Kubernetes and Docker as the runtime environment for the CI/CD systems to build, test, and deploy.
- Worked with orchestration tools such as Terraform and Chef and leveraged modern tools such as Vault, Consul, Kubernetes, Docker, and Kafka.
- Worked in a hybrid environment spanning VMware, Azure, and Google Cloud Platform.
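A minimal sketch of the Python log producer described above (watching application logs and publishing incremental records to Kafka). It assumes the kafka-python client; the broker address, topic name, and log path are hypothetical placeholders.

```python
# Minimal sketch of a log-watching Kafka producer, assuming kafka-python.
# Broker, topic, and log path are illustrative placeholders.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)


def tail(path):
    """Yield new lines appended to the application log file."""
    with open(path) as log_file:
        log_file.seek(0, 2)          # start at end of file, send only new records
        while True:
            line = log_file.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line.rstrip("\n")


for line in tail("/var/log/app/application.log"):
    # Transform the incremental record before publishing it
    producer.send("application-logs", {"raw": line, "ingested_at": time.time()})
```

In practice a wrapper like this would run under a process supervisor alongside the other applications it was packaged with.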
Environment: Scala 10.4, Apache Spark 1.6.2, Apache Hadoop 2.6, HDFS, MapReduce, Pig, Hive, Sqoop, Flume, HBase, JBoss 6.1 server, Oracle DB 10g, Kafka, Tableau, Talend Data Integration 6.1/5.5.1, Talend Enterprise Big Data Edition 5.5.1, Talend Administrator Console, AWS cloud services, ETL pipelines, Jenkins, Maven, Agile Scrum, Google Cloud Platform.
Confidential, Boston, MA
Bigdata Engineer
Responsibilities:
- Involved in all phases of the SDLC (Software Development Life Cycle), including requirement collection, design, analysis, development, and application deployment.
- Architected and designed solutions for business requirements, creating Visio diagrams for designing and developing the application and deploying it to various environments.
- Developed Spark 2.1/2.4 Scala components to process the business logic and store the computation results of 10 TB of data in the HBase database, accessed by downstream web apps through the Big SQL (DB2) database.
- Uploaded and processed more than ten terabytes of data from various structured and unstructured sources into HDFS using Sqoop and Flume. Tested the developed modules in the application using the JUnit library and testing framework.
- Analyzed structured, unstructured, and file system data and loaded it into HBase tables per project requirements using IBM Big SQL with the Sqoop mechanism, processing the data using Spark SQL in-memory computation and writing the results to Hive and HBase.
- Handled integrating additional enterprise data into HDFS using JDBC, loading Hadoop data into Big SQL, and performing transformations on the fly using the Spark API to develop the standard learner data model, which obtains data from upstream in near real time and persists it into HBase.
- Worked with different file structures and Hive file formats such as text files, Sequence files, ORC, Parquet, and Avro to analyze the data and build a data model, reading them from HDFS, processing them through Parquet files, and loading them into HBase tables.
- Developed batch jobs using the Scala programming language to process the data from files and tables, transform it with the business logic, and deliver it to the user.
- Worked on the Continuous Deployment module, which is used to create new tables or update existing table structures as needed in different environments, along with DDL (Data Definition Language) creation for the tables. Also wrote Azure PowerShell scripts to copy or move data from the local file system to HDFS/Blob storage.
- Created pipelines in ADF (Azure Data Factory) using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob storage, and Azure SQL Data Warehouse, and to write data back.
- Developed JSON Scripts for deploying the Azure Data Factory (ADF) Pipeline that processes the data.
- Loaded data from the Linux/Unix file system to HDFS and worked with PuTTY for communication between Unix and Windows systems and for accessing data files in the Hadoop environment.
- Developed and implemented HBase capabilities for a large de-normalized data set and then applied transformations on the de-normalized data set using Spark/Scala.
- Involved in Spark tuning to improve job performance based on Pepperdata monitoring tool metrics. Worked on building application platforms in the cloud by leveraging Azure Databricks.
- Experience in developing Spark applications using Spark-SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Applied the Spark DataFrame API to complete data manipulation within the Spark session.
- Imported data from various systems/sources such as MySQL into HDFS.
- Involved in creating tables and applying HiveQL on them for data validation.
- Developed shell scripts for configuration checks and file transformations to be run before loading the data into the Hadoop landing area (HDFS).
- Developed and implemented a custom Spark ETL component to extract data from upstream systems, push it to HDFS, and finally store it in HBase in wide-row format. Worked with the Apache Hadoop environment from Hortonworks.
- Enhanced the application with new features and improved performance across all modules of the application.
- Exposure to Microsoft Azure in the process of migrating on-prem data to the Azure cloud, studying and implementing Spark techniques such as partitioning the data by keys and writing it to Parquet files, which boosts performance.
- Understood the mapping documents and existing source data, prepared load strategies for the different source systems, and implemented them using Hadoop technology.
- Worked with continuous integration tools such as Maven, TeamCity, and IntelliJ, and scheduled jobs with the TWS (Tivoli Workload Scheduler) tool, creating and cloning jobs and job streams in TWS and promoting them to higher environments.
- Contributed to DevOps efforts to automate the build and deployment process using Jenkins, shell scripting, Chef, Python, AWS Lambda, and CloudFormation templates.
- Built on-premise data pipelines using Kafka and Spark Streaming with the feed from the API streaming gateway REST service (a minimal streaming sketch follows this list).
- Implemented the application using Spring Boot Framework and handled the security using Spring Security.
- Used a microservice architecture with Spring Boot-based services interacting through a combination of REST and Apache Kafka message brokers, and worked with the Kafka cluster using ZooKeeper.
- Coordinated with co-developers, the agile development and project management teams, and external systems, and was responsible for demo presentations of developed modules to the project management team.
- Performed code review activities with peer developers and the team architect to deliver an exception/error-free application, tested it with real-time test scenarios, and deployed it to the next-level environments. Analyzed and fixed data processing issues and troubleshot the process.
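A minimal sketch of the Kafka-to-storage streaming pipeline referenced above, written here in PySpark Structured Streaming for illustration (the project components themselves were developed in Scala). The broker, topic, schema, and HDFS paths are assumptions, not project values.

```python
# Minimal PySpark Structured Streaming sketch of a Kafka-to-HDFS pipeline.
# Broker, topic, schema, and paths below are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("event-stream").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("account_id", StringType()),
    StructField("event_ts", TimestampType()),
])

# Read the feed published by the API streaming gateway into a Kafka topic
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "gateway-events")
       .load())

events = (raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# Persist to HDFS as Parquet; the checkpoint directory makes the stream restartable
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/streams/gateway_events")
         .option("checkpointLocation", "hdfs:///checkpoints/gateway_events")
         .outputMode("append")
         .start())

query.awaitTermination()
```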
Environment: Scala, Java, Spark framework, Linux, Jira, Bitbucket, IBM Big SQL, Hive, HBase, IntelliJ IDEA, Maven, Db2 Visualizer, ETL, TeamCity, WinSCP, PuTTY, IBM TWS (Tivoli Workload Scheduler), Windows, Azure Data Factory, Linux.
Confidential
Software Engineer
Responsibilities:
- Analyzed requirements and created detailed Technical Design documents. Studied functional specifications and reviewed changes.
- Created data transfer logic to convert data from other formats to an XML file for the billing module. Used an Oracle database to design the database schema and create the database structure, tables, and relationship diagrams.
- Used WebSphere 4.0 as the application server, developing JSPs for the front end and Servlets and Session Beans in the middle tier.
- Wrote the test cases for the Payment module.
- Designed and developed the DCB and Data Transmission modules. Migrated hardcoded account numbers to the database.
Environment: Java, J2EE, JSP, Servlets, JavaScript, Custom Tags, JDBC, XML, JAXB, Oracle, Sybase, WebSphere 4.0 Application Server, Log4j, VSS, Windows NT
Confidential
Big Data Engineer
Responsibilities:
- Built a real-time streaming and ingestion platform for publishing and subscribing to events.
- Worked on several use cases such as AutoPay, Flexloan, ATM debit card requests, Salesforce integration (address updates), Zelle email address and phone number updates, and negative data sharing.
- Initiated data governance for real-time events by designing and implementing a Schema Registry (Avro format).
- Handled EAP (Enterprise Application Platform) maintenance and real-time and batch data persistence.
- Implemented tokenization using DTAAS and Previtaar to support REST-level, file-level, and field-level encryption.
- Built Kafka Connect integrations: AWS S3 Sink Connect, Salesforce Connect, Splunk Connect, and HDFS Connect.
- Integrated CMP (Marketplace) with the Kafka Admin API. Developed Schema Registry APIs and implemented them in PROD. Applied message-level encryption for events at all levels.
- Developed Producer and Consumer SDKs and published the libraries to JFrog Artifactory. Connected to Kafka servers using SSL and SASL connectivity (a minimal connectivity sketch follows this list).
- Persisted data in EAP (HDFS), created Hive tables, and stored the data in HBase. Persisted data in S3 using Kinesis Firehose, VPC, EMR, Lambda, and CloudWatch.
- Handled MongoDB persistence and maintenance. Monitored Kafka performance and health checks using AppDynamics.
- Performed log and info monitoring using Splunk. Built real-time dashboards using Grafana.
- Onboarded APIM, PSG, OAuth, SSO, and channel IDs for the applications and APIs deployed in PCF. Completed a POC for Spark Streaming integration with Kafka.
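A minimal sketch of the SSL/SASL Kafka connectivity referenced above, assuming the kafka-python client. The broker, credentials, certificate path, topic, and payload are placeholders, and the real SDK would serialize events as Avro against the Schema Registry rather than JSON.

```python
# Minimal sketch of SASL_SSL connectivity for an event producer, assuming
# kafka-python; broker, credentials, topic, and payload are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9093"],
    security_protocol="SASL_SSL",
    sasl_mechanism="PLAIN",
    sasl_plain_username="event-publisher",       # placeholder credentials
    sasl_plain_password="changeme",
    ssl_cafile="/etc/kafka/certs/ca.pem",        # CA bundle for the broker certificate
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Publish one event; the production SDK would encode this payload as Avro
# against a Schema Registry subject instead of plain JSON.
producer.send("customer-events", {"event_type": "address_update", "account_id": "12345"})
producer.flush()
```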