
Senior Data Engineer Resume


Fort Worth, TX

SUMMARY

  • 7+ years of experience in Data Engineering, Data Pipeline Design, Development and Implementation as a Sr. Data Engineer/Data Developer and Data Modeler.
  • Well versed with the Hadoop framework and with Analysis, Design, Development, Documentation, Deployment and Integration using SQL and Big Data technologies.
  • Experience in using different Hadoop ecosystem components such as HDFS, YARN, MapReduce, Spark, Sqoop, Hive and Kafka.
  • Experience with Data Warehousing and Data Mining using NoSQL databases such as HBase, Cassandra and MongoDB.
  • Designed BI applications in Tableau, QlikView, SSIS, SSRS, SSAS, OBIEE, Cognos and Informatica.
  • Part of the Agile BI/ETL team; attended regular user meetings to review requirements for the Data/BI sprints and created highly visible data flows, dashboards and reports based on user stories.
  • Experience in using Sqoop to ingest data from RDBMS to HDFS.
  • Experience in cluster coordination using ZooKeeper; worked with file formats such as Text, ORC, Avro and Parquet and compression codecs such as Gzip and Zlib.
  • Experienced in using various Python libraries such as NumPy, SciPy, python-twitter and Pandas.
  • Worked on visualization tools like Tableau for report creation and further analysis.
  • Experienced with the Spark processing framework, including Spark SQL, as well as data warehousing and ETL processes.
  • Developed end-to-end ETL pipelines using Spark SQL and Scala on the Spark engine; imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
  • Experience with Spark Streaming and writing Spark jobs.
  • Experience developing high throughput streaming applications from Kafka queues and writing enriched data back to outbound Kafka queues.
  • Experience in transferring data with Sqoop between HDFS and relational database systems (Oracle, DB2 and SQL Server) in both directions.
  • Good understanding of AWS S3, EC2, Kinesis and DynamoDB.
  • Used RStudio for data pre-processing and building machine learning algorithms on datasets.
  • Good knowledge of NLP, statistical models, machine learning and data mining solutions to various business problems using R and Python.
  • Experienced in real-time analytics with Spark RDD, Data Frames and Streaming API.
  • Used Spark Data Frame API over Cloudera platform to perform analytics on Hive data.
  • Knowledge of integrating data from various sources such as RDBMS, spreadsheets and text files.
  • Good understanding of JIRA and experience maintaining JIRA dashboards.
  • Knowledge of Java IDEs such as Eclipse and IntelliJ.
  • Used Maven for building projects.
  • Ability to work independently as well as in a team and able to effectively communicate with customers, peers and management at all levels in and outside the organization.
  • Hands on experience on Hortonworks and Cloudera Hadoop environments.
  • Provided production support and involved with root cause analysis, bug fixing and promptly updating the business users on day-to-day production issues.
  • Developed ad-hoc queries for moving data from HDFS to Hive and analyzing the data using HiveQL.
  • Involved in daily SCRUM meetings to discuss the development/progress of Sprints and was active in making scrum meetings more productive.
  • Developed data pipelines using the ETL tools SQL Server Integration Services (SSIS) and Microsoft Visual Studio (SSDT).
  • Experience in designing visualizations and storylines using Tableau and in publishing and presenting dashboards.
  • Expert in building enterprise data warehouses and data warehouse appliances from scratch using both Kimball's and Inmon's approaches.
  • Developed Spark applications in Python (PySpark) in a distributed environment to load large numbers of CSV files with differing schemas into Hive ORC tables (see the sketch after this list).
  • Experience in maintaining an Apache Tomcat, MySQL, LDAP and web service environment.
  • Designed ETL workflows in Tableau and deployed data from various sources to HDFS.
  • Good experience with use-case development and software methodologies such as Agile and Waterfall.
  • Active team player with excellent interpersonal skills; keen learner with self-commitment and innovation.
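
The bullet above about loading differently-shaped CSV files into Hive ORC tables could look roughly like the following PySpark sketch. It is a minimal illustration, not the original project code: the landing path, database and table names are hypothetical, and in practice each source's schema differences would be reconciled before a single write.

    # csv_to_hive_orc.py - minimal sketch of loading CSV files into a Hive ORC table.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("csv-to-hive-orc")
             .enableHiveSupport()          # required to write managed Hive tables
             .getOrCreate())

    # Read every CSV under the (hypothetical) landing directory.
    df = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("hdfs:///data/landing/csv/"))

    # Normalize column names, then append into an ORC-backed Hive table.
    df = df.toDF(*[c.strip().lower().replace(" ", "_") for c in df.columns])
    (df.write
       .mode("append")
       .format("orc")
       .saveAsTable("analytics.learner_events"))   # hypothetical target table

    spark.stop()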

TECHNICAL SKILLS

BigData/Hadoop Technologies: MapReduce, Spark, Spark SQL, Azure, Spark Streaming, Kafka, PySpark, Pig, Hive, HBase, Flume, YARN, Oozie, ZooKeeper, Hue, Ambari Server

Languages: HTML5, DHTML, WSDL, CSS3, C, C++, XML, R/RStudio, SAS Enterprise Guide, SAS, R (Caret, Weka, ggplot), Perl, MATLAB, Mathematica, FORTRAN, DTD, Schemas, JSON, Ajax, Java, Scala, Python (NumPy, SciPy, Pandas, Gensim, Keras), JavaScript, Shell Scripting

NOSQL Databases: Cassandra, HBase, MongoDB, MariaDB

Web Design Tools: HTML, CSS, JavaScript, JSP, jQuery, XML

Development Tools: Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans.

Public Cloud: EC2, IAM, S3, Autoscaling, CloudWatch, Route53, EMR, RedShift

Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall

Build Tools: Jenkins, Toad, SQL Loader, PostgreSQL, Talend, Maven, ANT, RTC, RSA, Control-M, Oozie, Hue, SOAP UI

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos.

Databases: MS SQL Server, MySQL, Oracle, DB2, Teradata, Netezza

Operating Systems: All versions of Windows, UNIX, LINUX, Macintosh HD, Sun Solaris

PROFESSIONAL EXPERIENCE

Confidential, Fort Worth, TX

Senior Data Engineer

Responsibilities:

  • Performed data analysis and developed analytic solutions; investigated data to discover correlations and trends and explain them.
  • Worked with Data Engineers and Data Architects to define back-end requirements for data products (aggregations, materialized views, tables, visualizations).
  • Developed frameworks and processes to analyze unstructured information. Assisted in Azure Power BI architecture design.
  • Conducted performance analysis, optimized data processes and made recommendations for continuous improvement of the data processing environment.
  • Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Designed and implemented multiple ETL solutions across various data sources with extensive SQL scripting, ETL tools, Python, shell scripting and scheduling tools; performed data profiling and data wrangling of XML, web feeds and files using Python, Unix and SQL.
  • Loaded data from different sources into a data warehouse and performed data aggregations for business intelligence using Python.
  • Worked on different data formats such as JSON, XML.
  • Experience in cloud versioning technologies like GitHub.
  • Experienced with machine learning algorithms such as logistic regression, random forest, XGBoost, KNN, SVM, neural networks, linear regression, lasso regression and k-means.
  • Implemented statistical and deep learning models (logistic regression, XGBoost, random forest, SVM, RNN, CNN).
  • Designing and Developing Oracle PL/SQL and Shell Scripts, Data Import/Export, Data Conversions and Data Cleansing.
  • Architected and implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics).
  • Performed data analysis and statistical analysis, and generated reports, listings and graphs using SAS tools: SAS/GRAPH, SAS/SQL, SAS/CONNECT and SAS/ACCESS.
  • Developed Spark applications using Scala and Spark SQL for data extraction, transformation and aggregation from multiple file formats, using Kafka integrated with Spark Streaming. Developed data analysis tools using SQL and Python code.
  • Authored Python (PySpark) scripts for custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling and all cleansing and conforming tasks. Migrated data from on-premises to AWS storage buckets.
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and AWS Lambda functions in Java to perform event-driven processing.
  • Developed Python scripts to transfer and extract data from on-premises systems and REST APIs to AWS S3. Implemented a microservices-based cloud architecture using Spring Boot.
  • Worked on ingesting data through cleansing and transformations, leveraging AWS Lambda, AWS Glue and Step Functions.
  • Created YAML files for each data source, including Glue table stack creation. Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
  • Developed Lambda functions and assigned IAM roles to run Python scripts along with various triggers (SQS, EventBridge, SNS).
  • Wrote UNIX shell scripts to automate jobs and scheduled cron jobs for job automation using Crontab. Created a Lambda deployment function and configured it to receive events from S3 buckets (see the sketch after this list).
  • Built machine learning models, including SVM, random forest and XGBoost, to score and identify potential new business cases with Python scikit-learn.
  • Experience in converting existing AWS infrastructure to serverless architecture (AWS Lambda, Kinesis), deploying via Terraform and AWS CloudFormation templates.
  • Worked on Docker container snapshots, attaching to a running container, removing images, managing directory structures and managing containers.
  • Used Spark Streaming to receive real-time data from Kafka and stored the stream data in HDFS using Python and NoSQL databases such as HBase and Cassandra.
  • Experienced in day-to-day DBA activities including schema management, user management (creating users, synonyms, privileges, roles, quotas, tables, indexes, sequences), space management (tablespaces, rollback segments), monitoring (alert log, memory, disk I/O, CPU, database connectivity), scheduling jobs and UNIX shell scripting.
  • Expertise in using Docker to run and deploy applications in multiple containers with Docker Swarm and Weave.
  • Developed complex Talend ETL jobs to migrate data from flat files to databases. Developed Talend ESB services and deployed them on ESB servers on different instances.
  • Architected and designed serverless application CI/CD using the AWS Serverless Application Model (Lambda).
  • Developed stored procedures and views in Snowflake and used them in Talend for loading dimensions and facts.
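
As a concrete illustration of the S3-triggered Lambda work described above, the handler below shows the general shape of such a function. It is a hedged sketch, not the production code: the 'processed/' destination prefix and the copy step are placeholders for whatever downstream handoff (Glue, Step Functions) the real pipeline used.

    # lambda_s3_handler.py - sketch of a Lambda function that receives S3 events.
    import json
    import urllib.parse

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        """Copy each newly created S3 object under a 'processed/' prefix."""
        records = event.get("Records", [])
        for record in records:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

            # Placeholder action: a real pipeline would transform the object
            # or hand it off to Glue / Step Functions here.
            s3.copy_object(Bucket=bucket,
                           CopySource={"Bucket": bucket, "Key": key},
                           Key=f"processed/{key}")

        return {"statusCode": 200,
                "body": json.dumps({"processed": len(records)})}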

Environment: Hadoop, MapReduce, HDFS, Hive, Spring Boot, Cassandra, Swamp, Data Lake, Sqoop, Oozie, PL/SQL, SQL, Kafka, Spark, Scala, Java, AWS, Azure, GitHub, Talend Big Data Integration, Solr, Impala, Unix, Shell Scripting.

Confidential, Bentonville, AR

Senior Data Engineer

Responsibilities:

  • Familiar with Hive joins; used HiveQL for querying the databases, eventually leading to complex Hive UDFs.
  • Installed the OS and administered the Hadoop stack with the CDH5 (with YARN) Cloudera distribution, including configuration management, monitoring, debugging and performance tuning.
  • Worked on installing the cluster, commissioning and decommissioning of Data Nodes, Name Node recovery, capacity planning and slots configuration.
  • Worked on installing Cloudera Manager and CDH, installing the JCE policy files, creating a Kerberos principal for the Cloudera Manager server and enabling Kerberos using the wizard.
  • Leveraged Chef to manage and maintain builds in various environments and planned for hardware and software installation on production cluster and communicated with multiple teams to get it done.
  • Conducted Exploratory Data Analysis using Python Matplotlib and Seaborn to identify underlying patterns and correlation between features.
  • Worked with NoSQL databases like HBase in creating tables to load large sets of semi structured data coming from source systems.
  • Worked on Configuring Kerberos Authentication in the cluster.
  • Experience in creating, dropping and altering tables at run time without blocking updates and queries in HBase and Hive.
  • Experience in working with different join patterns and implemented both Map and Reduce Side Joins.
  • Wrote Flume configuration files for importing streaming log data into HBase with Flume.
  • Imported several transactional logs from web servers with Flume to ingest the data into HDFS.
  • Used Flume with a spooling directory source to load data from the local file system (LFS) into HDFS.
  • Installed and configured Pig and wrote Pig Latin scripts to convert data from text files to Avro format.
  • Created partitioned Hive tables and worked on them using HiveQL.
  • Loaded data into HBase using both bulk and non-bulk loads.
  • Worked with the continuous integration tool Jenkins and automated end-of-day jar builds.
  • Worked with Tableau and Integrated Hive, Tableau Desktop reports and published to Tableau Server.
  • Developed MapReduce programs in Java for parsing the raw data and populating staging Tables.
  • Experience in setting up the whole app stack and in setting up and debugging Logstash to send Apache logs to AWS Elasticsearch.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
  • Analyzed the SQL scripts and designed the solution to implement using Scala.
  • Used Spark SQL to load JSON data, create schemas and load it into Hive tables, and handled structured data using Spark SQL.
  • Implemented Spark Scripts using Scala, Spark SQL to access hive tables into Spark for faster processing of data.
  • Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Tested Apache Tez for building high performance batch and interactive data processing applications on Pig and Hive jobs.
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, Scala, DataFrames, Impala, OpenShift, Talend and pair RDDs.
  • Set up data pipelines using TDCH, Talend, Sqoop and PySpark based on the size of the data loads.
  • Implemented real-time analytics on Cassandra data using the Thrift API.
  • Designed column families in Cassandra, ingested data from RDBMS, performed transformations and exported the data to Cassandra.
  • Led the testing efforts in support of projects/programs across a large landscape of technologies (Unix, AngularJS, AWS, Sauce Labs, Cucumber JVM, MongoDB, GitHub, Bitbucket, SQL, NoSQL databases, APIs, Java, Jenkins).
  • Experience in using MapR File system, Ambari, Cloudera Manager for installation and management of Hadoop Cluster.
  • Worked on writing Scala Programs using Spark/Spark-SQL in performing aggregations.
  • Developed Web Services in play framework using Scala in building stream data Platform.
  • Worked with data modelers to understand financial data model and provided suggestions to the logical and physical data model.
  • Performed table partitioning and monthly and yearly data archival activities.
  • Developed Python scripts for collecting Redshift CloudWatch metrics and automated loading the data points into a Redshift database.
  • Developed scripts for loading application call logs to S3 and used AWS Glue ETL to load them into Redshift for the data analytics team.
  • Installing IBM Http Server, WebSphere Plugins and WebSphere Application Server Network Deployment (ND).
  • Worked on setting up high availability for major production cluster and designed automatic failover control using zookeeper and quorum journal nodes.
  • Provided troubleshooting and best-practice methodology for development teams, including process automation and new application onboarding.
  • Produce unit tests for Spark transformations and helper methods. Design data processing pipelines.
  • Configuring IBM Http Server, WebSphere Plugins and WebSphere Application Server Network Deployment (ND) for user work-load distribution.
  • Wrote multiple batch jobs for processing hourly and daily data received from multiple sources such as Adobe and NoSQL databases.
  • Testing the processed data through various test cases to meet the business requirements.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames and saved the data in Parquet format in HDFS (see the sketch after this list).
  • Designed data solutions for the Enterprise Data Warehouse using ETL and ELT methodologies.
  • Interacted with business stakeholders from various teams such as Finance, Marketing and e-Commerce to understand their analytical and business needs, define metrics and translate them into BI solutions.
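
The Kafka-to-Parquet flow referenced above can be sketched with Spark Structured Streaming as shown below. This is an assumption-laden illustration rather than the original job: the broker address, topic name, schema and output paths are hypothetical, and the spark-sql-kafka connector package must be supplied at submit time.

    # kafka_to_parquet.py - sketch of streaming Kafka events into Parquet on HDFS.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
           .option("subscribe", "clickstream")                 # hypothetical topic
           .load())

    # Kafka values arrive as bytes; cast to string and parse the JSON payload.
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
                 .select(from_json(col("json"), event_schema).alias("e"))
                 .select("e.*"))

    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/events/parquet")       # output directory
             .option("checkpointLocation", "hdfs:///chk/events")  # required for recovery
             .start())

    query.awaitTermination()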

Environment: Cloudera CDH 5.13, Ambari, IBM WebSphere, Hive, Python, HBase, Spark, Scala, MapReduce, HDFS, Sqoop, AWS, Flume, Linux, Shell Scripting, Tableau, UNIX, Kafka, SQL, NoSQL.

Confidential, Plano, TX

Big Data Engineer

Responsibilities:

  • Responsibilities include gathering business requirements, developing strategy for data cleansing and data migration, writing functional and technical specifications, creating source to target mapping, designing data profiling and data validation jobs in Informatica, and creating ETL jobs in Informatica.
  • Worked on Hadoop cluster which ranged from 4-8 nodes during pre-production stage and it was sometimes extended up to 24 nodes during production.
  • Built APIs that will allow customer service representatives to access the data and answer queries.
  • Designed changes to transform current Hadoop jobs to HBase.
  • Handled fixing of defects efficiently and worked with the QA and BA team for clarifications.
  • Responsible for Cluster maintenance, Monitoring, commissioning and decommissioning Data nodes, Troubleshooting, Manage and review data backups, Manage & review log files.
  • Extended the functionality of Hive with custom UDFs and UDAFs.
  • The new Business Data Warehouse (BDW) improved query/report performance, reduced the time needed to develop reports and established self-service reporting model in Cognos for business users.
  • Implemented bucketing and partitioning using Hive to assist users with data analysis.
  • Used Oozie scripts for deployment of the application and Perforce as the secure versioning software.
  • Implemented Partitioning, Dynamic Partitions, Buckets in HIVE.
  • Develop database management systems for easy access, storage, and retrieval of data.
  • Perform DB activities such as indexing, performance tuning, and backup and restore.
  • Expertise in writing Hadoop Jobs for analyzing data using Hive QL (Queries), Pig Latin (Data flow language), and custom MapReduce programs in Java.
  • Performed various performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
  • Expert in creating Hive UDFs using Java to analyze the data efficiently.
  • Responsible for loading the data from BDW Oracle database, Teradata into HDFS using Sqoop.
  • Implemented AJAX, JSON and JavaScript to create interactive web screens.
  • Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as MongoDB (see the sketch after this list). Involved in loading and transforming large sets of structured, semi-structured and unstructured data and analyzed them by running Hive queries. Processed image data through the Hadoop distributed system using Map and Reduce and stored it in HDFS.
  • Created Session Beans and controller Servlets for handling HTTP requests from Talend.
  • Performed data visualization and designed dashboards with Tableau, and generated complex reports including charts, summaries and graphs to interpret the findings for the team and stakeholders.
  • Wrote documentation for each report including purpose, data source, column mapping, transformation, and user group.
  • Utilized Waterfall methodology for team and project management
  • Used Git for version control with the Data Engineer team and Data Scientist colleagues. Created Tableau dashboards using stacked bars, bar graphs, scatter plots, geographical maps, Gantt charts, etc. using the Show Me functionality, and built dashboards and stories as needed using Tableau Desktop and Tableau Server.
  • Responsible for daily communications to management and internal organizations regarding status of all assigned projects and tasks.
  • Executed quantitative analysis on chemical products to recommend effective combinations
  • Performed statistical analysis using SQL, Python, R Programming and Excel.
  • Worked extensively with Excel VBA Macros, Microsoft Access Forms
  • Import, clean, filter and analyze data using tools such as SQL, HIVE and PIG.
  • Used Python and SAS to extract, transform and load source data from transaction systems and generated reports, insights and key conclusions.
  • Manipulated and summarized data to maximize possible outcomes efficiently
  • Developed storytelling dashboards in Tableau Desktop and published them to Tableau Server, allowing end users to understand the data on the fly through quick filters for on-demand information.
  • Analyzed and recommended improvements for better data consistency and efficiency.
  • Designed and developed data mapping procedures and the ETL data extraction, data analysis and loading process for integrating data using R programming.
  • Effectively communicated plans, project status, project risks and project metrics to the project team, and planned test strategies in accordance with project scope.
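
The RDBMS-to-MongoDB ingestion mentioned above might look like the following batch-copy sketch. The connection strings, table and collection names are hypothetical, and cx_Oracle stands in for whichever DB-API driver (Oracle, Teradata, etc.) the source actually required.

    # rdbms_to_mongo.py - sketch of copying rows from an RDBMS into MongoDB in batches.
    import cx_Oracle
    import pymongo

    BATCH_SIZE = 1000

    ora = cx_Oracle.connect("app_user", "secret", "db-host/ORCLPDB1")  # hypothetical DSN
    mongo = pymongo.MongoClient("mongodb://mongo-host:27017")          # hypothetical URI
    target = mongo["warehouse"]["customers"]                           # hypothetical collection

    cursor = ora.cursor()
    cursor.execute("SELECT customer_id, name, region FROM customers")  # hypothetical table
    columns = [d[0].lower() for d in cursor.description]

    while True:
        rows = cursor.fetchmany(BATCH_SIZE)
        if not rows:
            break
        # Turn each row tuple into a document keyed by column name and bulk-insert it.
        target.insert_many([dict(zip(columns, row)) for row in rows])

    cursor.close()
    ora.close()
    mongo.close()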

Environment: Cloudera CDH4.3, Hadoop, Pig, Hive, Informatica, HBase, MapReduce, HDFS, Sqoop, Impala, SQL, Tableau, Python, SAS, Flume, Oozie, Linux

Confidential, Charlotte, NC

Hadoop Developer

Responsibilities:

  • Responsible for data extraction and data ingestion from different data sources into Hadoop Data Lake by creating ETL pipelines using Pig, and Hive.
  • Responsible for importing data to HDFS using Sqoop from different RDBMS servers and exporting data using Sqoop to the RDBMS servers after aggregations for other ETL operations.
  • Experience in designing and developing applications in PySpark (Python) to compare the performance of Spark with Hive (see the sketch after this list).
  • Headed negotiations to find optimal solutions with project teams and clients.
  • Mapped client business requirements to internal requirements of trading platform products
  • Supported revenue management using statistical and quantitative analysis, developed several statistical approaches and optimization models.
  • Led the business analysis team of four members, in absence of the Team Lead.
  • Added value by providing innovative solutions and delivering improved upon methods of data presentation by focusing on the Business need and the Business Value of the solution. Worked for Internet Marketing - Paid Search channels.
  • Created performance dashboards in Tableau, Excel and PowerPoint for the key stakeholders.
  • Incorporated predictive modeling (rule engine) to evaluate the Customer/Seller health score using python scripts, performed computations and integrated with the Tableau viz.
  • Worked with stakeholders to communicate campaign results, strategy, issues or needs.
  • Analyzed marketing campaigns from various perspectives including CTR, conversion rates, seasonal/geographical trends, search queries, landing page, conversion funnel, quality score, competitors, distribution channel, etc. to achieve maximum ROI for clients.
  • Worked with business to identify the gaps in mobile tracking and come up with the solution to solve.
  • Analyzed click events of Hybrid landing page which includes bounce rate, conversion rate, Jump back rate, List/Gallery view, etc. and provide valuable information for landing page optimization.
  • Implemented Partitioning, Dynamic Partitions and Buckets in HIVE for efficient data access.
  • Created and modified shell scripts for scheduling various data cleansing scripts and the ETL load process.
  • Developed testing scripts in Python, prepared test procedures, analyzed test result data and suggested improvements to the system and software.
  • Used Jira for ticketing and tracking issues and Jenkins for continuous integration and continuous deployment.
  • The GUI prompts the user to enter personal information, charity items to donate, and delivery options.
  • Developed a fully functioning C# program that connects to SQL Server Management Studio and integrates information users enter with preexisting information in the database.
  • Implemented SQL functions to receive user information from front end C# GUIs and store it into database.
  • Utilized SQL functions to select information from database and send it to the front end upon user request.
  • Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive and then loading the data into HDFS.
  • Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
  • Collected and aggregated large amounts of log data and staged the data in HDFS for further analysis.
  • Experience in managing and reviewing Hadoop Log files.
  • Used Sqoop to transfer data between relational databases and Hadoop.
  • Worked with HDFS to store and access huge datasets within Hadoop. Good hands-on experience with GitHub.
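
For the Spark-versus-Hive comparison noted above, a simple approach is to run the same aggregation through Spark SQL against the Hive metastore and time it, as in the sketch below. The table, columns and filter are hypothetical placeholders.

    # spark_vs_hive.py - sketch of timing a Hive-style aggregation in Spark SQL.
    import time

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("spark-vs-hive-benchmark")
             .enableHiveSupport()        # read the same tables Hive uses
             .getOrCreate())

    query = """
        SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
        FROM sales.orders
        WHERE order_date >= '2020-01-01'
        GROUP BY region
    """

    start = time.time()
    spark.sql(query).show()              # same statement previously run in Hive
    print(f"Spark SQL elapsed: {time.time() - start:.1f}s")

    spark.stop()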

Environment: Spark, Java, Python, Jenkins, HDFS, Sqoop, Hadoop 2.0, Kafka, JSON, Hive

Confidential, Irving, TX

Data Analyst

Responsibilities:

  • Worked on analyzing Hadoop cluster and different big data analytic tools including Hive and Sqoop.
  • Developed data pipelines using Sqoop and MapReduce to ingest current and historical data into the data staging area.
  • Responsible for defining data flow in Hadoop ecosystem to different teams.
  • Wrote Pig scripts for data cleansing and data transformation as ETL tool before loading in HDFS.
  • Worked on importing normalized data from the staging area to HDFS using Sqoop and performed analysis using Hive Query Language (HQL).
  • Create Managed tables and External tables in Hive and load data from HDFS.
  • Performed query optimization for HiveQL and denormalized Hive tables to increase speed of data retrieval.
  • Transferred analyzed data from HDFS to BI team for visualization and to data scientist team for predictive modelling.
  • Experience in scheduling workflows using Autosys.
  • Experience in running Hive queries on Spark execution engine.
  • Designed the whole SDLC of the project along with high-level and detailed design plans.
  • Created different SAS reports such as bar charts, tabular reports and cross-tab reports using SAS Web Report Studio, and created pages and portlets in the SAS Information Delivery Portal.
  • Published the reports in the SAS Information Delivery Portal and gave access to different groups of users.
  • Improved the project quality in terms of performance and the related documentation.
  • Performed Impact Analysis of the changes done to the existing mappings and provided the feedback.
  • Participated in providing the project estimates for development team efforts for the offshore as well as on-site.
  • Used PySpark and Pandas to calculate the moving average and RSI score of the stocks and loaded the results into the data warehouse (see the sketch after this list).
  • Coordinated and monitored the project progress to ensure the timely flow and complete delivery of the project.
  • Demonstrable experience designing and implementing complex applications and distributed systems on public cloud infrastructure (AWS, GCP, Azure, etc.).
  • Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Developed workflow in Oozie to manage and schedule jobs on Hadoop cluster to trigger daily, weekly and monthly batch cycles.
  • Configured Hadoop tools like Hive, Pig, Zookeeper, Flume, Impala and Sqoop.
  • Wrote Hive Queries for analyzing data in Hive warehouse using Hive Query Language (HQL).
  • Queried both Managed and External tables created by Hive using Impala.
  • Recreated existing application logic and functionality in the Azure Data Lake, Data Factory, SQL Database and SQL Data Warehouse environment.
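
The moving-average and RSI calculation referenced above is sketched below with pandas. The column name, window size and sample prices are assumptions; the PySpark variant would apply the same arithmetic per stock symbol.

    # stock_indicators.py - sketch of computing a simple moving average and RSI.
    import pandas as pd

    def add_indicators(prices: pd.DataFrame, window: int = 14) -> pd.DataFrame:
        """Add 'sma' and 'rsi' columns to a frame with a 'close' price column."""
        out = prices.copy()
        out["sma"] = out["close"].rolling(window).mean()

        # RSI: ratio of average gain to average loss over the window, scaled to 0-100.
        delta = out["close"].diff()
        gain = delta.clip(lower=0).rolling(window).mean()
        loss = (-delta.clip(upper=0)).rolling(window).mean()
        out["rsi"] = 100 - 100 / (1 + gain / loss)
        return out

    if __name__ == "__main__":
        closes = [10.0, 10.5, 10.2, 10.8, 11.1, 10.9, 11.4, 11.2,
                  11.6, 11.9, 12.1, 11.8, 12.3, 12.5, 12.2]
        print(add_indicators(pd.DataFrame({"close": closes})).tail())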

Environment: Linux, MapReduce, YARN, Spark 1.4.1, Eclipse, Core Java, Oozie Workflows, AWS, S3, EMR, Cloudera, HBase, Sqoop, Scala, Kafka, Python, Cassandra, Maven, Hortonworks, SAS, SQL, DataStage, Autosys, Oracle, Informix
